Solr 4.8.0 Source Code Analysis (6): Non-Sorted Queries

The previous article gave a brief overview of Solr's query flow; this article starts digging into the details. Queries fall into two categories, sorted and non-sorted, and the two take separate code paths, so this article covers the non-sorted path first.

The query flow is driven by SolrIndexSearcher.getDocListC(QueryResult qr, QueryCommand cmd). As the trailing "C" suggests, this function consults the queryResultCache, and based on the query conditions it chooses between the sorted and the non-sorted branch.

    /**
     * getDocList version that uses+populates query and filter caches.
     * In the event of a timeout, the cache is not populated.
     */
    private void getDocListC(QueryResult qr, QueryCommand cmd) throws IOException {
      DocListAndSet out = new DocListAndSet();
      qr.setDocListAndSet(out);
      QueryResultKey key = null;
      // For a query with an offset, Solr first fetches cmd.getOffset()+cmd.getLen() doc ids
      // and then takes a subset based on the offset, so maxDocRequested is the number of
      // docs actually fetched.
      int maxDocRequested = cmd.getOffset() + cmd.getLen();
      // check for overflow, and check for # docs in index
      if (maxDocRequested < 0 || maxDocRequested > maxDoc()) maxDocRequested = maxDoc(); // at most, fetch every doc id
      int supersetMaxDoc = maxDocRequested;
      DocList superset = null;

      int flags = cmd.getFlags();
      Query q = cmd.getQuery();
      if (q instanceof ExtendedQuery) {
        ExtendedQuery eq = (ExtendedQuery)q;
        if (!eq.getCache()) {
          flags |= (NO_CHECK_QCACHE | NO_SET_QCACHE | NO_CHECK_FILTERCACHE);
        }
      }

      // we can try and look up the complete query in the cache.
      // we can't do that if filter!=null though (we don't want to
      // do hashCode() and equals() for a big DocSet).
      // First look the query up in the query result cache; on a hit the cached result
      // is returned. Caching will get an article of its own.
      if (queryResultCache != null && cmd.getFilter() == null
          && (flags & (NO_CHECK_QCACHE|NO_SET_QCACHE)) != ((NO_CHECK_QCACHE|NO_SET_QCACHE)))
      {
        // all of the current flags can be reused during warming,
        // so set all of them on the cache key.
        key = new QueryResultKey(q, cmd.getFilterList(), cmd.getSort(), flags);
        if ((flags & NO_CHECK_QCACHE) == 0) {
          superset = queryResultCache.get(key);

          if (superset != null) {
            // check that the cache entry has scores recorded if we need them
            if ((flags & GET_SCORES) == 0 || superset.hasScores()) {
              // NOTE: subset() returns null if the DocList has fewer docs than
              // requested
              out.docList = superset.subset(cmd.getOffset(), cmd.getLen()); // on a cache hit, slice out the requested subset
            }
          }
          if (out.docList != null) {
            // found the docList in the cache... now check if we need the docset too.
            // OPT: possible future optimization - if the doclist contains all the matches,
            // use it to make the docset instead of rerunning the query.
            // compute the docSet and hand it to the result
            if (out.docSet == null && ((flags & GET_DOCSET) != 0)) {
              if (cmd.getFilterList() == null) {
                out.docSet = getDocSet(cmd.getQuery());
              } else {
                List<Query> newList = new ArrayList<>(cmd.getFilterList().size()+1);
                newList.add(cmd.getQuery());
                newList.addAll(cmd.getFilterList());
                out.docSet = getDocSet(newList);
              }
            }
            return;
          }
        }

        // If we are going to generate the result, bump up to the
        // next resultWindowSize for better caching.
        // round supersetMaxDoc up to a multiple of queryResultWindowSize
        if ((flags & NO_SET_QCACHE) == 0) {
          // handle 0 special case as well as avoid idiv in the common case.
          if (maxDocRequested < queryResultWindowSize) {
            supersetMaxDoc = queryResultWindowSize;
          } else {
            supersetMaxDoc = ((maxDocRequested - 1) / queryResultWindowSize + 1) * queryResultWindowSize;
            if (supersetMaxDoc < 0) supersetMaxDoc = maxDocRequested;
          }
        } else {
          key = null; // we won't be caching the result
        }
      }
      cmd.setSupersetMaxDoc(supersetMaxDoc);

      // OK, so now we need to generate an answer.
      // One way to do that would be to check if we have an unordered list
      // of results for the base query. If so, we can apply the filters and then
      // sort by the resulting set. This can only be used if:
      // - the sort doesn't contain score
      // - we don't want score returned.

      // check if we should try and use the filter cache
      boolean useFilterCache = false;
      if ((flags & (GET_SCORES|NO_CHECK_FILTERCACHE)) == 0 && useFilterForSortedQuery && cmd.getSort() != null && filterCache != null) {
        useFilterCache = true;
        SortField[] sfields = cmd.getSort().getSort();
        for (SortField sf : sfields) {
          if (sf.getType() == SortField.Type.SCORE) {
            useFilterCache = false;
            break;
          }
        }
      }

      if (useFilterCache) {
        // now actually use the filter cache.
        // for large filters that match few documents, this may be
        // slower than simply re-executing the query.
        if (out.docSet == null) {
          out.docSet = getDocSet(cmd.getQuery(), cmd.getFilter());
          DocSet bigFilt = getDocSet(cmd.getFilterList());
          if (bigFilt != null) out.docSet = out.docSet.intersection(bigFilt);
        }
        // todo: there could be a sortDocSet that could take a list of
        // the filters instead of anding them first...
        // perhaps there should be a multi-docset-iterator
        sortDocSet(qr, cmd); // the sorted-query branch
      } else {
        // do it the normal way...
        if ((flags & GET_DOCSET) != 0) {
          // this currently conflates returning the docset for the base query vs
          // the base query and all filters.
          DocSet qDocSet = getDocListAndSetNC(qr, cmd);
          // cache the docSet matching the query w/o filtering
          if (qDocSet != null && filterCache != null && !qr.isPartialResults()) filterCache.put(cmd.getQuery(), qDocSet);
        } else {
          getDocListNC(qr, cmd); // the non-sorted branch -- the subject of this article
        }
        assert null != out.docList : "docList is null";
      }

      if (null == cmd.getCursorMark()) {
        // Kludge...
        // we can't use DocSlice.subset, even though it should be an identity op
        // because it gets confused by situations where there are lots of matches, but
        // less docs in the slice then were requested, (due to the cursor)
        // so we have to short circuit the call.
        // None of which is really a problem since we can't use caching with
        // cursors anyway, but it still looks weird to have to special case this
        // behavior based on this condition - hence the long explanation.
        superset = out.docList; // slice the final result out of the superset by offset and len
        out.docList = superset.subset(cmd.getOffset(), cmd.getLen());
      } else {
        // sanity check our cursor assumptions
        assert null == superset : "cursor: superset isn't null";
        assert 0 == cmd.getOffset() : "cursor: command offset mismatch";
        assert 0 == out.docList.offset() : "cursor: docList offset mismatch";
        assert cmd.getLen() >= supersetMaxDoc : "cursor: superset len mismatch: " +
            cmd.getLen() + " vs " + supersetMaxDoc;
      }

      // lastly, put the superset in the cache if the size is less than or equal
      // to queryResultMaxDocsCached
      if (key != null && superset.size() <= queryResultMaxDocsCached && !qr.isPartialResults()) {
        queryResultCache.put(key, superset); // cache the superset when it is small enough
      }
    }
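
To see what the window rounding buys, here is a worked example with hypothetical values (queryResultWindowSize is 20 in the stock solrconfig.xml):

    // worked example of the rounding above: start=95, rows=10
    int offset = 95, len = 10, queryResultWindowSize = 20;
    int maxDocRequested = offset + len;  // 105 docs are actually needed
    int supersetMaxDoc = ((maxDocRequested - 1) / queryResultWindowSize + 1)
        * queryResultWindowSize;         // rounded up to 120
    // Solr collects and caches a 120-doc superset but returns only docs 95..104;
    // a follow-up page at start=105 can then be served straight from the cache.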

Now step into the non-sorted branch, getDocListNC(). Internally this function calls straight into Lucene's IndexSearcher.search():

    final TopDocsCollector topCollector = buildTopDocsCollector(len, cmd);
    // buildTopDocsCollector creates a HitQueue of size offset + len. Every time a doc
    // matching the query is found, its doc id is pushed into the HitQueue and totalHits
    // is incremented; totalHits is the total number of matches.
    Collector collector = topCollector;
    if (terminateEarly) {
      collector = new EarlyTerminatingCollector(collector, cmd.len);
    }
    if (timeAllowed > 0) {
      collector = new TimeLimitingCollector(collector, TimeLimitingCollector.getGlobalCounter(), timeAllowed);
      // TimeLimitingCollector works in a very simple way: the clock starts when the first
      // matching doc id is collected. Until timeAllowed elapses, matching doc ids keep
      // going into the HitQueue; the moment timeAllowed is reached, an exception is thrown
      // and the rest of the search is aborted. This is an important hint for query tuning.
    }
    if (pf.postFilter != null) {
      pf.postFilter.setLastDelegate(collector);
      collector = pf.postFilter;
    }
    try {
      // enter Lucene's IndexSearcher.search()
      super.search(query, luceneFilter, collector);
      if (collector instanceof DelegatingCollector) {
        ((DelegatingCollector)collector).finish();
      }
    }
    catch (TimeLimitingCollector.TimeExceededException x) {
      log.warn("Query: " + query + "; " + x.getMessage());
      qr.setPartialResults(true);
    }

    totalHits = topCollector.getTotalHits();        // the total hit count
    TopDocs topDocs = topCollector.topDocs(0, len); // pop the doc ids out of the HitQueue
    populateNextCursorMarkFromTopDocs(qr, cmd, topDocs);

    maxScore = totalHits > 0 ? topDocs.getMaxScore() : 0.0f;
    nDocsReturned = topDocs.scoreDocs.length;
    ids = new int[nDocsReturned];
    scores = (cmd.getFlags() & GET_SCORES) != 0 ? new float[nDocsReturned] : null;
    for (int i = 0; i < nDocsReturned; i++) {
      ScoreDoc scoreDoc = topDocs.scoreDocs[i];
      ids[i] = scoreDoc.doc;
      if (scores != null) scores[i] = scoreDoc.score;
    }
TimeLimitingCollector.collect() shows how hits are counted under a time limit: once timeAllowed has elapsed, an exception is thrown immediately, aborting the rest of the search.
    /**
     * Calls {@link Collector#collect(int)} on the decorated {@link Collector}
     * unless the allowed time has passed, in which case it throws an exception.
     *
     * @throws TimeExceededException
     *           if the time allowed has exceeded.
     */
    @Override
    public void collect(final int doc) throws IOException {
      final long time = clock.get();
      if (timeout < time) {
        if (greedy) {
          //System.out.println(this+" greedy: before failing, collecting doc: "+(docBase + doc)+" "+(time-t0));
          collector.collect(doc);
        }
        //System.out.println(this+" failing on: "+(docBase + doc)+" "+(time-t0));
        throw new TimeExceededException( timeout-t0, time-t0, docBase + doc );
      }
      //System.out.println(this+" collecting: "+(docBase + doc)+" "+(time-t0));
      collector.collect(doc);
    }
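
From the client side this mechanism is driven by the timeAllowed request parameter. A minimal SolrJ sketch (the URL and field name are placeholders, and exception handling is omitted):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery query = new SolrQuery("title:solr");
    query.setTimeAllowed(100); // give up collecting after 100 ms
    QueryResponse rsp = server.query(query);
    // when the limit was hit, the header carries partialResults=true and the
    // result list contains only what was collected before the timeout
    Object partial = rsp.getResponseHeader().get("partialResults");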

Next comes Lucene's search process:

1. First, a Weight object is created for each query clause, and all of the Weight objects are collected into ArrayList<Weight> weights. This step establishes each clause's weight, which feeds into the scoring process later (a small usage sketch follows the code below).

    public BooleanWeight(IndexSearcher searcher, boolean disableCoord)
        throws IOException {
      this.similarity = searcher.getSimilarity();
      this.disableCoord = disableCoord;
      weights = new ArrayList<>(clauses.size());
      for (int i = 0; i < clauses.size(); i++) {
        BooleanClause c = clauses.get(i);
        Weight w = c.getQuery().createWeight(searcher);
        weights.add(w);
        if (!c.isProhibited()) {
          maxCoord++;
        }
      }
    }
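
To make this concrete, here is a minimal sketch (the field names and the searcher variable are made up for illustration) of how a two-clause BooleanQuery ends up with one sub-Weight per clause; createNormalizedWeight is the public entry point that leads into the constructor above:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.Weight;

    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term("title", "solr")), BooleanClause.Occur.MUST);
    bq.add(new TermQuery(new Term("body", "search")), BooleanClause.Occur.MUST);
    // createWeight runs once per clause inside the BooleanWeight constructor,
    // so `weights` ends up holding one sub-Weight per clause
    Weight w = searcher.createNormalizedWeight(bq); // searcher: an open IndexSearcher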

2. Next, iterate over all segments and search them one after another for matching doc ids. Each AtomicReaderContext carries the segment's details, including docBase and numDocs; this information is very useful when implementing query optimizations (see the docBase sketch after the code below). Note that the collector here is the TopDocsCollector instance assigned in the code above.

    /**
     * Lower-level search API.
     *
     * <p>
     * {@link Collector#collect(int)} is called for every document. <br>
     *
     * <p>
     * NOTE: this method executes the searches on all given leaves exclusively.
     * To search across all the searchers leaves use {@link #leafContexts}.
     *
     * @param leaves
     *          the searchers leaves to execute the searches on
     * @param weight
     *          to match documents
     * @param collector
     *          to receive hits
     * @throws BooleanQuery.TooManyClauses If a query would exceed
     *         {@link BooleanQuery#getMaxClauseCount()} clauses.
     */
    protected void search(List<AtomicReaderContext> leaves, Weight weight, Collector collector)
        throws IOException {

      // TODO: should we make this
      // threaded...? the Collector could be sync'd?
      // always use single thread:
      for (AtomicReaderContext ctx : leaves) { // search each subreader
        try {
          collector.setNextReader(ctx);
        } catch (CollectionTerminatedException e) {
          // there is no doc of interest in this reader context
          // continue with the following leaf
          continue;
        }
        BulkScorer scorer = weight.bulkScorer(ctx, !collector.acceptsDocsOutOfOrder(), ctx.reader().getLiveDocs());
        if (scorer != null) {
          try {
            scorer.score(collector);
          } catch (CollectionTerminatedException e) {
            // collection was terminated prematurely
            // continue with the following leaf
          }
        }
      }
    }
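
Each segment numbers its documents from zero, so a collector has to add the segment's docBase to turn a segment-local doc id into an index-wide one. A minimal sketch, assuming an open DirectoryReader named reader:

    import org.apache.lucene.index.AtomicReaderContext;

    for (AtomicReaderContext ctx : reader.leaves()) {
      int docBase = ctx.docBase;             // global doc id of this segment's first doc
      int numDocs = ctx.reader().numDocs();  // number of live docs in this segment
      // a segment-local hit `localId` maps to the global id `docBase + localId`,
      // which is exactly the `docBase + doc` arithmetic seen in the collectors above
    }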

3. Weight.bulkScorer builds the scorer that evaluates the query clauses. Lucene's multi-clause query optimization is quite well done here: the clauses are ordered by term frequency, with low-frequency (rare) clauses first and high-frequency ones last, which greatly speeds up multi-clause queries. That optimization will be covered in detail in a later article; a sketch of the core idea follows.
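
The real ordering logic lives inside Lucene's conjunction scorers; the following is only an illustrative sketch of the idea, built on the cost() estimate (roughly the number of matching docs) that every scorer inherits from DocIdSetIterator:

    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.lucene.search.Scorer;

    // sketch: order the sub-scorers so the rarest (cheapest) clause drives the
    // iteration and the more common clauses only need to confirm its candidates
    static void sortByCost(Scorer[] subScorers) {
      Arrays.sort(subScorers, new Comparator<Scorer>() {
        @Override
        public int compare(Scorer a, Scorer b) {
          return Long.compare(a.cost(), b.cost()); // low term frequency first
        }
      });
    }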

4. Finally, scorer.score(collector) is where the actual matching takes place. Two functions inside Weight's default bulk scorer show how Lucene gathers hits.

    @Override
    public boolean score(Collector collector, int max) throws IOException {
      // TODO: this may be sort of weird, when we are
      // embedded in a BooleanScorer, because we are
      // called for every chunk of 2048 documents. But,
      // then, scorer is a FakeScorer in that case, so any
      // Collector doing something "interesting" in
      // setScorer will be forced to use BS2 anyways:
      collector.setScorer(scorer);
      if (max == DocIdSetIterator.NO_MORE_DOCS) {
        scoreAll(collector, scorer);
        return false;
      } else {
        int doc = scorer.docID();
        if (doc < 0) {
          doc = scorer.nextDoc();
        }
        return scoreRange(collector, scorer, doc, max);
      }
    }
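
For the chunked case, scoreRange collects hits only up to the max bound. Paraphrased rather than quoted verbatim, its core loop looks roughly like this:

    // paraphrased sketch: collect hits until the doc id reaches `end`, then report
    // whether the scorer is exhausted so the caller can schedule the next chunk
    static boolean scoreRange(Collector collector, Scorer scorer, int currentDoc, int end)
        throws IOException {
      while (currentDoc < end) {
        collector.collect(currentDoc);
        currentDoc = scorer.nextDoc();
      }
      return currentDoc != DocIdSetIterator.NO_MORE_DOCS;
    }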

In the unbounded case, Lucene keeps pulling matching docs out of the segment and feeding them to the collector (and thus into its HitQueue). Note that the parameter here is typed as Collector, the parent class of TopDocsCollector and friends, so scoreAll can deliver doc ids not just to a TopDocsCollector but to any other collector type as well.

    static void scoreAll(Collector collector, Scorer scorer) throws IOException {
      int doc;
      while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        collector.collect(doc);
      }
    }

Stepping into collector.collect(doc) shows how TopDocsCollector gathers doc ids, just as described earlier.

    @Override
    public void collect(int doc) throws IOException {
      float score = scorer.score();

      // This collector cannot handle these scores:
      assert score != Float.NEGATIVE_INFINITY;
      assert !Float.isNaN(score);

      totalHits++;
      if (score <= pqTop.score) {
        // Since docs are returned in-order (i.e., increasing doc Id), a document
        // with equal score to pqTop.score cannot compete since HitQueue favors
        // documents with lower doc Ids. Therefore reject those docs too.
        return;
      }
      pqTop.doc = doc + docBase;
      pqTop.score = score;
      pqTop = pq.updateTop();
    }
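
The `score <= pqTop.score` early exit works because the HitQueue is pre-filled with sentinel entries scored at negative infinity, so pqTop is never null and any real hit beats a sentinel. A self-contained toy (using java.util.PriorityQueue rather than Lucene's HitQueue) that demonstrates the same sentinel trick:

    import java.util.PriorityQueue;

    public class SentinelTopN {
      public static void main(String[] args) {
        int n = 2;
        PriorityQueue<Float> pq = new PriorityQueue<Float>();
        for (int i = 0; i < n; i++) pq.add(Float.NEGATIVE_INFINITY); // sentinels
        for (float score : new float[] {0.3f, 0.5f, 0.2f, 0.7f}) {
          if (score <= pq.peek()) continue; // same early reject as collect() above
          pq.poll();
          pq.add(score); // replace the weakest entry, like pq.updateTop()
        }
        System.out.println(pq); // keeps the top-2 scores: 0.5 and 0.7
      }
    }
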
Summary: this article walked through the non-sorted query flow in detail, touching mainly on the classes QueryComponent, SolrIndexSearcher, TimeLimitingCollector, TopDocsCollector, IndexSearcher, BulkScorer, and Weight. For reasons of length it did not cover how doc ids are actually read out of a segment, nor how multi-clause queries are implemented; both will be covered in detail in the next article, on multi-clause queries.
