solr查询工作原理深入内幕

1.什么是Lucene？

作为一个开放源代码项目，Lucene从问世之后，引发了开放源代码社群的巨大反响，程序员们不仅使用它构建具体的全文检索应用，而且将之集成到各种系统软件中去，以及构建Web应用，甚至某些商业软件也采用了Lucene作为其内部全文检索子系统的核心。apache软件基金会的网站使用了Lucene作为全文检索的引擎，IBM的开源软件eclipse的2.1版本中也采用了Lucene作为帮助子系统的全文索引引擎，相应的IBM的商业软件Web Sphere中也采用了Lucene。Lucene以其开放源代码的特性、优异的索引结构、良好的系统架构获得了越来越多的应用。

Lucene作为一个全文检索引擎，其具有如下突出的优点：

（1）索引文件格式独立于应用平台。Lucene定义了一套以8位字节为基础的索引文件格式，使得兼容系统或者不同平台的应用能够共享建立的索引文件。

（2）在传统全文检索引擎的倒排索引的基础上，实现了分块索引，能够针对新的文件建立小文件索引，提升索引速度。然后通过与原有索引的合并，达到优化的目的。

（3）优秀的面向对象的系统架构，使得对于Lucene扩展的学习难度降低，方便扩充新功能。

（4）设计了独立于语言和文件格式的文本分析接口，索引器通过接受Token流完成索引文件的创立，用户扩展新的语言和文件格式，只需要实现文本分析的接口。

（5）已经默认实现了一套强大的查询引擎，用户无需自己编写代码即使系统可获得强大的查询能力，Lucene的查询实现中默认实现了布尔操作、模糊查询（Fuzzy Search）、分组查询等等。

2.什么是solr？

为什么要solr：

　　solr是将整个索引操作功能封装好了的搜索引擎系统(企业级搜索引擎产品)

　　solr可以部署到单独的服务器上(WEB服务)，它可以提供服务，我们的业务系统就只要发送请求，接收响应即可，降低了业务系统的负载

　　solr部署在专门的服务器上，它的索引库就不会受业务系统服务器存储空间的限制

　　solr支持分布式集群，索引服务的容量和能力可以线性扩展

solr的工作机制：

　　solr就是在lucene工具包的基础之上进行了封装，而且是以web服务的形式对外提供索引功能

　　业务系统需要使用到索引的功能（建索引，查索引）时，只要发出http请求，并将返回数据进行解析即可

Solr 是Apache下的一个顶级开源项目，采用Java开发，它是基于Lucene的全文搜索服务器。Solr提供了比Lucene更为丰富的查询语言，同时实现了可配置、可扩展，并对索引、搜索性能进行了优化。

Solr可以独立运行，运行在Jetty、Tomcat等这些Servlet容器中，Solr 索引的实现方法很简单，用 POST 方法向 Solr 服务器发送一个描述 Field 及其内容的 XML 文档，Solr根据xml文档添加、删除、更新索引。Solr 搜索只需要发送 HTTP GET 请求，然后对 Solr 返回Xml、json等格式的查询结果进行解析，组织页面布局。Solr不提供构建UI的功能，Solr提供了一个管理界面，通过管理界面可以查询Solr的配置和运行情况。

3.lucene和solr的关系

solr是门户，lucene是底层基础，solr和lucene的关系正如hadoop和hdfs的关系。

4.Jetty是什么？

　　Jetty 是一个开源的servlet容器，它为基于Java的web容器，例如JSP和servlet提供运行环境。Jetty是使用Java语言编写的，它的API以一组JAR包的形式发布。开发人员可以将Jetty容器实例化成一个对象，可以迅速为一些独立运行（stand-alone）的Java应用提供网络和web连接。

5.流程概况

6.Jetty接收请求并处理

设置本地调试见<lucene-solr本地调试方法>所示

StartSolrJetty.java

public static void main( String[] args )

  {

    //System.setProperty("solr.solr.home", "../../../example/solr");

    Server server = new Server();

    ServerConnector connector = new ServerConnector(server, new HttpConnectionFactory());

    // Set some timeout options to make debugging easier.

    connector.setIdleTimeout(1000 * 60 * 60);

    connector.setSoLingerTime(-1);

    connector.setPort(8983);

    server.setConnectors(new Connector[] { connector });

    WebAppContext bb = new WebAppContext();

    bb.setServer(server);

    bb.setContextPath("/solr");

    bb.setWar("solr/webapp/web");

//    // START JMX SERVER

//    if( true ) {

//      MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();

//      MBeanContainer mBeanContainer = new MBeanContainer(mBeanServer);

//      server.getContainer().addEventListener(mBeanContainer);

//      mBeanContainer.start();

//    }

    server.setHandler(bb);

    try {

      System.out.println(">>> STARTING EMBEDDED JETTY SERVER, PRESS ANY KEY TO STOP");

      server.start();

      while (System.in.available() == 0) {

        Thread.sleep(5000);

      }

      server.stop();

      server.join();

    }

    catch (Exception e) {

      e.printStackTrace();

      System.exit(100);

    }

  }

其中，Server是http服务器，聚合了Connector(http请求接收器)和请求处理器Hanlder，Server本身是一个handler和一个线程池，Connector使用线程池来调用handle方法。

/** Jetty HTTP Servlet Server.

 * This class is the main class for the Jetty HTTP Servlet server.

 * It aggregates Connectors (HTTP request receivers) and request Handlers.

 * The server is itself a handler and a ThreadPool.  Connectors use the ThreadPool methods

 * to run jobs that will eventually call the handle method.

 */

其工作流程如下图所示

因其不是本文重点，故略去不述。

7.solr调用lucene过程

上篇文章<solr调用lucene底层实现倒排索引源码解析>已经论述，可对照上面的整体流程图进行解读，故略去不述

8.lucene调用过程

从上图可以看出分两个阶段

8.1 创建Weight

8.1.1 创建BooleanWeight

BooleanWeight.java

  BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, float boost) throws IOException {

    super(query);

    this.query = query;

    this.needsScores = needsScores;

    this.similarity = searcher.getSimilarity(needsScores);

    weights = new ArrayList<>();

    for (BooleanClause c : query) {

      Query q=c.getQuery();

      Weight w = searcher.createWeight(q, needsScores && c.isScoring(), boost);

      weights.add(w);

    }

  }

8.1.2 同义词权重分析

SynonymQuery.java

  @Override

  public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {

    if (needsScores) {

      return new SynonymWeight(this, searcher, boost);

    } else {

      // if scores are not needed, let BooleanWeight deal with optimizing that case.

      BooleanQuery.Builder bq = new BooleanQuery.Builder();

      for (Term term : terms) {

        bq.add(new TermQuery(term), BooleanClause.Occur.SHOULD);

      }

      return searcher.rewrite(bq.build()).createWeight(searcher, needsScores, boost);

    }

  }

8.1.3 TermQuery.java

  @Override

  public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {

    final IndexReaderContext context = searcher.getTopReaderContext();

    final TermContext termState;

    if (perReaderTermState == null

        || perReaderTermState.wasBuiltFor(context) == false) {

      if (needsScores) {

        // make TermQuery single-pass if we don't have a PRTS or if the context

        // differs!

        termState = TermContext.build(context, term);

      } else {

        // do not compute the term state, this will help save seeks in the terms

        // dict on segments that have a cache entry for this query

        termState = null;

      }

    } else {

      // PRTS was pre-build for this IS

      termState = this.perReaderTermState;

    }

    return new TermWeight(searcher, needsScores, boost, termState);

  }

调用TermWeight,计算CollectionStatistics和TermStatistics

    public TermWeight(IndexSearcher searcher, boolean needsScores,

        float boost, TermContext termStates) throws IOException {

      super(TermQuery.this);

      if (needsScores && termStates == null) {

        throw new IllegalStateException("termStates are required when scores are needed");

      }

      this.needsScores = needsScores;

      this.termStates = termStates;

      this.similarity = searcher.getSimilarity(needsScores);

      final CollectionStatistics collectionStats;

      final TermStatistics termStats;

      if (needsScores) {

        termStates.setQuery(this.getQuery().getKeyword());

        collectionStats = searcher.collectionStatistics(term.field());

        termStats = searcher.termStatistics(term, termStates);

      } else {

        // we do not need the actual stats, use fake stats with docFreq=maxDoc and ttf=-1

        final int maxDoc = searcher.getIndexReader().maxDoc();

        collectionStats = new CollectionStatistics(term.field(), maxDoc, -1, -1, -1);

        termStats = new TermStatistics(term.bytes(), maxDoc, -1,term.bytes());

      }

      this.stats = similarity.computeWeight(boost, collectionStats, termStats);

    }

调用Similarity的computeWeight

BM25Similarity.java

  @Override

  public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {

    Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);

    float avgdl = avgFieldLength(collectionStats);

    float[] oldCache = new float[256];

    float[] cache = new float[256];

    for (int i = 0; i < cache.length; i++) {

      oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);

      cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);

    }

    return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);

  }

8.2 查询过程

完整过程如下：IndexSearcher调用search方法

  protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)

      throws IOException {

    // TODO: should we make this

    // threaded...?  the Collector could be sync'd?

    // always use single thread:

    for (LeafReaderContext ctx : leaves) { // search each subreader

      final LeafCollector leafCollector;

      try {

        leafCollector = collector.getLeafCollector(ctx);//1

      } catch (CollectionTerminatedException e) {

        // there is no doc of interest in this reader context

        // continue with the following leaf

        continue;

      }

      BulkScorer scorer = weight.bulkScorer(ctx);//2

      if (scorer != null) {

        try {

          scorer.score(leafCollector, ctx.reader().getLiveDocs());//3

        } catch (CollectionTerminatedException e) {

          // collection was terminated prematurely

          // continue with the following leaf

        }

      }

    }

  }

8.2.1 获取Collector

TopScoreDocCollector.java#SimpleTopScoreDocCollector

    @Override

    public LeafCollector getLeafCollector(LeafReaderContext context)

        throws IOException {

      final int docBase = context.docBase;

      return new ScorerLeafCollector() {

        @Override

        public void collect(int doc) throws IOException {

          float score = scorer.score();

/*          Document document=context.reader().document(doc);

*/

          // This collector cannot handle these scores:

          assert score != Float.NEGATIVE_INFINITY;

          assert !Float.isNaN(score);

          totalHits++;

          if (score <= pqTop.score) {

            // Since docs are returned in-order (i.e., increasing doc Id), a document

            // with equal score to pqTop.score cannot compete since HitQueue favors

            // documents with lower doc Ids. Therefore reject those docs too.

            return;

          }

          pqTop.doc = doc + docBase;

          pqTop.score = score;

          pqTop = pq.updateTop();

        }

      };

    }

8.2.2 调用打分socore

  /**

   * Optional method, to return a {@link BulkScorer} to

   * score the query and send hits to a {@link Collector}.

   * Only queries that have a different top-level approach

   * need to override this; the default implementation

   * pulls a normal {@link Scorer} and iterates and

   * collects the resulting hits which are not marked as deleted.

   *

   * @param context

   *          the {@link org.apache.lucene.index.LeafReaderContext} for which to return the {@link Scorer}.

   *

   * @return a {@link BulkScorer} which scores documents and

   * passes them to a collector.

   * @throws IOException if there is a low-level I/O error

   */

  public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {

    Scorer scorer = scorer(context);

    if (scorer == null) {

      // No docs match

      return null;

    }

    // This impl always scores docs in order, so we can

    // ignore scoreDocsInOrder:

    return new DefaultBulkScorer(scorer);

  }

  /** Just wraps a Scorer and performs top scoring using it.

   *  @lucene.internal */

  protected static class DefaultBulkScorer extends BulkScorer {

    private final Scorer scorer;

    private final DocIdSetIterator iterator;

    private final TwoPhaseIterator twoPhase;

    /** Sole constructor. */

    public DefaultBulkScorer(Scorer scorer) {

      if (scorer == null) {

        throw new NullPointerException();

      }

      this.scorer = scorer;

      this.iterator = scorer.iterator();

      this.twoPhase = scorer.twoPhaseIterator();

    }

    @Override

    public long cost() {

      return iterator.cost();

    }

    @Override

    public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {

      collector.setScorer(scorer);

      if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {

        scoreAll(collector, iterator, twoPhase, acceptDocs);

        return DocIdSetIterator.NO_MORE_DOCS;

      } else {

        int doc = scorer.docID();

        if (doc < min) {

          if (twoPhase == null) {

            doc = iterator.advance(min);

          } else {

            doc = twoPhase.approximation().advance(min);

          }

        }

        return scoreRange(collector, iterator, twoPhase, acceptDocs, doc, max);

      }

    }

调用scoreAll方法，遍历Document，执行SimpleTopScoreDocCollector的collect方法，包含打分逻辑<见SimpleTopScoreDocCollector代码>。

    /** Specialized method to bulk-score all hits; we

     *  separate this from {@link #scoreRange} to help out

     *  hotspot.

     *  See <a href="https://issues.apache.org/jira/browse/LUCENE-5487">LUCENE-5487</a> */

    static void scoreAll(LeafCollector collector, DocIdSetIterator iterator, TwoPhaseIterator twoPhase, Bits acceptDocs) throws IOException {

      if (twoPhase == null) {

        for (int doc = iterator.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = iterator.nextDoc()) {

          if (acceptDocs == null || acceptDocs.get(doc)) {

            collector.collect(doc);

          }

        }

      } else {

        // The scorer has an approximation, so run the approximation first, then check acceptDocs, then confirm

        final DocIdSetIterator approximation = twoPhase.approximation();

        for (int doc = approximation.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = approximation.nextDoc()) {

          if ((acceptDocs == null || acceptDocs.get(doc)) && twoPhase.matches()) {

            collector.collect(doc);

          }

        }

      }

    }

总结：

　　梳理整理整个流程太累了。

参考资料

【1】http://www.blogjava.net/hoojo/archive/2012/09/06/387140.html

【2】https://baike.baidu.com/item/jetty/370234?fr=aladdin

solr查询工作原理深入内幕的更多相关文章

Solr Principal - 工作原理/机制
From http://lucene.apache.org/solr/guide/7_1/overview-of-documents-fields-and-schema-design.html The ...
DNS查询的工作原理
二.DNS查询的工作原理 1.DNS查询过程按两部分进行 1.名称查询从客户端计算机开始, 并传送给本机的DNS客户服务程序进行解析 2.如果不能再本机解析查询, 可根据设定的查询DN ...
数据库SQL SELECT查询的工作原理
一般开发员只会应用SQL的四条经典语句:select,insert,delete,update.但是我从来没有研究过它们的工作原理,这篇我想说一说select在数据库中的工作原理. B/S架构中最经典 ...
笔面试复习（spring常用.jar包/事务/控制反转/bean对象管理和创建/springMVC工作原理/sql查询）
###spring常用jar包1.spring.jar是包含有完整发布模块的单个jar包.2.org.springframework.aop包含在应用中使用Spring的AOP特性时所需要的类.3.o ...
（24）ASP.NET Core EF查询（查询的工作原理、跟踪与非跟踪查询）
1.查询生命周期在进入正题时候,我们先来了解EF Core查询的生命周期. 1.1LINQ查询会由Entity Framework Core处理并生成给数据库提供程序可处理的表示形式(说白了就是生成 ...
Elasticsearch工作原理
一.关于搜索引擎各位知道,搜索程序一般由索引链及搜索组件组成. 索引链功能的实现需要按照几个独立的步骤依次完成:检索原始内容.根据原始内容来创建对应的文档.对创建的文档进行索引. 搜索组件用于接收用 ...
【Oracle 集群】ORACLE DATABASE 11G RAC 知识图文详细教程之RAC 工作原理和相关组件（三）
RAC 工作原理和相关组件(三) 概述:写下本文档的初衷和动力,来源于上篇的<oracle基本操作手册>.oracle基本操作手册是作者研一假期对oracle基础知识学习的汇总.然后形成体 ...
【原】Learning Spark (Python版) 学习笔记(三)----工作原理、调优与Spark SQL
周末的任务是更新Learning Spark系列第三篇,以为自己写不完了,但为了改正拖延症,还是得完成给自己定的任务啊 = =.这三章主要讲Spark的运行过程(本地+集群),性能调优以及Spark ...
浏览器内部工作原理--作者：Tali Garsiel
本篇内容为转载,主要用于个人学习使用,作者:Tali Garsiel 一.介绍浏览器可以被认为是使用最广泛的软件,本文将介绍浏览器的工作原理,我们将看到,从你在地址栏输入google.com到你看到 ...

随机推荐

redis学习-散列表常用命令（hash）
redis学习-散列表常用命令(hash) hset,hmset:给指定散列表插入一个或者多个键值对 hget,hmget:获取指定散列表一个或者多个键值对的值 hgetall:获取所欲哦键值以及 ...
vuex创建store并用computed获取数据
vuex中的store是一个状态管理器,用于分发数据.相当于父组件数据传递给子组件. 1.安装vuex npm i vuex --save 2.在src目录中创建store文件夹,里面创建store. ...
突破内网限制上网（ssh+polipo）
最近到客户这里来做项目,发现客户对网络的把控实在严格,很多网站都不能访问到,搜索到的技术文档也屏蔽了.突然想到了FQ工具的原理,刚好自己也有台服务器在外头,部署个Polipo代理然后用ssh隧道连接. ...
deepCopy深拷贝
function deepCopy(p,c){ var c = c || {}; for ( var i in p ){ //确保属于自己的属性 if ( p.hasOwnProperty( i ) ...
REdis主挂掉后复制节点才起来会如何？
结论: 这种情况下复制节点(即从节点)无法提升为主节点,复制节点会一直尝试和主节点建立连接,直接成功.主节点恢复后,复制节点仍然保持为复制节点,并不会成为主节点. 复制节点无法提升为主节点的原因是复制 ...
uva10256(计算几何)
省选前练模板系列: #include<cmath> #include<cstdio> #include<cstring> #include<iostream& ...
hadoop2.7集群安装
1. 按照官方文档对单节点的配置,将etc/hadoop/core-site.xml中的localhost改成node13. http://hadoop.apache.org/docs/r2.7.3/ ...
C# WebAPI系列(2)
上篇中简单介绍了一下WebApi,本章主要介绍一下Controller相关的知识. 在实际应用中,Controller是WebAPI的链接服务器和客户端的窗口.Controller的好坏影响整个系统的 ...
Node.js中实现套接字服务
后端服务的一个重要的部分是通过套接字进行通信的能力. 套接字允许一个进程通过一个IP地址和端口与另一个进程通信同一个服务器上的两个不同进程的进程间通信(IPC)或者访问一个完全不同的服务器上运行的 ...
@ResponseBody 返回乱码的解决办法
1:最快的最简单的办法是在Ajax请求脸面指定头信息Accept属性,StringHttpMessageConverter默认iso-8859-1编码,但是会根据请求头信息指定的编码格式来转换 ...

solr查询工作原理深入内幕

solr查询工作原理深入内幕的更多相关文章

随机推荐

热门专题