Why DocValues?

The standard way that Solr builds the index is with an inverted index. This style builds a list of terms found in all the documents in the index and next to each term is a list of documents that the term appears in (as well as how many times the term appears in that document). This makes search very fast - since users search by terms, having a ready list of term-to-document values makes the query process faster.

For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this approach is not very efficient. The faceting engine, for example, must look up each term that appears in each document that will make up the result set and pull the document IDs in order to build the facet list. In Solr, this is maintained in memory, and can be slow to load (depending on the number of documents, terms, etc.).

In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.

From day one Apache Lucene provided a solid inverted index datastructure and the ability to store the text and binary chunks in stored field. In a typical usecase the inverted index is used to retrieve & score documents matching one or more terms. Once the matching documents have been scored stored fields are loaded for the top N documents for display purposes. So far so good! However, the retrieval process is essentially limited to the information available in the inverted index like term & document frequency, boosts and normalization factors. So what if you need custom information to score or filter documents? Stored fields are designed for bulk read, meaning the perform best if you load all their data while during document retrieval we need more fine grained data.

Lucene provides a RAM resident FieldCache built from the inverted index once the FieldCache for a specific field is requested the first time or during index reopen. Internally we call this process un-inverting the field since the inverted index is a value to document mapping and FieldCache is a document to value datastructure. For simplicity think of an array indexed by Lucene’s internal documents ID. When the FieldCache is loaded Lucene iterates all terms in a field, parses the terms values and fills the arrays slots based on the document IDs associated with the term. Figure 1. illustrats the process.

Figure 1. Univerting a field to FieldCache

FieldCache serves very well for its purpose since accessing a value is basically doing a constant time array look. However, there are special cases where other datastructures are used in FieldCache but those are out of scope in this post.

摘自:http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/

Low Level Details

Lucene has four underlying types that a docvalues field can have. Currently Solr uses three of these:

  1. NUMERIC: a single-valued per-document numeric type. This is like having a large long[] array for the whole index, though the data is compressed based upon the values that are actually used.

    • For example, consider 3 documents with these values:

             doc[0] = 1005
      doc[1] = 1006
      doc[2] = 1005

      In this example the field would use around 1 bit per document, since that is all that is needed.

  2. SORTED: a single-valued per-document string type. This is like having a large String[] array for the whole index, but with an additional level of indirection. Each unique value is assigned a term number that represents its ordinal value. So each document really stores a compressed integer, and separately there is a "dictionary" mapping these term numbers back to term values.
    • For example, consider 3 documents with these values:

             doc[0] = "aardvark"
      doc[1] = "beaver"
      doc[2] = "aardvark"

      Value "aardvark" will be assigned ordinal 0, and "beaver" 1, creating these two data structures:

             doc[0] = 0
      doc[1] = 1
      doc[2] = 0 term[0] = "aardvark"
      term[1] = "beaver"
  3. SORTED_SET: a multi-valued per-document string type. Its similar to SORTED, except each document has a "set" of values (in increasing sorted order). So it intentionally discards duplicate values (frequency) within a document and loses order within the document.
    • For example, consider 3 documents with these values:

             doc[0] = "cat", "aardvark", "beaver", "aardvark"
      doc[1] =
      doc[2] = "cat"

      Value "aardvark" will be assigned ordinal 0, "beaver" 1, and "cat" 2, creating these two data structures:

             doc[0] = [0, 1, 2]
      doc[1] = []
      doc[2] = [2] term[0] = "aardvark"
      term[1] = "beaver"
      term[2] = "cat"
  4. BINARY: a single-valued per-document byte[] array. This can be used for encoding custom per-document datastructures.

lucene DocValues——本质是为通过docID查找某field的值 看图的更多相关文章

  1. lucene DocValues——本质是为通过docID查找某field的值

    什么是docValues? docValues是一种记录doc字段值的一种形式,在例如在结果排序和统计Facet查询时,需要通过docid取字段值的场景下是非常高效的. 为什么要使用docValues ...

  2. C语言:将ss所指字符串中所有下标为奇数位上的字母转换成大写,若不是字母,则不转换。-删除指针p所指字符串中的所有空白字符(包括制表符,回车符,换行符)-在带头结点的单向链表中,查找数据域中值为ch的结点,找到后通过函数值返回该结点在链表中所处的顺序号,

    //将ss所指字符串中所有下标为奇数位上的字母转换成大写,若不是字母,则不转换. #include <stdio.h> #include <string.h> void fun ...

  3. lucene DocValues——没有看懂

    前言: 在Lucene4.x之后,出现一个重大的特性,就是索引支持DocValues,这对于广大的solr和elasticsearch用户,无疑来说是一个福音,这玩意的出现通过牺牲一定的磁盘空间带来的 ...

  4. Elasticsearch压缩索引——lucene倒排索引本质是列存储+使用嵌套文档可以大幅度提高压缩率

    注意:由于是重复数据,词法不具有通用性!文章价值不大! 摘自:https://segmentfault.com/a/1190000002695169 Doc Values 会压缩存储重复的内容. 给定 ...

  5. 421. Maximum XOR of Two Numbers in an Array——本质:利用trie数据结构查找

    Given a non-empty array of numbers, a0, a1, a2, - , an-1, where 0 ≤ ai < 231. Find the maximum re ...

  6. 【游戏开发】小白学Lua——从Lua查找表元素的过程看元表、元方法

    引言 在上篇博客中,我们简单地学习了一下Lua的基本语法.其实在Lua中有一个还有一个叫元表的概念,不得不着重地探讨一下.元表在实际地开发中,也是会被极大程度地所使用到.本篇博客,就让我们从Lua查找 ...

  7. 使用FindControl("id")查找控件 返回值都是Null的问题

    做了一个通过字符串ID查找页面控件并且给页面控件赋值的功能,过程中遇到了this.FindControl("id")返回值都是Null的问题,记录一下解决办法. 问题的原因是我所要 ...

  8. mysql查找以逗号分隔的值-find_in_set

    有了FIND_IN_SET这个函数.我们可以设计一个如:一只手机即是智能机,又是Andriod系统的. 比如:有个产品表里有一个type字段,他存储的是产品(手机)类型,有 1.智能机,2.Andri ...

  9. C#算法设计查找篇之03-插值查找

    插值查找(Interpolation Search) 该文章的最新版本已迁移至个人博客[比特飞],单击链接 https://www.byteflying.com/archives/701 访问. 插值 ...

随机推荐

  1. Python 双向队列Deque、单向队列Queue 模块使用详解

    Python 双向队列Deque 模块使用详解 创建双向队列Deque序列 双向队列Deque提供了类似list的操作方法: #!/usr/bin/python3 import collections ...

  2. python025 Python3 正则表达式

    Python3 正则表达式 正则表达式是一个特殊的字符序列,它能帮助你方便的检查一个字符串是否与某种模式匹配. Python 自1.5版本起增加了re 模块,它提供 Perl 风格的正则表达式模式. ...

  3. php面向对象(设计模式 工厂模式)

    //设计模式//单例模式//类的计划生育//让该类在外界无法造成对象//让外界可以造一个对象,做一个静态方法返回对象//在累里面可以通过静态变量控制返回对象只能有一个 //class Cat//{// ...

  4. 局部二值模式(Local Binary Patterns)纹理灰度与旋转不变性

    Multiresolution Gray Scale and Rotation Invariant Texture Classification with Local Binary Patterns, ...

  5. 【收藏】实战Nginx与PHP(FastCGI)的安装、配置与优化

    拜读南非蚂蚁大牛的文章真是有所收获 http://ixdba.blog.51cto.com/2895551/806622 一.什么是 FastCGI FastCGI是一个可伸缩地.高速地在HTTP s ...

  6. 2017"百度之星"程序设计大赛 - 初赛(B)度度熊的交易计划

    n个村庄m条带权路,权值为花费,村庄可以造东西卖东西,造完东西可以换地方卖,给出每个村庄造东西花费a和最多个数b.卖东西价值c和最多个数d,求最大收益. 裸的费用流.然而还WA了一发.很好. 建源向每 ...

  7. C#的特性学习草稿

    原文发布时间为:2008-11-22 -- 来源于本人的百度文章 [由搬家工具导入] 举个简单的例子: 先定义个特性 从Attribute继承,并标明用法 [AttributeUsage(Attrib ...

  8. 一份关于webpack2和模块打包的新手指南(一)

    webpack已成为现代Web开发中最重要的工具之一.它是一个用于JavaScript的模块打包工具,但是它也可以转换所有的前端资源,例如HTML和CSS,甚至是图片.它可以让你更好地控制应用程序所产 ...

  9. 一起talk C栗子吧(第一百回:C语言实例--使用信号量进行进程间同步与相互排斥一)

    各位看官们.大家好,上一回中咱们说的是进程间同步与相互排斥的样例,这一回咱们说的样例是:使用信号量进行进程间同步与相互排斥. 闲话休提,言归正转.让我们一起talk C栗子吧! 看官们,信号量是由著名 ...

  10. HDU 1017 A Mathematical Curiosity【看懂题意+穷举法】

    //2014.10.17    01:19 //题意: //先输入一个数N,然后分块输入,每块输入每次2个数,n,m,直到n,m同一时候为零时  //结束,当a和b满足题目要求时那么这对a和b就是一组 ...