lucene底层数据结构——FST，针对field使用列存储，delta encode压缩doc ids数组，LZ4压缩算法

参考：

http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal

http://www.slideshare.net/jpountz/how-does-lucene-store-your-data

http://www.infoq.com/cn/articles/database-timestamp-02?utm_source=infoq&utm_medium=related_content_link&utm_campaign=relatedContent_articles_clk

摘录一些重要的：

看一下Lucene的倒排索引是怎么构成的。

我们来看一个实际的例子，假设有如下的数据：

docid	年龄	性别
1	18	女
2	20	女
3	18	男

这里每一行是一个document。每个document都有一个docid。那么给这些document建立的倒排索引就是：

年龄

18	[1,3]
20	[2]

性别

女	[1,2]
男	[3]

可以看到，倒排索引是per field的，一个字段有一个自己的倒排索引。18,20这些叫做 term，而[1,3]就是posting list。Posting list就是一个int的数组，存储了所有符合某个term的文档id。那么什么是term dictionary 和 term index？

那么什么是term dictionary 和 term index？

假设我们有很多个term，比如：

Carla,Sara,Elin,Ada,Patty,Kate,Selena

如果按照这样的顺序排列，找出某个特定的term一定很慢，因为term没有排序，需要全部过滤一遍才能找出特定的term。排序之后就变成了：

Ada,Carla,Elin,Kate,Patty,Sara,Selena

这样我们可以用二分查找的方式，比全遍历更快地找出目标的term。这个就是 term dictionary。有了term dictionary之后，可以用 logN 次磁盘查找得到目标。但是磁盘的随机读操作仍然是非常昂贵的（一次random access大概需要10ms的时间）。所以尽量少的读磁盘，有必要把一些数据缓存到内存里。但是整个term dictionary本身又太大了，无法完整地放到内存里。于是就有了term index。term index有点像一本字典的大的章节表。比如：

A开头的term ……………. Xxx页

C开头的term ……………. Xxx页

E开头的term ……………. Xxx页

如果所有的term都是英文字符的话，可能这个term index就真的是26个英文字符表构成的了。但是实际的情况是，term未必都是英文字符，term可以是任意的byte数组。而且26个英文字符也未必是每一个字符都有均等的term，比如x字符开头的term可能一个都没有，而s开头的term又特别多。实际的term index是一棵trie 树：

例子是一个包含 "A", "to", "tea", "ted", "ten", "i", "in", 和 "inn" 的 trie 树。这棵树不会包含所有的term，它包含的是term的一些前缀。通过term index可以快速地定位到term dictionary的某个offset，然后从这个位置再往后顺序查找。再加上一些压缩技术（搜索 Lucene Finite State Transducers） term index 的尺寸可以只有所有term的尺寸的几十分之一，使得用内存缓存整个term index变成可能。整体上来说就是这样的效果。

现在我们可以回答“为什么Elasticsearch/Lucene检索可以比mysql快了。Mysql只有term dictionary这一层，是以b-tree排序的方式存储在磁盘上的。检索一个term需要若干次的random access的磁盘操作。而Lucene在term dictionary的基础上添加了term index来加速检索，term index以树的形式缓存在内存中。从term index查到对应的term dictionary的block位置之后，再去磁盘上找term，大大减少了磁盘的random access次数。

额外值得一提的两点是：term index在内存中是以FST（finite state transducers）的形式保存的，其特点是非常节省内存。Term dictionary在磁盘上是以分block的方式保存的，一个block内部利用公共前缀压缩，比如都是Ab开头的单词就可以把Ab省去。这样term dictionary可以比b-tree更节约磁盘空间。

--------------------------------------------------------

lucene并非使用Tree structure
– sorted for range queries
– O(log(n)) search

而是如下核心的数据结构，FST，delta encode压缩数组，列存储，LZ4压缩算法：
●Terms index: map a term prefix to a block in the dict ○ FST: automaton with weighted arcs, compact thanks to shared prefixes/suffixes 核心数据结构，本质是前后缀共享的状态机，类似trie来搜索用户输入的某个单词是否能搜到，搜到的话就跳转到Terms dictionary里去，搜到的结果是单词在terms dict里的offset（本质是数组的偏移量）
Lookup the term in the terms index
– In-memory FST storing terms prefixes
– Gives the offset to look at in the terms dictionary
– Can fast-fail if no terms have this prefix
●Terms dictionary: statistics + pointer in postings lists, Store terms and documents in arrays – binary search
• Jump to the given offset in the terms dictionary
– compressed based on shared prefixes, similarly to a burst trie
– called the “BlockTree terms dict”
• read sequentially until the term is found
●Postings lists: encodes matching docs in sorted order ○ + positions + offsets 倒排的文档ID都在此
• Jump to the given offset in the postings lists
• Encoded using modified FOR (Frame of Reference) delta
– 1. delta-encode
– 2. split into block of N=128 values
– 3. bit packing per block
– 4. if remaining docs, encode with vInt
●Stored fields
• In-memory index for a subset of the doc ids
– memory-efficient thanks to monotonic compression
– searched using binary search
• Stored fields
– stored sequentially
– compressed (LZ4) in 16+KB blocks

Query execution：
• 2 disk seeks per field for search
• 1 disk seek per doc for stored fields
• It is common that the terms dict / postings lists fits into the file-system cache
• “Pulse” optimization
– For unique terms (freq=1), postings are inlined in the terms dict
– Only 1 disk seek
– Will always be used for your primary keys

插入新数据：
Insertion = write a new segment 一直写信segment可以防止使用锁
• Merge segments when there are too many of them
– concatenate docs, merge terms dicts and postings lists (merge sort!)
删除：
Deletion = turn a bit off
• Ignore deleted documents when searching and merging (reclaims space)
• Merge policies favor segments with many deletions

优缺点：
Updates require writing a new segment
– single-doc updates are costly, bulk updates preferred
– writes are sequential
• Segments are never modified in place
– filesystem-cache-friendly
– lock-free!
• Terms are deduplicated
– saves space for high-freq terms
• Docs are uniquely identified by an ord
– useful for cross-API communication
– Lucene can use several indexes in a single query
• Terms are uniquely identified by an ord
– important for sorting: compare longs, not strings
– important for faceting (more on this later)

针对field使用列存储：
Per doc and per field single numeric values, stored in a column-stride fashion
• Useful for sorting and custom scoring
• Norms are numeric doc values

一些设计原则：
• Save file handles
– don’t use one file per field or per doc
• Avoid disk seeks whenever possible
– disk seek on spinning disk is ~10 ms
• BUT don’t ignore the filesystem cache
– random access in small files is fine
• Light compression helps
– less I/O
– smaller indexes
– filesystem-cache-friendly

针对Compression techniques的数据结构：FSTs LZ4

lucene底层数据结构——FST，针对field使用列存储，delta encode压缩doc ids数组，LZ4压缩算法的更多相关文章

Lucene核心数据结构——FST存词典，跳表存倒排或者roarning bitmap 见另外一个文章
Lucene实现倒排表没有使用bitmap,为了效率,lucene使用了一些策略,具体如下:1. 使用FST保存词典,FST可以实现快速的Seek,这种结构在当查询可以表达成自动机时(PrefixQu ...
lucene底层数据结构——底层filter bitset原理，时间序列数据压缩将同一时间数据压缩为一行
如何联合索引查询? 所以给定查询过滤条件 age=18 的过程就是先从term index找到18在term dictionary的大概位置,然后再从term dictionary里精确地找到18这个 ...
Redis学习笔记（二）redis 底层数据结构
在上一节提到的图中,我们知道,可以通过 redisObject 对象的 type 和 encoding 属性.可以决定Redis 主要的底层数据结构:SDS.QuickList.ZipList.Has ...
SQL Server 2014聚集列存储索引
转发请注明引用和原文博客(http://www.cnblogs.com/wenBlog) 简介之前已经写过两篇介绍列存储索引的文章,但是只有非聚集列存储索引,今天再来简单介绍一下聚集的列存储索引,也 ...
应运而生！双11当天处理数据5PB—HiStore助力打造全球最大列存储数据库
阿里巴巴电商业务中历史数据存储与查询相关业务, 大量采用基于列存储技术的HiStore数据库,双11当天HiStore引擎处理数据记录超过6万亿条.原始存储数据量超过5PB.从单日数据处理量上看,该系 ...
SQL Server 列存储索引概述
第一次接触ColumnStore是在2017年,数据库环境是SQL Server 2012,Microsoft开始在SQL Server 2012中推广列存储索引,到现在的SQL Server 201 ...
SQL Server 列存储索引强化
SQL Server 列存储索引强化 SQL Server 列存储索引强化 1. 概述 2.背景 2.1 索引存储 2.2 缓存和I/O 2.3 Batch处理方式 3 聚集索引 3.1 提高索引创建 ...
SQL Server 2012 列存储索引分析（翻译）
一.概述列存储索引是SQL Server 2012中为提高数据查询的性能而引入的一个新特性,顾名思义,数据以列的方式存储在页中,不同于聚集索引.非聚集索引及堆表等以行为单位的方式存储.因为它并不要求 ...
SQL Server 2012 列存储索引分析（转载）
一.概述列存储索引是SQL Server 2012中为提高数据查询的性能而引入的一个新特性,顾名思义,数据以列的方式存储在页中,不同于聚集索引.非聚集索引及堆表等以行为单位的方式存储.因为它并不要求 ...

随机推荐

3----lua的数据转换及运算符
lua的基本数据类型转换转换成字符串 tostring( ... ) 可以将布尔类型和数字类型的值转换为字符串类型的值 local num=1; print(type(num)) newNum = ...
Linux中变量#,@,0,1,2,*,$$,$?的含义
$# 是传给脚本的参数个数 $ 是脚本本身的名字 $ 是传递给该shell脚本的第一个参数 $ 是传递给该shell脚本的第二个参数 $@ 是传给脚本的所有参数的列表 $* 是以一个单字符串显示所有向 ...
个人阅读作业 The Last
对于软件工程M1/M2的总结: 假象-MO 在团队开发的前期,我感觉自己其实给了自己很多的期待,因为一直希望着自己可以在团队中担任一个角色,用自己的力量为团队多做事情,也给了其他人一些假象,那就是看起 ...
Java中String，StringBuffer，StringBuilder的区别及其使用
由于笔试面试经常会问到这个问题,所以在这里先把这些问题搞清楚. String:自JDK1.0开始即有,源码中对String的描述: "Strings are constant; their ...
poj1113Wall(凸包）
链接顺便整理出来一份自己看着比较顺眼的模板 #include <iostream> #include<cstdio> #include<cstring> #inc ...
C# ？（问号）的三个用处（转载）
public DateTime? StatusDateTime = null; 脑子便也出现个问号,这是什么意思呢?网上搜下,总结如下: 1. 可空类型修饰符(?): 引用类型可以使用空引用表示一个不 ...
(一)mtg3000常见操作
一.查看MTG3000主控板IP地址: 重启设备后一直跑到shell,用户名和密码都输入admin,然后输入en进入命令行界面,输入sh int可查看设备IP等信息. 2.升级app.web程序
LA 5135 Mining Your Own Business
求出 bcc 后再……根据大白书上的思路即可. 然后我用的是自定义的 stack 类模板: #include<cstdio> #include<cstring> #includ ...
java获取客户访问IP
原文:http://blog.csdn.net/mydwr/article/details/9357187 /** * 获取访问者IP * * 在一般情况下使用Request.getRemoteAdd ...
java.lang.UnsupportedClassVersionError: org/sonatype/nexus/bootstrap/jsw/JswLauncher : Unsupported major.minor version 51.0
jdk 版本不对,需要修改jdk的版本

lucene底层数据结构——FST，针对field使用列存储，delta encode压缩doc ids数组，LZ4压缩算法

lucene底层数据结构——FST，针对field使用列存储，delta encode压缩doc ids数组，LZ4压缩算法的更多相关文章

随机推荐

热门专题