https://www.elastic.co/cn/blog/frame-of-reference-and-roaring-bitmaps

http://roaringbitmap.org/

2015年2月18日Engineering

Frame of Reference and Roaring Bitmaps

作者

Postings lists

While it may surprise you if you are new to search engine internals, one of the most important building blocks of a search engine is the ability to efficiently compress and quickly decode sorted lists of integers. Why is this useful? As you may know, Elasticsearch shards, which are Lucene indices under the hood, split the data that they store into segments which are regularly merged together. Inside each segment, documents are given an identifier between 0 and the number of documents in the segment (up to 231-1). This is conceptually like an index in an array: it is stored nowhere but is enough to identity an item. Segments store data about documents sequentially, and a doc ID is the index of a document in a segment. So the first document in a segment would have a doc ID of 0, the second 1, etc. until the last document, which has a doc ID equal to the total number of documents in the segment minus one.

Why are these doc IDs useful? An inverted index needs to map terms to the list of documents that contain this term, called a postings list, and these doc IDs that we just discussed are a perfect fit since they can be compressed efficiently.

Frame Of Reference

In order to be able to compute intersections and unions efficiently, we require that these postings lists are sorted. A nice side-effect of this decision is that postings lists can be compressed with delta-encoding.

For instance, if your postings list is [73, 300, 302, 332, 343, 372], the list of deltas would be [73, 227, 2, 30, 11, 29]. What is interesting to note here is that all deltas are between 0 and 255, so you only need one byte per value. This is the technique that Lucene is using in order to encode your inverted index on disk: postings lists are split into blocks of 256 doc IDs and then each block is compressed separately using delta-encoding and bit packing: Lucene computes the maximum number of bits required to store deltas in a block, adds this information to the block header, and then encodes all deltas of the block using this number of bits. This encoding technique is known as Frame Of Reference (FOR) in the literature and has been used since Lucene 4.1.

Here is an example with a block size of 3 (instead of 256 in practice):

Frame of Reference and Roaring Bitmaps的更多相关文章

  1. OD: Register, Stack Frame, Function Reference

    几个重要的 Win32 寄存器 EIP 指令寄存器(Extended Instruction Pointer) 存放一个指针,指向下一条等待执行的指令地址 ESP 栈指针寄存器(Extended St ...

  2. Elasticsearch 通关教程(七): Elasticsearch 的性能优化

    硬件选择 Elasticsearch(后文简称 ES)的基础是 Lucene,所有的索引和文档数据是存储在本地的磁盘中,具体的路径可在 ES 的配置文件../config/elasticsearch. ...

  3. Elasticsearch 技术分析(九):Elasticsearch的使用和原理总结

    前言 之前已经分享过Elasticsearch的使用和原理的知识,由于近期在公司内部做了一次内部分享,所以本篇主要是基于之前的博文的一个总结,希望通过这篇文章能让读者大致了解Elasticsearch ...

  4. 全文搜索引擎Elasticsearch详细介绍

    我们生活中的数据总体分为两种:结构化数据 和 非结构化数据. 结构化数据:也称作行数据,是由二维表结构来逻辑表达和实现的数据,严格地遵循数据格式与长度规范,主要通过关系型数据库进行存储和管理.指具有固 ...

  5. L ==> E · L · K

    三剑客:Elastic Stack 在学习ELK前,先对 Lucene作基本了解. 今天才知道关系型数据库的索引是 B-Tree,罪过... 减少磁盘寻道次数 ---> 提高查询性能 Lucen ...

  6. 带你走进神一样的Elasticsearch索引机制

    更多精彩内容请看我的个人博客 前言 相比于大多数人熟悉的MySQL数据库的索引,Elasticsearch的索引机制是完全不同于MySQL的B+Tree结构.索引会被压缩放入内存用于加速搜索过程,这一 ...

  7. Busting Frame Busting: a Study of Clickjacking Vulnerabilities on Popular Sites

    Busting Frame Busting Reference From: http://seclab.stanford.edu/websec/framebusting/framebust.pdf T ...

  8. Frames of Reference参考框架

    Frames of Reference参考框架 When describing the position and orientation of something (for example, your ...

  9. Elasticsearch索引原理

    转载 http://blog.csdn.net/endlu/article/details/51720299 最近在参与一个基于Elasticsearch作为底层数据框架提供大数据量(亿级)的实时统计 ...

随机推荐

  1. .NET Core 使用MediatR CQRS模式 读写分离

    前言 CQRS(Command Query Responsibility Segregation)命令查询职责分离模式,它主要从我们业务系统中进行分离出我们(Command 增.删.改)和(Query ...

  2. 编程方式实现MySQL批量导入sql文件

    有时候需要在本地导入一些stage环境的数据到本地mysql,面对1000+的sql文件(包含表结构和数据,放在同一个文件夹下),使用navicat一个一个导入sql文件显然有点太慢了,于是考虑使用s ...

  3. 2021年第一个flag

    2021年开始更新本文列出的系列文章,根据书籍和自己的理解整理出spring框架的相关的学习 Spring 的设计理念和整体架构 学习目标 Spring的各个子项目 Spring的设计目标 Sprin ...

  4. Sentry(v20.12.1) K8S 云原生架构探索,玩转前/后端监控与事件日志大数据分析,高性能+高可用+可扩展+可伸缩集群部署

    Sentry 算是目前开源界集错误监控,日志打点上报,事件数据实时分析最好用的软件了,没有之一.将它部署到 Kubernetes,再搭配它本身自带的利用 Clickhouse (大数据实时分析引擎)构 ...

  5. zabbix v3.0安装部署【转】

    关于zabbix及相关服务软件版本: Linux:oracle linux 6.5 nginx:1.9.15 MySQL:5.5.49 PHP:5.5.35 一.安装nginx: 安装依赖包: yum ...

  6. vue 深度作用选择器

    使用 scoped 后,父组件的样式将不会渗透到子组件中 如果想在使用scoped,不污染全局的情况下,依然可以修改子组件样式,可以使用深度作用选择器 .tree{ width: 100%; floa ...

  7. Centos7 Nginx+PHP7 配置

    Centos7 Nginx+PHP7 配置 内容: 源码编译安装Nginx和PHP 配置PHP和Nginx,实现Nginx转发到PHP处理 测试 设置Nginx.PHP开机自启 安装的版本: Ngin ...

  8. linux网关服务器

    问题 多台服务器在内网网段,其中只有一台有公网ip可以上外网,需要让所有服务器都能连接外网 解决思路 使用路由转发的方式,将拥有公网ip的服务器搭建为网关服务器,即作为统一的公网出口 所谓转发即当主机 ...

  9. MyBatis初级实战之三:springboot集成druid

    OpenWrite版: 欢迎访问我的GitHub https://github.com/zq2599/blog_demos 内容:所有原创文章分类汇总及配套源码,涉及Java.Docker.Kuber ...

  10. 【System】I/O密集型和CPU密集型工作负载之间有什么区别

    CPU密集型(CPU-bound) CPU密集型也叫计算密集型,指的是系统的硬盘.内存性能相对CPU要好很多,此时,系统运作大部分的状况是CPU Loading 100%,CPU要读/写I/O(硬盘/ ...