Column-store compression

At a high level, doc values are essentially a serialized column-store. As we discussed in the last section, column-stores excel at certain operations because the data is naturally laid out in a fashion that is amenable to those queries.

But they also excel at compressing data, particularly numbers. This is important for both saving space on disk and for faster access. Modern CPU’s are many orders of magnitude faster than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means it is often advantageous to minimize the amount of data that must be read from disk, even if it requires extra CPU cycles to decompress.

To see how it can help compression, take this set of doc values for a numeric field:

Doc      Terms
-----------------------------------------------------------------
Doc_1 | 100
Doc_2 | 1000
Doc_3 | 1500
Doc_4 | 1200
Doc_5 | 300
Doc_6 | 1900
Doc_7 | 4200
-----------------------------------------------------------------

The column-stride layout means we have a contiguous block of numbers:[100,1000,1500,1200,300,1900,4200].

xxx

Doc values use several tricks like this. In order, the following compression schemes are checked:

  1. If all values are identical (or missing), set a flag and record the value
  2. If there are fewer than 256 values, a simple table encoding is used
  3. If there are > 256 values, check to see if there is a common divisor
  4. If there is no common divisor, encode everything as an offset from the smallest value

You’ll note that these compression schemes are not "traditional" general purpose compression like DEFLATE or LZ4. Because the structure of column-stores are rigid and well-defined, we can achieve higher compression by using specialized schemes rather than the more general compression algorithms like LZ4.

You may be thinking "Well that’s great for numbers, but what about strings?" Strings are encoded similarly, with the help of an ordinal table. The strings are de-duplicated and sorted into a table, assigned an ID, and then those ID’s are used as numeric doc values. Which means strings enjoy many of the same compression benefits that numerics do.

The ordinal table itself has some compression tricks, such as using fixed, variable or prefix-encoded strings.

   

摘自:https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html

列存储压缩技巧,除公共除数或者同时减去最小数,字符串压缩的话,直接去重后用数字ID压缩的更多相关文章

  1. ES doc_values介绍——本质是field value的列存储,做聚合分析用,ES默认开启,会占用存储空间(列存储压缩技巧,除公共除数或者同时减去最小数,字符串压缩的话,直接去重后用数字ID压缩)

    doc_values Doc values are the on-disk data structure, built at document index time, which makes this ...

  2. Oracle 12.1.0.2 New Feature翻译学习【In-Memory column store内存列存储】【原创】

    翻译没有追求信达雅,不是为了学英语翻译,是为了快速了解新特性,如有语义理解错误可以指正.欢迎加微信12735770或QQ12735770探讨oracle技术问题:) In-Memory Column ...

  3. SQL Server 列存储索引 第三篇:维护

    列存储索引分为两种类型:聚集的列存储索引和非聚集的列存储索引,在一个表上只能创建一个聚集索引,要么是聚集的列存储索引,要么是聚集的行存储索引,然而一个表上可以创建多个非聚集索引. 一,创建列存储索引 ...

  4. lucene底层数据结构——FST,针对field使用列存储,delta encode压缩doc ids数组,LZ4压缩算法

    参考: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal http://www.slideshare.ne ...

  5. Druid(准)实时分析统计数据库——列存储+高效压缩

    Druid是一个开源的.分布式的.列存储系统,特别适用于大数据上的(准)实时分析统计.且具有较好的稳定性(Highly Available). 其相对比较轻量级,文档非常完善,也比较容易上手. Dru ...

  6. 腾讯Hermes设计概要——数据分析用的是列存储,词典文件前缀压缩,倒排文件递增id、变长压缩、依然是跳表-本质是lucene啊

    转自:http://data.qq.com/article?id=817 三.Hermes设计概要 架构描述 系统核心进程均采用分散化设计,根据业务发展需求,可随意扩缩容机器; 周期性数据直接通过td ...

  7. 列存储段消除(ColumnStore Segment Elimination)

    列存储索引是好的!对于数据仓库和报表工作量,它们是真正的性能加速器.与聚集列存储结合,你会在常规行存储索引(聚集索引,非聚集索引)上获得巨大的压缩好处.而且创建聚集列存储索引非常简单: CREATE ...

  8. SQL Server 2014聚集列存储索引

    转发请注明引用和原文博客(http://www.cnblogs.com/wenBlog) 简介 之前已经写过两篇介绍列存储索引的文章,但是只有非聚集列存储索引,今天再来简单介绍一下聚集的列存储索引,也 ...

  9. SQL Server 2014新特性探秘(3)-可更新列存储聚集索引

    简介      列存储索引其实在在SQL Server 2012中就已经存在,但SQL Server 2012中只允许建立非聚集列索引,这意味着列索引是在原有的行存储索引之上的引用了底层的数据,因此会 ...

随机推荐

  1. Linux的各个文件夹名称解释(FHS)

    对于接触和已经接触过一段时间Linux的使用者来说,系统的各个文件夹名字还是挺让人费解的,什么etc,usr,var等等,大部分也是耳濡目染才有一个大概的概念,例如usr是存放自己编译安装的软件,et ...

  2. C语言基础知识【存储类】

    C 存储类1.存储类定义 C 程序中变量/函数的范围(可见性)和生命周期.这些说明符放置在它们所修饰的类型之前autoregisterstaticextern2.auto 只能用在函数内,即 auto ...

  3. Mark指针的指针(**)和链表使用(*&)

    利用二级指针删除单向链表 彻底理解链表中为何使用指针的指针或者指针的引用 详解C++指针的指针和指针的引用

  4. SWD下载调试填坑,SWD连接丢失问题解决

    野火SWD下载器,设置好以后,第一次下载成功,莫名其妙丢失连接,发现在复位状态可以连接(惊奇) 网络上搜索到把Boot0和Boot1置高,就可以把程序下载到RAM里, 能下载以后就好办了,把程序里SW ...

  5. 开启貌似已经过时很久的新坑:SharePoint服务器端对象模型

    5年前(嗯,是5年前),SharePoint 2010刚发布的时候,曾经和kaneboy试图一起写一本关于SharePoint 2010开发的书,名字叫<SharePoint 2010 应用开发 ...

  6. CMD命令获取电脑所有连接过的WiFi密码

    cmd中输入命令:for /f "skip=9 tokens=1,2 delims=:" %i in ('netsh wlan show profiles') do  @echo ...

  7. 算法设计 mac 字符串 标识 n维度 2 3维度 字符串 标识值 特征值

    基向量

  8. php自定义函数: 加密下载地址

    function getdownurl($downurl, $extime = "3600", $serverid = 1) { if (empty($downurl)) { re ...

  9. 【python】-- 队列(Queue)、生产者消费者模型

    队列(Queue) 在多个线程之间安全的交换数据信息,队列在多线程编程中特别有用 队列的好处: 提高双方的效率,你只需要把数据放到队列中,中间去干别的事情. 完成了程序的解耦性,两者关系依赖性没有不大 ...

  10. Java基础 - 变量转换

    在java中变量转发分为两种,隐式转换和强制转换 隐式转换: byte a = 10; int b = 20; byte c = a + b; // 该方法会报错,转换过程中字节数只能从小变大,不能从 ...