How to reduce index size on disk? A few small tricks for shrinking Elasticsearch indices
Ways to slim down ES index files are summarized below (a mapping sketch follows the list):
Raw data:
(1) Borrow from Splunk: store the raw data as one big string
(2) The raw files can be compressed further
Inverted index:
(1) Drop unnecessary inverted-index information, e.g. position/offset postings, and keep only one of _source and stored fields
(2) Merge inverted-index (segment) files to get rid of redundant small files
(3) Once the raw data is stored as a big string, drop doc_values, which back ES aggregations
(4) Other ideas: the postings list is a skip list, which essentially trades space for time; a sorted array could be considered instead.
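A minimal sketch of what some of the points above could look like as an ES 1.x/2.x-era mapping (the index, type, and field names here are hypothetical, and whether _source or doc_values can really be dropped depends on your search and aggregation needs):

```sh
# Hypothetical "logs-slim" index, ES 1.x/2.x mapping syntax:
#  - "index_options": "docs" keeps only doc IDs in the postings for "raw"
#    (no term frequencies/positions), so phrase queries on it stop working
#  - _source is disabled and the raw line is kept as a single stored big-string field
#  - doc_values are disabled on a field that is never sorted or aggregated on
curl -XPUT 'http://localhost:9200/logs-slim' -d '{
  "mappings": {
    "log": {
      "_source": { "enabled": false },
      "properties": {
        "raw":      { "type": "string", "store": true, "index_options": "docs" },
        "clientip": { "type": "string", "index": "not_analyzed", "doc_values": false }
      }
    }
  }
}'
```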
It is strange that I haven't received any suggestions on my query. Anyway, following are some steps which I performed to reduce index size. Hope it will help someone. Please feel free to add more in case I missed something.
1) Delete unnecessary fields (or do not index unwanted fields; I am handling this at the Logstash level)
2) Delete the @message field (if the message field is not in use, you can delete it)
3) Disable the _all field (be careful with this setting)
It is a special catch-all field which concatenates the values of all of the other fields into one big string, using space as a delimiter. It requires extra CPU cycles and uses more disk space. If not needed, it can be completely disabled.
Benefit of having the _all field enabled: it allows you to search for values in documents without knowing which field contains the value, at the cost of extra CPU.
Downside of disabling this field: the Kibana search bar will no longer act as a full-text search bar, so users have to write queries like name:"vikas" or name:vika* (provided name is an analyzed field). Note also that the _all field loses the distinction between field types (string, integer, or IP) because it stores all values as strings.
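For reference, a minimal sketch of disabling _all using ES 1.x-style mapping syntax (the index and type names are made up; for daily Logstash indices this would normally live in an index template instead of per-index creation):

```sh
# Create a hypothetical index with the catch-all _all field disabled (ES 1.x syntax).
curl -XPUT 'http://localhost:9200/logstash-2015.09.02' -d '{
  "mappings": {
    "logs": {
      "_all": { "enabled": false }
    }
  }
}'
```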
4) Analyzed and not_analyzed fields: be very careful when deciding whether a field is analyzed or not_analyzed, because partial searches (name:vik*) need an analyzed field, but analyzed fields consume more disk space. The recommended approach is to make all string fields not_analyzed in the first go and then make a field analyzed later if needed.
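One way to get that default is a dynamic template that maps new string fields as not_analyzed while leaving selected fields analyzed; a sketch in ES 1.x syntax with hypothetical index and field names:

```sh
# Hypothetical "users" index: "name" stays analyzed so partial matches like name:vik*
# work; every other dynamically added string field defaults to not_analyzed.
curl -XPUT 'http://localhost:9200/users' -d '{
  "mappings": {
    "user": {
      "dynamic_templates": [
        {
          "strings_not_analyzed": {
            "match_mapping_type": "string",
            "mapping": { "type": "string", "index": "not_analyzed" }
          }
        }
      ],
      "properties": {
        "name": { "type": "string", "index": "analyzed" }
      }
    }
  }
}'
```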
5) Doc values: doc values are an on-disk data structure, built at document index time, which makes this data access pattern possible. They offload the heap burden by writing the fielddata to disk at index time, allowing Elasticsearch to load the values outside of your Java heap as they are needed. In the latest versions of ES this feature is already enabled by default. In our case we are on ES 1.7.1 and have to enable it explicitly; it consumes extra disk space, but it does not degrade performance at all, and the overall benefits of doc values significantly outweigh the cost.
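On ES 1.x, enabling doc values is a per-field mapping option (supported only on not_analyzed string, numeric, and date fields there); a sketch with hypothetical index and field names:

```sh
# Add a not_analyzed field with doc_values enabled to an existing type (ES 1.x syntax).
curl -XPUT 'http://localhost:9200/logstash-2015.09.02/_mapping/logs' -d '{
  "logs": {
    "properties": {
      "status": { "type": "string", "index": "not_analyzed", "doc_values": true }
    }
  }
}'
```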
Thanks
VG
Source: https://discuss.elastic.co/t/how-to-reduce-index-size-on-disk/49415
The following is taken from: https://github.com/jordansissel/experiments/tree/master/elasticsearch/disk
logstash+elasticsearch storage experiments
These results are from an experiment done in 2012 and are irrelevant today.
Problem: Many users observe a 5x inflation of storage data from "raw logs" vs logstash data stored in elasticsearch.
Hypothesis: There are likely small optimizations we can make on the elasticsearch side to occupy less physical disk space.
Constraints: Data loss is not acceptable (can't just stop storing the logs)
Options:
- Compression (LZF and Snappy)
- Disable the '_all' field
- For parsed logs, there are lots of duplicate and superfluous fields we can remove.
Discussion
The compression features really need no discussion.
The purpose of the '_all' field is described above. In logstash, users have reported success in disabling this feature without losing functionality.
In this scenario, I am parsing apache logs. Logstash reads lines from a file and sets the '@message' field to the contents of that line. After grok parses it and produces a nice structure, making fields like 'bytes', 'response', and 'clientip' available in the event, we no longer need the original log line, so it is quite safe to delete the @message (original log line) in this case. Doing this saves us much duplicate data in the event itself.
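As a rough sketch, on the 0.19/0.20-era Elasticsearch this experiment ran against, the compression and '_all' options discussed above looked roughly like the following (option names are from memory and varied across 0.x releases; on modern versions _source is always compressed and these settings no longer exist):

```sh
# Pre-1.0 Elasticsearch sketch: index with _all disabled and compressed _source.
# Exact option names may differ by 0.x version.
curl -XPUT 'http://localhost:9200/millionlogstest' -d '{
  "mappings": {
    "logs": {
      "_all":    { "enabled": false },
      "_source": { "compress": true }
    }
  }
}'
```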
Test scenarios
- 0: test defaults
- 1: disable _all
- 2: store compress + disable _all
- 3: store compress w/ snappy + disable _all
- 4: compress + remove duplicate things (@message and @source)
- 5: compress + remove all superfluous things (simulate 'apache logs in json')
- 6: compress + remove all superfluous things + use 'grok singles'
Test data
One million apache logs from semicomplete.com:
% du -hs /data/jls/million.apache.logs
218M /data/jls/million.apache.logs
% wc -l /data/jls/million.apache.logs
1000000 /data/jls/million.apache.logs
Environment
This should be unrelated to the experiment, but including for posterity if the run-time of these tests is of interest to you.
- CPU: Xeon E31230 (4-core)
- Memory: 16GB
- Disk: Unknown spinning variety, 1TB
Results
| run | space usage | elasticsearch/original ratio | run time (wall clock) |
|-----|-------------|------------------------------|-----------------------|
| ORIGIN | 218M /data/jls/million.apache.logs | N/A | N/A |
| 0 | 1358M /data/jls/millionlogstest/0.yml | 6.23x | 6m47.343s |
| 1 | 1183M /data/jls/millionlogstest/1.yml | 5.47x | 6m13.339s |
| 2 | 539M /data/jls/millionlogstest/2.yml | 2.47x | 6m17.103s |
| 3 | 537M /data/jls/millionlogstest/3.yml | 2.47x | 6m15.382s |
| 4 | 395M /data/jls/millionlogstest/4.yml | 1.81x | 6m39.278s |
| 5 | 346M /data/jls/millionlogstest/5.yml | 1.58x | 6m35.877s |
| 6 | 344M /data/jls/millionlogstest/6.yml | 1.57x | 6m27.440s |
Conclusion
This test confirms what many logstash users have already reported: it is easy to end up with a 5-6x increase in storage over the raw logs when using common logstash filters such as grok.
Summary of test results:
- Enabling store compression uses 55% less storage
- Removing the @message and @source fields saves you 26% of storage.
- Disabling the '_all' field saves you 13% in storage.
- Using grok with 'singles => true' had no meaningful impact.
- Compression ratios in LZF were the same as Snappy.
Final storage size was 25% of the size of the common case (1358MB vs 344MB!)
Recommendations
- Always enable compression in elasticsearch.
- If you don't need the '_all' field, disable it.
- The 'remove fields' steps performed here will be unnecessary if you log directly in a structured format. For example, if you follow the 'apache log in json' logstash cookbook recipe, grok, date, and mutate filters here will not be necessary, meaning the only tuning you'll have to do is in disabling '_all' and enabling compression in elasticsearch.
Future Work
It's likely we can take this example of "ship apache 'combined format' access logs into logstash" a bit further and with some tuning improve storage a bit more.
For now, I am happy to have reduced the inflation from 6.2x to 1.58x :)