Elasticsearch: Five Things I was Doing Wrong
Elasticsearch: Five Things I was Doing Wrong
Update: Also check out my series on scaling Elasticsearch.
I’ve been working with Elasticsearch off and on for over a year, but recently I attended Elasticsearch.com’s training class (well worth the time and money) and discovered a few significant things that I was doing just plain wrong.
Before using Elasticsearch I used Lucene directly, and so a few of the errors I made were due to not understanding some of the things ES does for you behind the scenes.
As background, most of the data I’m indexing conforms to the WordPress database schema.
1. Use Arrays for Fields with Multiple Values
For some reason I had neglected to use arrays when creating fileds such as a list of tags attached to a document. At some point I started concatenating the tags together into a long string separated by semicolons and I used a custom analyzer to break them apart like this:
"analysis" : {
"tokenizer" : {
"semicolon_token" : {
"type" => "pattern",
"pattern" => ";"
} },
"analyzer" : {
"wp_tag_analyzer" : {
"type" => "custom",
"tokenizer" => "semicolon_token",
} }
}
Or, for fields that were lists of URLs I just separated them by spaces and used the whitespace analyzer. Both methods worked fine for the initial applications, but have some obvious drawbacks. Explicitly inserting a character sequence as a delimiter will almost always means you will hit an edge case somewhere down the road where it will break.
Using an array of items is a much easier way, but somehow, after initially reading about the array mapping, I completely forgot that it existed. I think I was thinking of ES too much as a text searching engine and not enough as a general JSON data store.
2. Don’t Use store=true When Mapping Fields
If you are storing the full _source of the document, then there is very little reason to store individual fields separately. You just inflate your index size. I originally started storing the content and titles of documents because I thought it might speed up the highlighting. In practice, I don’t think it did anything for me, and many of our queries don’t do any highlighting at all.
In the end this was a case of premature optimization. Maybe at some point if I find that 90% of the time we are just returning the post_id and using that to lookup the original content in MySQL we’ll consider storing that separately to reduce network traffic and load caused by extracting the post_id field from _source, but that still feels premature at this point.
For debugging reasons I would never consider turning off storing _source. It is far too useful to know exactly what data was entered, and you never know when you might want to use a different field for a new application.
3. Don’t Manually Flush, Optimize, or Refresh
Elasticsearch takes care of these core Lucene operations for me, there was never any good reason for me to issue one of these commands when the default ES settings would accomplish it within a few minutes.
The optimize command in particular is dangerous since it merges all segments in the Lucene index (a very time consuming operation). The code I wrote which at first was issuing innocuous optimize commands after doing some bulk indexing by hand eventually started getting called repeatedly in automated jobs. Fortunately it never rose to a level of causing real problems, but its easy for code you write to get unintentionally called.
Again, this was a case of premature optimization.
4. Set the Appropriate Production Flags
This is another case that didn’t cause a real issue, but could have in the future. The default settings for ES are set to ensure it works to quickly start development. This means that a few of the default settings are not what you want when in production. In particular:
- discovery.zen.minimum_master_nodes
- Should be set to something like N/2 + 1 where N is the number of available master nodes.
- action.disable_delete_all_indices
- Do you really want to allow a single command (that could be mistyped) to delete all of your indices? No, neither do I.
- gateway.recover_after_nodes
- How many nodes need to be up before the recovery process starts replicating data around the cluster.
- index.cache.field.type: soft (in 0.90 this field name changed to index.fielddata.cache. Thanks Olivier for the heads up.)
- I started setting my field cache to soft to ensure that it never created OutOfMemory errors. I think this was particularly helpful because we are doing a lot of faceting.
- Update 2014-01-09: the indices.fielddata.cache.size setting introduced in 0.90 is a better way to prevent running into OutOfMemory exceptions due to the field cache getting too big. I am no longer using the soft field data cache.
5. Do Not Use _type as Another Field
The _type field can entice you to use it as another field to indicate a category for your document. Don’t let it.
Here’s where I went wrong. WordPress posts can have different types (post_type) which allow displaying the content of the post in different ways (e.g. image posts, video posts, quotes, a status message). This despite the different post types all using the same schema. This seemed to align pretty well with the _type fields so I used an ES dynamic mapping to have post_type == _type.
The biggest problem with this is how do you determine the document’s _type after a post has been deleted from the database and you want to also delete it from your index. A document is uniquely identified both by its _id and its _type.
- If you delete from your RDBMS first (or NoSQL data store flavor of the month), then you may no longer have the
_typeavailable to delete the object. - If you delete from ES first then what if the RDBMS delete operation fails for some reason.
Making the _type independent of any data within the document ensures that all you will need is the document id. This was one of those “Oh, that was dumb of me” bugs that I completely missed when building my index.
转自:https://greg.blog/2013/01/24/elasticsearch-five-things-i-was-doing-wrong/
Elasticsearch: Five Things I was Doing Wrong的更多相关文章
- Elasticsearch之java的基本操作一
摘要 接触ElasticSearch已经有一段了.在这期间,遇到很多问题,但在最后自己的不断探索下解决了这些问题.看到网上或多或少的都有一些介绍ElasticSearch相关知识的文档,但个人觉得 ...
- Elasticsearch 5.0 中term 查询和match 查询的认识
Elasticsearch 5.0 关于term query和match query的认识 一.基本情况 前言:term query和match query牵扯的东西比较多,例如分词器.mapping ...
- 以bank account 数据为例,认识elasticsearch query 和 filter
Elasticsearch 查询语言(Query DSL)认识(一) 一.基本认识 查询子句的行为取决于 query context filter context 也就是执行的是查询(query)还是 ...
- Ubuntu 14.04中Elasticsearch集群配置
Ubuntu 14.04中Elasticsearch集群配置 前言:本文可用于elasticsearch集群搭建参考.细分为elasticsearch.yml配置和系统配置 达到的目的:各台机器配置成 ...
- ElasticSearch 5学习(10)——结构化查询(包括新特性)
之前我们所有的查询都属于命令行查询,但是不利于复杂的查询,而且一般在项目开发中不使用命令行查询方式,只有在调试测试时使用简单命令行查询,但是,如果想要善用搜索,我们必须使用请求体查询(request ...
- ElasticSearch 5学习(9)——映射和分析(string类型废弃)
在ElasticSearch中,存入文档的内容类似于传统数据每个字段一样,都会有一个指定的属性,为了能够把日期字段处理成日期,把数字字段处理成数字,把字符串字段处理成字符串值,Elasticsearc ...
- .net Elasticsearch 学习入门笔记
一. es安装相关1.elasticsearch安装 运行http://localhost:9200/2.head插件3.bigdesk插件安装(安装细节百度:windows elasticsear ...
- 自己写的数据交换工具——从Oracle到Elasticsearch
先说说需求的背景,由于业务数据都在Oracle数据库中,想要对它进行数据的分析会非常非常慢,用传统的数据仓库-->数据集市这种方式,集市层表会非常大,查询的时候如果再做一些group的操作,一个 ...
- 如何在Elasticsearch中安装中文分词器(IK+pinyin)
如果直接使用Elasticsearch的朋友在处理中文内容的搜索时,肯定会遇到很尴尬的问题--中文词语被分成了一个一个的汉字,当用Kibana作图的时候,按照term来分组,结果一个汉字被分成了一组. ...
- jar hell & elasticsearch ik 版本问题
想给es 安装一个ik 的插件, 我的es 是 2.4.0, 下载了一个版本是 1.9.5, [2016-10-09 16:56:26,248][INFO ][node ] [node-2] init ...
随机推荐
- 高速掌握Lua 5.3 —— Lua与C之间的交互概览
Q:什么是Lua的虚拟栈? A:C与Lua之间通信关键内容在于一个虚拟的栈.差点儿全部的调用都是对栈上的值进行操作,全部C与Lua之间的数据交换也都通过这个栈来完毕.另外,你也能够使用栈来保存暂时变量 ...
- 编译-O 选项对性能提升作用
GCC -O 选项 这个选项控制所有的优化等级.使用优化选项会使编译过程耗费更多的时间,并且占用更多的内存,尤其是在提高优化等级的时候. -O设置一共有五种:-O0.-O1.-O2.-O3和-Os. ...
- linux init->upstart->systemd
http://en.wikipedia.org/wiki/Init init From Wikipedia, the free encyclopedia This article is abo ...
- 应对ie双外边距,不使用hack
1.在浮动元素内层加一层div 2.使用不浮动的内层外边距来定义距离 ie在浮动时,并且使用外边距,会产生双倍外边距.
- 对‘TIFFReadDirectory@LIBTIFF_4.0’未定义的引用-------------- 解决办法
ABLE_DEPRECATED' is defined [-Winvalid-pch] //usr/lib/libvtkIO.so.5.10:对‘TIFFReadDirectory@LIBTIFF_4 ...
- 基于RedHat发行的Apache Tomcat本地提权漏洞
描述 Tomcat最近总想搞一些大新闻,一个月都没到,Tomcat又爆出漏洞.2016年10月11日,网上爆出Tomcat本地提权漏洞,漏洞编号为CVE-2016-5425.此次受到影响的主要是基于R ...
- Socket的UDP协议在erlang中的实现
现在我们看看UDP协议(User Datagram Protocol,用户数据报协议).使用UDP,互联网上的机器之间可以互相发送小段的数据,叫做数据报.UDP数据报是不可靠的,这意味着如果客户端发送 ...
- ASP.NET动态网站制作(0)
前言:一直想系统地学习一下网站建设的相关内容,看过相关的书籍,也跟着视频学过,但总觉得效率不高,学过的东西印象不深刻,或许还是自己动手实践的少.无意中免费听了一堂讲ASP.NET网站建设的课,觉得性价 ...
- iOS 键盘变中文
plist文件添加 Localizations 添加一项字段Chinese (simplified)
- TP框架---thinkphp修改删除数据
1.在控制器MainController里面写一个方法,调用Nation表中的数据. public function zhuyemian() { $n = D("Nation"); ...