一次 ElasticSearch 搜索优化

1. 环境

ES6.3.2,索引名称 user_v1,5个主分片,每个分片一个副本。分片基本都在11GB左右,GET _cat/shards/user

一共有3.4亿文档,主分片总共57GB。

Segment信息:curl -X GET "221.228.105.140:9200/_cat/segments/user_v1?v" >> user_v1_segment

user_v1索引一共有404个段:

cat user_v1_segment | wc -l

404

处理一下数据,用Python画个直方图看看效果:

sed -i '1d' file # 删除文件第一行

awk -F ' ' '{print $7}' user_v1_segment >> docs_count # 选取感兴趣的一列(docs.count 列)

with open('doc_count.txt') as f:
data=f.read()
docList = data.splitlines()
docNums = list(map(int,docList))
import matplotlib.pyplot as plt
plt.hist(docNums,bins=40,normed=0,facecolor='blue',edgecolor='black')

大概看一下每个Segment中包含的文档的个数。横坐标是:文档数量,纵坐标是:segment个数。可见:大部分的Segment中只包含了少量的文档(\(0.5*10^7\))

修改refresh_interval为30s,原来默认为1s,这样能在一定程度上减少Segment的数量。然后先force merge将404个Segment减少到200个:

POST /user_v1/_forcemerge?only_expunge_deletes=false&max_num_segments=200&flush=true

但是一看,还是有312个Segment。这个可能与merge的配置有关了。有兴趣的可以了解一下 force merge 过程中这2个参数的意义:

  • ​ merge.policy.max_merge_at_once_explicit
  • ​ merge.scheduler.max_merge_count

执行profile分析:

1,Collector 时间过长,有些分片耗时长达7.9s。关于Profile 分析,可参考:profile-api

2,采用HanLP 分词插件,Analyzer后得到Term,居然有"空格Term",而这个Term的匹配长达800ms!

来看看原因:

POST /_analyze

{

"analyzer": "hanlp_standard",

"text":"人生 如梦"

}

分词结果是包含了空格的:

{
"tokens": [
{
"token": "人生",
"start_offset": 0,
"end_offset": 2,
"type": "n",
"position": 0
},
{
"token": " ",
"start_offset": 0,
"end_offset": 1,
"type": "w",
"position": 1
},
{
"token": "如",
"start_offset": 0,
"end_offset": 1,
"type": "v",
"position": 2
},
{
"token": "梦",
"start_offset": 0,
"end_offset": 1,
"type": "n",
"position": 3
}
]
}

那实际文档被Analyzer了之后是否存储了空格呢?

于是先定义一个索引,开启term_vector。参考store term-vector

PUT user
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"profile": {
"properties": {
"nick": {
"type": "text",
"analyzer": "hanlp_standard",
"term_vector": "yes",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
}

然后PUT一篇文档进去:

PUT user/profile/1
{
"nick":"人生 如梦"
}

查看Term Vector:docs-termvectors

GET /user/profile/1/_termvectors
{
"fields" : ["nick"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}

发现存储的Terms里面有空格。

{
"_index": "user",
"_type": "profile",
"_id": "1",
"_version": 1,
"found": true,
"took": 2,
"term_vectors": {
"nick": {
"field_statistics": {
"sum_doc_freq": 4,
"doc_count": 1,
"sum_ttf": 4
},
"terms": {
" ": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
},
"人生": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
},
"如": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
},
"梦": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
}
}
}
}
}

然后再执行profile 查询分析:

GET user/profile/_search?human=true
{
"profile":true,
"query": {
"match": {
"nick": "人生 如梦"
}
}
}

发现Profile里面居然有针对 空格Term 的查询!!!(注意 nick 后面有个空格)

            "type": "TermQuery",
"description": "nick: ",
"time": "58.2micros",
"time_in_nanos": 58244,

profile结果如下:

 "profile": {
"shards": [
{
"id": "[7MyDkEDrRj2RPHCPoaWveQ][user][0]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "nick:人生 nick: nick:如 nick:梦",
"time": "642.9micros",
"time_in_nanos": 642931,
"breakdown": {
"score": 13370,
"build_scorer_count": 2,
"match_count": 0,
"create_weight": 390646,
"next_doc": 18462,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 220447,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "nick:人生",
"time": "206.6micros",
"time_in_nanos": 206624,
"breakdown": {
"score": 942,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 167545,
"next_doc": 1493,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 36637,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick: ",
"time": "58.2micros",
"time_in_nanos": 58244,
"breakdown": {
"score": 918,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 46130,
"next_doc": 964,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 10225,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:如",
"time": "51.3micros",
"time_in_nanos": 51334,
"breakdown": {
"score": 888,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 43779,
"next_doc": 1103,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 5557,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:梦",
"time": "59.1micros",
"time_in_nanos": 59108,
"breakdown": {
"score": 3473,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 49739,
"next_doc": 900,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 4989,
"advance": 0,
"advance_count": 0
}
}
]
}
],
"rewrite_time": 182090,
"collector": [
{
"name": "CancellableCollector",
"reason": "search_cancelled",
"time": "25.9micros",
"time_in_nanos": 25906,
"children": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "19micros",
"time_in_nanos": 19075
}
]
}
]
}
],
"aggregations": []
}
]
}

而在实际的生产环境中,空格Term的查询耗时480ms,而一个正常词语("微信")的查询,只有18ms。如下在分片[user_v1][3]上的profile分析结果:

 "profile": {
"shards": [
{
"id": "[8eN-6lsLTJ6as39QJhK5MQ][user_v1][3]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "nick:微信 nick: nick:黄色",
"time": "888.6ms",
"time_in_nanos": 888636963,
"breakdown": {
"score": 513864260,
"build_scorer_count": 50,
"match_count": 0,
"create_weight": 93345,
"next_doc": 364649642,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 5063173,
"score_count": 4670398,
"build_scorer": 296094,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "nick:微信",
"time": "18.4ms",
"time_in_nanos": 18480019,
"breakdown": {
"score": 656810,
"build_scorer_count": 62,
"match_count": 0,
"create_weight": 23633,
"next_doc": 17712339,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 7085,
"score_count": 5705,
"build_scorer": 74384,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick: ",
"time": "480.5ms",
"time_in_nanos": 480508016,
"breakdown": {
"score": 278358058,
"build_scorer_count": 72,
"match_count": 0,
"create_weight": 6041,
"next_doc": 192388910,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 5056541,
"score_count": 4665006,
"build_scorer": 33387,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:黄色",
"time": "3.8ms",
"time_in_nanos": 3872679,
"breakdown": {
"score": 136812,
"build_scorer_count": 50,
"match_count": 0,
"create_weight": 5423,
"next_doc": 3700537,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 923,
"score_count": 755,
"build_scorer": 28178,
"advance": 0,
"advance_count": 0
}
}
]
}
],
"rewrite_time": 583986593,
"collector": [
{
"name": "CancellableCollector",
"reason": "search_cancelled",
"time": "730.3ms",
"time_in_nanos": 730399762,
"children": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "533.2ms",
"time_in_nanos": 533238387
}
]
}
]
}
],
"aggregations": []
},

由于我采用的是HanLP分词,用的这个分词插件elasticsearch-analysis-hanlp,而采用ik_max_word分词却没有相应的问题,这应该是分词插件的bug,于是去github上提了一个issue,有兴趣的可以关注。看来我得去研究一下ElasticSearch Analyze整个流程的源码以及加载插件的源码了 :

一次 ElasticSearch 搜索优化的更多相关文章

  1. ElasticSearch搜索介绍四

    ElasticSearch搜索 最基础的搜索: curl -XGET http://localhost:9200/_search 返回的结果为: { "took": 2, &quo ...

  2. ElasticSearch性能优化策略【转】

    ElasticSearch性能优化主要分为4个方面的优化. 一.服务器部署 二.服务器配置 三.数据结构优化 四.运行期优化 一.服务器部署 1.增加1-2台服务器,用于负载均衡节点 elasticS ...

  3. 亿级 Elasticsearch 性能优化

    前言 最近一年使用 Elasticsearch 完成亿级别日志搜索平台「ELK」,亿级别的分布式跟踪系统.在设计这些系统的过程中,底层都是采用 Elasticsearch 来做数据的存储,并且数据量都 ...

  4. Elasticsearch搜索调优权威指南 (2/3)

    本文首发于 vivo互联网技术 微信公众号 https://mp.weixin.qq.com/s/AAkVdzmkgdBisuQZldsnvg 英文原文:https://qbox.io/blog/el ...

  5. Elasticsearch搜索调优权威指南 (1/3)

    本文首发于 vivo互联网技术 微信公众号 https://mp.weixin.qq.com/s/qwkZKLb_ghmlwrqMkqlb7Q英文原文:https://qbox.io/blog/ela ...

  6. Elasticsearch搜索资料汇总

    Elasticsearch 简介 Elasticsearch(ES)是一个基于Lucene 构建的开源分布式搜索分析引擎,可以近实时的索引.检索数据.具备高可靠.易使用.社区活跃等特点,在全文检索.日 ...

  7. 【随笔】Android应用市场搜索优化(ASO)

    参考了几篇网上的关于Android应用市场搜索优化(ASO)的分析,总结几点如下: 优化关键字:举例目前美团酒店在各Android市场上的关键字有“美团酒店.钟点房.团购.7天”等等,而酒店类竞对在“ ...

  8. Elasticsearch搜索结果返回不一致问题

    一.背景 这周在使用Elasticsearch搜索的时候遇到一个,对于同一个搜索请求,会出现top50返回结果和排序不一致的问题.那么为什么会出现这样的问题? 后来通过百度和google,发现这是因为 ...

  9. 针对TianvCms的搜索优化文章源码(无版权, 随便用)

    介绍: 搜索优化虽然不是什么高深的技术, 真正实施起来却很繁琐, 后台集成搜索优化的文章可以便于便于管理, 也让新手更明白优化的步奏以及优化的日常. 特点: 根据自己的经验和查阅各种资料整理而成, 相 ...

随机推荐

  1. SQL SERVER 临时数据库 tempdb 迁移或增加文件

    临时数据库TempDB 虽然是临时库,但对整个数据库系统性能却起到很关键的作用:平时用到的中间数据集会暂时保存到TempDB 中,比如:临时表,排序,临时统计信息,一些中间结果数据,索引重建 等.我们 ...

  2. SQLServer之多表联合查询

    多表联合查询简介 定义:连接查询是关系型数据库最主要的查询,通过连接运算符可以实现多个表连接数据查询. 分类:内连接,外连接,全外连接. 内连接 定义 内联接使用比较运算符根据每个表的通用列中的值匹配 ...

  3. Windows中查看进程的资源消耗(cpu, Disk,Memory,NetWork)

    1.通过Windows Task Manager 的 Performance Tab 可以看到总体的性能消耗情况. 2.如果想看系统中每个进程的资源消耗,可以点击 下面的 Open Resource ...

  4. 求出100以内的素数(java实现)

    j package test1; //2018/11/30 //求100以内的所有素数 public class Main10 { public static void main(String[] a ...

  5. $.each()、$.map()区别浅谈

    遍历应该是各种语言中常会用到的操作了,实现的方法也很多,例如使用for.while等循环语句就可以很轻松的做到对数组或对象的遍历,今天想讲的不是它们,而是简单方便的遍历方法. 大致的整理了一下,经常用 ...

  6. RecyclerView的Item的单击事件

    RecyclerView 的每个Item的点击事件并没有像ListView一样封装在组件中,需要Item的单击事件时就需要自己去实现,在Adapter中为RecyclerView添加单击事件参考如下: ...

  7. MyCP

    一.作业要求 编写MyCP.java 实现类似Linux下cp  XXX1 XXX2的功能,要求MyCP支持两个参数:- java MyCP -tx XXX1.txt XXX2.bin  用来把文本文 ...

  8. Elastic Stack-Kibana使用介绍(七)

    一.前言     主要来讲述一下Kibana使用以及上生产时候的一些配置,要是大家对这块比较感兴趣我到时候也可以在结合Grafana做一些图表方面的介绍,后面等介绍完Beats以后我去阿里云租几台机器 ...

  9. DRF限制访问频次

    官方文档:https://www.django-rest-framework.org/api-guide/throttling/ 1.什么场景下需要限制访问频次呢? 1)防爬虫:爬虫可能会在短时间内大 ...

  10. Python之常用第三方库总结

    在使用python进行开发的时候,经常我们需要借助一些第三方库,进行日常代码的开发工作.这里总结一些常用的类库 1. requests Requests 是用Python语言编写,基于 urllib, ...