1. ES performs a large number of full GCs. The log looks like this:

[2016-12-15 14:53:21,496][WARN ][monitor.jvm ] [vsp4] [gc][old][94725][4389] duration [26.9s], collections [1]/[27s], total [26.9s]/[15.9h], memory [19.7gb]->[17gb]/[19.8gb], all_pools {[young] [1.1gb]->[43.1mb]/[1.1gb]}{[survivor] [130.2mb]->[0b]/[149.7mb]}{[old] [18.5gb]->[16.9gb]/[18.5gb]}
[2016-12-15 14:53:57,117][WARN ][monitor.jvm ] [vsp4] [gc][old][94731][4390] duration [29.9s], collections [1]/[30.4s], total [29.9s]/[15.9h], memory [18.6gb]->[18gb]/[19.8gb], all_pools {[young] [71.1mb]->[51.8mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [18.4gb]->[18gb]/[18.5gb]}
[2016-12-15 14:54:31,246][WARN ][monitor.jvm ] [vsp4] [gc][old][94735][4391] duration [30.6s], collections [1]/[31.1s], total [30.6s]/[15.9h], memory [18.5gb]->[17.9gb]/[19.8gb], all_pools {[young] [14.3mb]->[1.3mb]/[1.1gb]}{[survivor] [22.1mb]->[0b]/[149.7mb]}{[old] [18.4gb]->[17.9gb]/[18.5gb]}
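To track how little each old-generation collection reclaims, the warning lines can be parsed. The sketch below is a minimal parser for the log format shown above. The function name and the returned keys are my own, and it assumes the old pool is reported in gb, as in these lines.

```python
import re

# Matches the [gc][old] warning lines written by monitor.jvm in the format shown above.
# Assumption: the old-generation figures are reported in gb.
GC_LINE = re.compile(
    r"\[gc\]\[(?P<gen>\w+)\].*?duration \[(?P<duration>[\d.]+)s\].*?"
    r"\{\[old\] \[(?P<before>[\d.]+)gb\]->\[(?P<after>[\d.]+)gb\]/\[(?P<cap>[\d.]+)gb\]\}"
)

def parse_gc_line(line):
    """Return pause time and old-generation usage for one GC line, or None if it does not match."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    before = float(m.group("before"))
    after = float(m.group("after"))
    cap = float(m.group("cap"))
    return {
        "generation": m.group("gen"),
        "duration_s": float(m.group("duration")),
        "reclaimed_gb": round(before - after, 2),
        "old_used_pct_after": round(100 * after / cap, 1),
    }

SAMPLE_GC_LINE = (
    "[2016-12-15 14:53:57,117][WARN ][monitor.jvm ] [vsp4] [gc][old][94731][4390] "
    "duration [29.9s], collections [1]/[30.4s], total [29.9s]/[15.9h], "
    "memory [18.6gb]->[18gb]/[19.8gb], all_pools {[young] [71.1mb]->[51.8mb]/[1.1gb]}"
    "{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [18.4gb]->[18gb]/[18.5gb]}"
)

print(parse_gc_line(SAMPLE_GC_LINE))
```

For the second log line, a 29.9s pause frees only 0.4gb and leaves the old generation about 97% full, which is why the next full GC follows almost immediately.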

ES heap sizing follows two rules:

1. Do not give the heap more than 50% of the available memory.

2. Do not exceed 32 GB.
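The two rules combine into a single minimum. The helper below is only a sketch: the function name is hypothetical, and the 32 GB cap is taken directly from rule 2.

```python
def recommended_heap_gb(total_ram_gb, hard_cap_gb=32.0):
    """Apply both rules: at most half of the machine's RAM, and never above the cap."""
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    return min(total_ram_gb / 2, hard_cap_gb)

for ram in (16, 64, 128):
    print(ram, "GB RAM ->", recommended_heap_gb(ram), "GB heap")
```

On a 128 GB machine the cap wins and the remaining memory is left to the operating system's file cache.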

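Fielddata is loaded into memory per index, not only for the documents that match a query. indices.fielddata.cache.size (for example 5gb or 20%) limits the memory available to fielddata. When the limit is reached, the oldest data is evicted. By default ES does not evict anything. Setting this value has a cost: once memory runs short, evicted data has to be read from disk again each time, which causes heavy disk I/O. If you want ES to keep only the most recent data in memory, however, you need to configure it.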
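The eviction behaviour can be pictured with a toy least-recently-used cache with a byte budget. This is only an illustration of the idea, not how ES implements its cache. The class name, field names and sizes are invented for the example.

```python
from collections import OrderedDict

class FielddataCacheSketch:
    """Toy size-bounded cache: when the budget is exceeded, the oldest entries are dropped."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.entries = OrderedDict()  # field name -> size in bytes, least recently used first
        self.evicted = []             # names of evicted fields, in eviction order

    def load(self, field, size):
        if field in self.entries:
            # Already cached: just mark it as recently used.
            self.entries.move_to_end(field)
            return
        self.entries[field] = size
        self.used += size
        # Evict from the old end until the budget is respected again.
        while self.used > self.max_bytes and len(self.entries) > 1:
            old_field, old_size = self.entries.popitem(last=False)
            self.used -= old_size
            self.evicted.append(old_field)

cache = FielddataCacheSketch(max_bytes=10)
for name in ("a", "b", "c"):
    cache.load(name, 4)
print(cache.evicted, cache.used)
```

Every evicted entry has to be rebuilt from disk the next time it is needed, which is the I/O cost mentioned above.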

Monitoring fielddata

  • per-index using the indices-stats API:

    GET /_stats/fielddata?fields=*
  • per-node using the nodes-stats API:

    GET /_nodes/stats/indices/fielddata?fields=*
  • Or even per-index per-node:
    GET /_nodes/stats/indices/fielddata?level=indices&fields=*

By setting ?fields=*, the memory usage is broken down for each field.
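Once the nodes-stats response has been fetched, the per-field numbers can be summed across nodes to see which field dominates. The sketch below assumes the response nests the figures under nodes, then the node id, then indices, fielddata, fields, with a memory_size_in_bytes value per field. The sample data is hand-written and the node ids and field names are made up.

```python
def fielddata_by_field(nodes_stats):
    """Sum fielddata bytes per field across all nodes, largest first."""
    totals = {}
    for node in nodes_stats.get("nodes", {}).values():
        fields = node.get("indices", {}).get("fielddata", {}).get("fields", {})
        for name, stats in fields.items():
            totals[name] = totals.get(name, 0) + stats.get("memory_size_in_bytes", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hand-written sample in the assumed response shape (not real cluster output).
SAMPLE_STATS = {
    "nodes": {
        "node-a": {"indices": {"fielddata": {"fields": {
            "timestamp": {"memory_size_in_bytes": 4000},
            "user_id": {"memory_size_in_bytes": 1500},
        }}}},
        "node-b": {"indices": {"fielddata": {"fields": {
            "timestamp": {"memory_size_in_bytes": 6000},
        }}}},
    }
}

for field, size in fielddata_by_field(SAMPLE_STATS):
    print(field, size)
```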

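The fielddata circuit breaker estimates, before fielddata is loaded, whether enough memory is available. Without that check, continuing to load fielddata when memory is short would end in an out-of-memory error.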
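The check itself is simple arithmetic: what is already loaded plus the estimate for the new load must stay under a limit. The sketch below is a simplified model with hypothetical names. The 60% default is the value the linked guide gives for the fielddata breaker limit, so treat it as an assumption for other versions.

```python
class BreakerTripped(Exception):
    """Hypothetical stand-in for the error returned when the breaker refuses a load."""

def reserve_fielddata(current_bytes, estimated_bytes, heap_bytes, limit_fraction=0.6):
    """Return the new fielddata total, or raise before anything is loaded if it would exceed the limit."""
    limit = int(heap_bytes * limit_fraction)
    if current_bytes + estimated_bytes > limit:
        raise BreakerTripped(
            "would use %d bytes, limit is %d" % (current_bytes + estimated_bytes, limit)
        )
    return current_bytes + estimated_bytes

print(reserve_fielddata(current_bytes=10, estimated_bytes=5, heap_bytes=100))
```

The point of the design is that the request fails fast with an error instead of taking the whole node down with an OutOfMemoryError.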

Link: https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html

2. Search errors

The error log is as follows:

Failed to execute phase [query], all shards failed; shardFailures {[7l4w6bMqTReFs68KxMe1LA][smart_metadata-2015010100-2015010800][0]: RemoteTransportException[[vsp4][10.17.139.128:9300][indices:data/read/search[phase/query]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@3b702529 on EsThreadPoolExecutor[search, queue capacity = 10000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6a6693a8[Running, pool size = 37, active threads = 37, queued tasks = 10000, completed tasks = 269010]]]; }
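The figures that matter in this message are the pool name, the queue capacity, and the active and queued counts. A small parser makes it easy to confirm that the pool was saturated. This is a sketch against the message format shown above. The function name and result keys are my own.

```python
import re

REJECTION = re.compile(
    r"EsThreadPoolExecutor\[(?P<pool>\w+), queue capacity = (?P<capacity>\d+).*?"
    r"pool size = (?P<size>\d+), active threads = (?P<active>\d+), "
    r"queued tasks = (?P<queued>\d+)"
)

def parse_rejection(message):
    """Extract thread pool figures from an EsRejectedExecutionException message, or None."""
    m = REJECTION.search(message)
    if m is None:
        return None
    info = {key: int(m.group(key)) for key in ("capacity", "size", "active", "queued")}
    info["pool"] = m.group("pool")
    # Saturated: every worker is busy and the queue has no free slot.
    info["saturated"] = info["active"] >= info["size"] and info["queued"] >= info["capacity"]
    return info

SAMPLE_REJECTION = (
    "EsRejectedExecutionException[rejected execution of "
    "org.elasticsearch.transport.TransportService$4@3b702529 on "
    "EsThreadPoolExecutor[search, queue capacity = 10000, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6a6693a8"
    "[Running, pool size = 37, active threads = 37, queued tasks = 10000, "
    "completed tasks = 269010]]]"
)

print(parse_rejection(SAMPLE_REJECTION))
```

Here all 37 search threads are active and all 10000 queue slots are taken, so the node had no choice but to reject the request.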

Solution:

Edit elasticsearch.yml:

threadpool.bulk.type: fixed
threadpool.bulk.size: 120
threadpool.bulk.queue_size: -1
threadpool.search.queue_size: -1

EsRejectedExecutionException in elasticsearch for parallel search

Answer1:

Elasticsearch has a thread pool and a queue for search on each node. A thread pool has N workers ready to handle requests. When a request arrives and a worker is free, that worker handles it. By default the number of workers is derived from the number of cores on that machine. When all workers are busy and more search requests arrive, the requests go to the queue. The size of the queue is also limited. If its default size is, say, 100 and more parallel requests arrive than that, the excess requests are rejected, as you can see in the error log.

The solution to this would be to -

  1. Increase the size of the queue or thread pool - The immediate solution is to increase the size of the search queue. We can also increase the size of the thread pool, but that might badly affect the performance of individual queries, so increasing the queue might be the better idea. Remember, though, that this queue is memory resident, and increasing the queue size too much can result in Out Of Memory issues. You can get more info on the same here.
  2. Increase the number of nodes and replicas - Remember that each node has its own search thread pool and queue. Also, a search can run on either a primary shard or a replica.

Answer2:

Maybe it sounds strange, but you need to lower the number of parallel searches. With that exception, Elasticsearch tells you that you are overloading it. There are some limits (at thread count level) that are set in Elasticsearch and, most of the time, the defaults for these limits are the best option. So, if you are testing your cluster to see how much load it can hold, this would be an indicator that some limits have been reached.

Alternatively, if you really want to change the default, you can try increasing the queue size for searches to accommodate the concurrency demands, but keep in mind that the larger the queue size, the more pressure you put on your cluster, which in the end will cause instability.
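Both answers describe the same mechanism: a fixed number of workers, a bounded queue behind them, and rejection once both are full. The sketch below is a deterministic toy model of that behaviour with no real threads. The class and method names are hypothetical.

```python
class SearchPoolSketch:
    """Toy model of a fixed thread pool with a bounded queue."""

    def __init__(self, workers, queue_size):
        self.workers = workers
        self.queue_size = queue_size
        self.active = 0
        self.queued = 0
        self.rejected = 0

    def submit(self):
        """Accept a request if a worker or a queue slot is free, otherwise reject it."""
        if self.active < self.workers:
            self.active += 1
            return "running"
        if self.queued < self.queue_size:
            self.queued += 1
            return "queued"
        self.rejected += 1
        return "rejected"

    def finish_one(self):
        """One worker finishes; it takes the next queued request if there is one."""
        if self.queued > 0:
            self.queued -= 1  # the worker stays busy with the dequeued request
        elif self.active > 0:
            self.active -= 1

pool = SearchPoolSketch(workers=2, queue_size=3)
print([pool.submit() for _ in range(6)])
```

A larger queue only delays the rejection. Requests waiting in it still hold memory and still have to be served by the same workers, which is the instability the second answer warns about.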

ES Thread Pool

A node holds several thread pools in order to improve how thread memory consumption is managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded.

There are several thread pools, but the important ones include:

generic
For generic operations (e.g., background node discovery). Thread pool type is scaling.
index
For index/delete operations. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.
search
For count/search/suggest operations. Thread pool type is fixed with a size of int((# of available_processors * 3) / 2) + 1, queue_size of 1000.
get
For get operations. Thread pool type is fixed with a size of # of available processors, queue_size of 1000.
bulk
For bulk operations. Thread pool type is fixed with a size of # of available processors, queue_size of 50. The maximum size for this pool is 1 + # of available processors.
percolate
For percolate operations. Thread pool type is fixed with a size of # of available processors, queue_size of 1000.
snapshot
For snapshot/restore operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(5, (# of available processors)/2).
warmer
For segment warm-up operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(5, (# of available processors)/2).
refresh
For refresh operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(10, (# of available processors)/2).
listener
Mainly for the Java client, executing actions when listener threaded is set to true. Thread pool type is scaling with a default max of min(10, (# of available processors)/2).
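The defaults listed above are plain formulas over the processor count, so they are easy to tabulate for a given machine. The sketch below encodes only the numbers quoted in the list. The function name is hypothetical and integer division is assumed for the "/2" terms.

```python
def default_pool_sizes(processors):
    """Default sizes for the pools listed above, for a given number of available processors."""
    half = processors // 2  # assumption: the "/2" in the list is integer division
    return {
        "index": {"size": processors, "queue_size": 200},
        "search": {"size": int(processors * 3 / 2) + 1, "queue_size": 1000},
        "get": {"size": processors, "queue_size": 1000},
        "bulk": {"size": processors, "queue_size": 50},
        "percolate": {"size": processors, "queue_size": 1000},
        "snapshot": {"max": min(5, half)},
        "warmer": {"max": min(5, half)},
        "refresh": {"max": min(10, half)},
        "listener": {"max": min(10, half)},
    }

for name, settings in default_pool_sizes(24).items():
    print(name, settings)
```

For 24 processors the search formula gives 37 threads, the same pool size reported in the rejection log earlier, although the source does not state that node's core count.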

Changing a specific thread pool can be done by setting its type-specific parameters; for example, changing the index thread pool to have more threads:

thread_pool:
    index:
        size: 30

Recovering ES indices after a power failure

1. Translog corruption


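This failure can be handled by setting index.engine.force_new_translog: true (default false), which lets the affected index recover directly when the node restarts after a power loss. Data in the translog that had not been flushed is restored later by the scheduled consistency-check job.
Root cause: the translog of a shard was not written completely when the power went out, so loading it at startup fails and the shard cannot be allocated.
The cost of index.engine.force_new_translog: true is that when the translog is missing or corrupt, any data in it that has not been recovered is lost.
How much recent data can be lost is controlled by the following parameters:
index.translog.flush_threshold_ops: flush after this many operations; the default is unlimited
index.translog.flush_threshold_size: flush once the translog reaches this size; the default is 512mb
index.translog.flush_threshold_period: force a flush if none has happened within this interval; the default is 30min
In other words, the application can use these settings to bound the amount of data lost in a single power failure.
Pay attention to index.translog.interval as well. It is the interval at which the three conditions above are checked (default 5s), and an unsuitable value can keep the thresholds from working as intended.
In testing with index.engine.force_new_translog: true, translog errors no longer left shards unallocated, and the data lost was confirmed to be the most recent unflushed data.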
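The three thresholds act as alternatives: a flush is due as soon as any one of them is met. The sketch below models that check with the defaults quoted above. It is a simplified illustration with hypothetical names, not the ES implementation.

```python
def should_flush(ops, size_bytes, seconds_since_flush,
                 threshold_ops=None,
                 threshold_size=512 * 1024 ** 2,
                 threshold_period=30 * 60):
    """Return True if any of the three translog flush thresholds has been reached."""
    if threshold_ops is not None and ops >= threshold_ops:
        return True  # flush_threshold_ops (None models the "unlimited" default)
    if size_bytes >= threshold_size:
        return True  # flush_threshold_size, 512mb by default
    return seconds_since_flush >= threshold_period  # flush_threshold_period, 30min by default

# With the defaults, a quiet index is flushed only after 30 minutes.
print(should_flush(ops=10, size_bytes=1024, seconds_since_flush=60))
print(should_flush(ops=10, size_bytes=1024, seconds_since_flush=1800))
```

Lowering the size or period threshold narrows the window of unflushed data, at the price of more frequent flushes.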

2. Corruption of non-segment files


This failure means a non-segment file is damaged. It can be repaired with the CheckIndex tool in lucene-core-5.3.1.jar (ES_HOME/lib/lucene-core-5.3.1.jar). The procedure is as follows:
java -cp lucene-core-5.3.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /mnt/disk4/data/LOCALCLUSTER/SERVICE-ELASTICSEARCH-9e6f0b06c3f54797a313ab45734c3b1a/SERVICE-ELASTICSEARCH-9e6f0b06c3f54797a313ab45734c3b1a/nodes/0/indices/blacklist_alarm_info-2016071400-2016072100/0/index/ -exorcise
The recovered index will still contain data. The lost data has to be restored through the consistency check.

3. Corruption of segment files


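This is the most complex case. The data in the index cannot be recovered, and the index has to be allocated again, which loses all data previously in it. The recovery steps are as follows (using 10.17.139.173 as an example):
Step 1: find the unassigned indices: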
curl -s "http://10.17.139.173:9200/_cat/shards" | grep UNASSIGNED
Step 2: get the node information:
curl '10.17.139.173:9200/_nodes/process?pretty'
Step 3: run the allocate command:
curl -XPOST '10.17.139.173:9200/_cluster/reroute' -d '{
"commands" : [ {
"allocate" : {
"index" : "snap_image_info-2016070700-2016071400",
"shard" : 0,
"node" : "eE-S5pHPT8yhjtndXHzT7A",
"allow_primary" : true
}
}
]
}'
These three steps bring the index back to a usable state, but all data in the index is lost.
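When the request body is generated by a program, it is safer to serialize it with a JSON library than to assemble the string by hand. The sketch below builds the same body as the allocate command in step 3. The function name is hypothetical, and the index and node id are the ones from the example above.

```python
import json

def allocate_body(index, shard, node, allow_primary=True):
    """Build the _cluster/reroute request body for one allocate command."""
    command = {
        "allocate": {
            "index": index,
            "shard": shard,
            "node": node,
            # allow_primary=True accepts an empty primary, so the shard's old data is gone.
            "allow_primary": allow_primary,
        }
    }
    return json.dumps({"commands": [command]}, indent=2)

print(allocate_body("snap_image_info-2016070700-2016071400", 0, "eE-S5pHPT8yhjtndXHzT7A"))
```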
The same procedure as a shell script:
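#!/bin/bash
# Reallocate every UNASSIGNED shard. allow_primary discards whatever data the shard held.
for index in $(curl -s 'http://localhost:9200/_cat/shards' | grep UNASSIGNED | awk '{print $1}' | sort | uniq); do
  # Match the index name exactly so that indices sharing a prefix are not mixed up.
  for shard in $(curl -s 'http://localhost:9200/_cat/shards' | grep UNASSIGNED | awk -v idx="$index" '$1 == idx {print $2}' | sort | uniq); do
    echo "$index" "$shard"
    # Strict JSON requires double quotes, so they are escaped inside the shell string.
    # "Master" is the node name from the original script; use a node name from your cluster.
    curl -XPOST 'localhost:9200/_cluster/reroute' -d "{
      \"commands\" : [ {
        \"allocate\" : {
          \"index\" : \"$index\",
          \"shard\" : $shard,
          \"node\" : \"Master\",
          \"allow_primary\" : true
        }
      } ]
    }"
    sleep 5
  done
done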
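The first half of the script, finding the unassigned (index, shard) pairs, can also be done by parsing the _cat/shards text directly. The sketch below assumes the usual column order of that output (index, shard, primary/replica flag, state). The sample text is hand-written for illustration.

```python
def unassigned_shards(cat_shards_text):
    """Return the sorted, de-duplicated (index, shard) pairs whose state is UNASSIGNED."""
    pairs = set()
    for line in cat_shards_text.splitlines():
        cols = line.split()
        # Assumed columns: index, shard number, p/r flag, state, then optional details.
        if len(cols) >= 4 and cols[3] == "UNASSIGNED":
            pairs.add((cols[0], int(cols[1])))
    return sorted(pairs)

# Hand-written sample in the assumed column layout (not real cluster output).
SAMPLE_CAT_SHARDS = """\
snap_image_info-2016070700-2016071400 0 p UNASSIGNED
snap_image_info-2016070700-2016071400 0 r UNASSIGNED
blacklist_alarm_info-2016071400-2016072100 0 p STARTED 120 1.2mb 10.17.139.173 vsp4
"""

print(unassigned_shards(SAMPLE_CAT_SHARDS))
```

Comparing the state column exactly avoids the false matches that a plain grep could produce if an index name happened to contain the word UNASSIGNED.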
More details can be found on the following page:
http://cache.baiducontent.com/c?m=9d78d513d9971ae54fede539514596274201dc346ac0d0643e8ec008c5254f060738ece161645213d2b6617a44ea0c4bea87732b695a77eb8cc8ff158aa6d0756ece6629701e85460fd11eb2cb4738967ec31baff448a6eda372c2f4c5d3a90f128b14523b97f0fc00464b94&p=8b2a971d86cc42af539fc00c554d86&newp=83769a47888111a05bed9f23445f9c231610db2151d4d61e6b82c825d7331b001c3bbfb423231201d3c07f6604ad4258eff13171370825a3dda5c91d9fb4c57479d665&user=baidu&fm=sc&query=fix+unassigned+shards&qid=e4bad4ad0001441e&p1=10

Primary shard failures should not block other primary shard recoveries: 
https://github.com/elastic/elasticsearch/issues/17630

TransportNodesListGatewayStartedShards should fall back to disk based index metadata if not found in cluster state:
https://github.com/elastic/elasticsearch/pull/17663
