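Redis 100% CPU Problem Analysis
Redis version: 4.0.9
Environment: Linux 3.10.107 x86_64, gcc_version:4.8.5
Conclusion: this is a bug, fixed by the author antirez in version 4.0.11.
Symptoms:
1) top shows the redis-server process at 100% CPU
2) Running the Redis info command hangs with no response
3) A large number of connections on the cluster bus port are stuck in "CLOSE_WAIT"
4) The log file is full of "Bad message length or signature received from Cluster bus"
5) Physical and virtual memory usage are both low (the configured maximum memory is 5 GB, while actual physical memory usage is only a little over 4 MB)
6) Seen from other healthy nodes, the faulty node is in the "fail" state
Hypothesis: an infinite loop has occurred.
GDB analysis: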
(gdb) bt
#0  je_malloc (size=size@entry=47) at src/jemalloc.c:1425
#1  0x00000000004329ee in zmalloc (size=size@entry=47) at zmalloc.c:98
#2  0x000000000043cea9 in createEmbeddedStringObject (ptr=0x7f5da721b101 "dog:cgi:proj_trans_query_hb", len=27) at object.c:85
#3  0x000000000043cf65 in createStringObject (ptr=ptr@entry=0x7f5da721b101 "dog:cgi:proj_trans_query_hb", len=<optimized out>) at object.c:119
#4  0x000000000044141b in dbRandomKey (db=0x7f5da72d5000) at db.c:236
#5  0x00000000004414c2 in randomkeyCommand (c=0x7f5da6dd9e00) at db.c:498
#6  0x000000000042c03e in call (c=c@entry=0x7f5da6dd9e00, flags=flags@entry=15) at server.c:2229
#7  0x000000000042c6e7 in processCommand (c=0x7f5da6dd9e00) at server.c:2510
#8  0x000000000043b745 in processInputBuffer (c=0x7f5da6dd9e00) at networking.c:1354
#9  0x00000000004267f0 in aeProcessEvents (eventLoop=eventLoop@entry=0x7f5da723a050, flags=flags@entry=11) at ae.c:440
#10 0x0000000000426adb in aeMain (eventLoop=0x7f5da723a050) at ae.c:498
#11 0x00000000004238ef in main (argc=<optimized out>, argv=0x7ffc0451ab58) at server.c:3894
(gdb) f 5
#5  0x00000000004414c2 in randomkeyCommand (c=0x7f5da6dd9e00) at db.c:498
498     db.c: No such file or directory.
(gdb) p *c
$7 = {id = 144, fd = 78, db = 0x7f5da72d5000, name = 0x0, querybuf = 0x7f5da70ae285 "", pending_querybuf = 0x7f5da7216743 "",
  querybuf_peak = 0, argc = 1, argv = 0x7f5da72167f0, cmd = 0x741820 <redisCommandTable+7520>, lastcmd = 0x741820 <redisCommandTable+7520>,
  reqtype = 2, multibulklen = 0, bulklen = -1, reply = 0x7f5da6d20ea0, reply_bytes = 0, sentlen = 0, ctime = 1542039740,
  lastinteraction = 1542039740, obuf_soft_limit_reached_time = 0, flags = 0, authenticated = 0, replstate = 0, repl_put_online_on_ack = 0,
  repldbfd = -1496183376, repldboff = 0, repldbsize = 0, replpreamble = 0x5bdb29d9 <Address 0x5bdb29d9 out of bounds>, read_reploff = 0,
  reploff = 0, repl_ack_off = 0, repl_ack_time = 0, psync_initial_offset = 1090943882,
  replid = "\000\000\000\000]\177\000\000\000\000\000\000\000\000\000\000\216\000\000\000\000\000\000\000\377\377\377\377\000\000\000\000\000P-\247]\177\000\000",
  slave_listening_port = 0,
  slave_ip = "\000\000\000\000\000\000\000\000\243k!\247]\177", '\000' <repeats 14 times>, "]\177\000\000@g!\247]\177\000\000\000\000\000\000\000",
  slave_capa = 0, mstate = {commands = 0x0, count = 0, minreplicas = -1, minreplicas_timeout = 140040207470240}, btype = 0,
  bpop = {timeout = 0, keys = 0x7f5da72b2aa0, target = 0x0, numreplicas = 0, reploffset = 0, module_blocked_handle = 0x0}, woff = 0,
  watched_keys = 0x7f5da6d20e70, pubsub_channels = 0x7f5da72b2b00, pubsub_patterns = 0x7f5da6d20ed0, peerid = 0x0, bufpos = 0,
  buf = '\000' <repeats 20 times>, "\023.Kp\000\000\000\000\240\016\322\246]\177\000\000\000++\247]\177\000\000\320\016\322\246]\177", '\000' <repeats 14 times>, "$-1\r\n", '\000' <repeats 39 times>, "\240*+\247]\177", '\000' <repeats 34 times>, "\242;\223<\000\000\000\000\000\000\000\000]\177\000\000\000\000\000\000\000\000\000\000\240*+\247]\177", '\000' <repeats 14 times>, "$-1\r", '\000' <repeats 16 times>...}
(gdb) p *c->cmd
$8 = {name = 0x4f3026 "randomkey", proc = 0x4414b0 <randomkeyCommand>, arity = 1, sflags = 0x4f2e87 "rR", flags = 130,
  getkeys_proc = 0x0, firstkey = 0, lastkey = 0, keystep = 0, microseconds = 139, calls = 67}
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007f5da782339b in random () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f5da782339b in random () from /lib64/libc.so.6
#1  0x0000000000428bf5 in dictGetRandomKey (d=0x7f5da7218360) at dict.c:630
#2  0x00000000004413e0 in dbRandomKey (db=0x7f5da72d5000) at db.c:232
#3  0x00000000004414c2 in randomkeyCommand (c=0x7f5da6dd9e00) at db.c:498
#4  0x000000000042c03e in call (c=c@entry=0x7f5da6dd9e00, flags=flags@entry=15) at server.c:2229
#5  0x000000000042c6e7 in processCommand (c=0x7f5da6dd9e00) at server.c:2510
#6  0x000000000043b745 in processInputBuffer (c=0x7f5da6dd9e00) at networking.c:1354
#7  0x00000000004267f0 in aeProcessEvents (eventLoop=eventLoop@entry=0x7f5da723a050, flags=flags@entry=11) at ae.c:440
#8  0x0000000000426adb in aeMain (eventLoop=0x7f5da723a050) at ae.c:498
#9  0x00000000004238ef in main (argc=<optimized out>, argv=0x7ffc0451ab58) at server.c:3894
(gdb) f 3
#3  0x00000000004414c2 in randomkeyCommand (c=0x7f5da6dd9e00) at db.c:498
498     in db.c
(gdb) p *c
$9 = {id = 144, fd = 78, db = 0x7f5da72d5000, name = 0x0, querybuf = 0x7f5da70ae285 "", pending_querybuf = 0x7f5da7216743 "",
  querybuf_peak = 0, argc = 1, argv = 0x7f5da72167f0, cmd = 0x741820 <redisCommandTable+7520>, lastcmd = 0x741820 <redisCommandTable+7520>,
  reqtype = 2, multibulklen = 0, bulklen = -1, reply = 0x7f5da6d20ea0, reply_bytes = 0, sentlen = 0, ctime = 1542039740,
  lastinteraction = 1542039740, obuf_soft_limit_reached_time = 0, flags = 0, authenticated = 0, replstate = 0, repl_put_online_on_ack = 0,
  repldbfd = -1496183376, repldboff = 0, repldbsize = 0, replpreamble = 0x5bdb29d9 <Address 0x5bdb29d9 out of bounds>, read_reploff = 0,
  reploff = 0, repl_ack_off = 0, repl_ack_time = 0, psync_initial_offset = 1090943882,
  replid = "\000\000\000\000]\177\000\000\000\000\000\000\000\000\000\000\216\000\000\000\000\000\000\000\377\377\377\377\000\000\000\000\000P-\247]\177\000\000",
  slave_listening_port = 0,
  slave_ip = "\000\000\000\000\000\000\000\000\243k!\247]\177", '\000' <repeats 14 times>, "]\177\000\000@g!\247]\177\000\000\000\000\000\000\000",
  slave_capa = 0, mstate = {commands = 0x0, count = 0, minreplicas = -1, minreplicas_timeout = 140040207470240}, btype = 0,
  bpop = {timeout = 0, keys = 0x7f5da72b2aa0, target = 0x0, numreplicas = 0, reploffset = 0, module_blocked_handle = 0x0}, woff = 0,
  watched_keys = 0x7f5da6d20e70, pubsub_channels = 0x7f5da72b2b00, pubsub_patterns = 0x7f5da6d20ed0, peerid = 0x0, bufpos = 0,
  buf = '\000' <repeats 20 times>, "\023.Kp\000\000\000\000\240\016\322\246]\177\000\000\000++\247]\177\000\000\320\016\322\246]\177", '\000' <repeats 14 times>, "$-1\r\n", '\000' <repeats 39 times>, "\240*+\247]\177", '\000' <repeats 34 times>, "\242;\223<\000\000\000\000\000\000\000\000]\177\000\000\000\000\000\000\000\000\000\000\240*+\247]\177", '\000' <repeats 14 times>, "$-1\r", '\000' <repeats 16 times>...}
(gdb) p *c->cmd
$10 = {name = 0x4f3026 "randomkey", proc = 0x4414b0 <randomkeyCommand>, arity = 1, sflags = 0x4f2e87 "rR", flags = 130,
  getkeys_proc = 0x0, firstkey = 0, lastkey = 0, keystep = 0, microseconds = 139, calls = 67}
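Comparing the data captured at the two points in time, the address of c has not changed and the dumps are identical (same client id = 144, and the randomkey command's calls counter is still 67), so this is the same object inside the same RANDOMKEY call. Checking the source code of this version and looking for loops narrows the search down to the specific function in which the infinite loop occurs.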
/* Return a random key, in form of a Redis object.
 * If there are no keys, NULL is returned.
 *
 * The function makes sure to return keys not already expired. */
robj *dbRandomKey(redisDb *db) { // db.c:225
    dictEntry *de;

    while(1) {
        sds key;
        robj *keyobj;

        de = dictGetRandomKey(db->dict); // db.c:232
        if (de == NULL) return NULL;

        key = dictGetKey(de);
        keyobj = createStringObject(key,sdslen(key));
        if (dictFind(db->expires,key)) {
            if (expireIfNeeded(db,keyobj)) {
                decrRefCount(keyobj);
                continue; /* search for another key. This expired. */
            }
        }
        return keyobj;
    }
}

void randomkeyCommand(client *c) { // db.c:495
    robj *key;

    if ((key = dbRandomKey(c->db)) == NULL) { // db.c:498
        addReply(c,shared.nullbulk);
        return;
    }

    // Set a breakpoint here. If it is never reached,
    // the infinite loop is confirmed to be inside dbRandomKey.
    addReplyBulk(c,key); // db.c:503
    decrRefCount(key);
}

/* Return a random entry from the hash table. Useful to
 * implement randomized algorithms */
dictEntry *dictGetRandomKey(dict *d) // dict.c:610
{
    dictEntry *he, *orighe;
    unsigned long h;
    int listlen, listele;

    if (dictSize(d) == 0) return NULL;
    if (dictIsRehashing(d)) _dictRehashStep(d);
    if (dictIsRehashing(d)) {
        do {
            /* We are sure there are no elements in indexes from 0
             * to rehashidx-1 */
            h = d->rehashidx + (random() % (d->ht[0].size +
                                            d->ht[1].size -
                                            d->rehashidx));
            he = (h >= d->ht[0].size) ? d->ht[1].table[h - d->ht[0].size] :
                                        d->ht[0].table[h];
        } while(he == NULL);
    } else {
        do {
            h = random() & d->ht[0].sizemask; // dict.c:630
            he = d->ht[0].table[h];
        } while(he == NULL);
    }

    /* Now we found a non empty bucket, but it is a linked
     * list and we need to get a random element from the list.
     * The only sane way to do so is counting the elements and
     * select a random index. */
    listlen = 0;
    orighe = he;
    while(he) {
        he = he->next;
        listlen++;
    }
    listele = random() % listlen;
    he = orighe;
    while(listele--) he = he->next;
    return he;
}
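The GDB analysis shows that the infinite loop is in the function dbRandomKey: its "while(1)" never exits. In other words, the loop-exit statement "if (de == NULL) return NULL;" is never taken.
This looks like a bug, so check the implementation in a newer version (5.0.4). (dictSize is at the same position in the source files of both 4.0.9 and 5.0.4.)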
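Why the loop cannot exit on this node can be illustrated with a toy model. The sketch below is not Redis code: toy_db, toy_expire_if_needed and toy_random_key_409 are hypothetical names invented for illustration. It assumes that on a replica expireIfNeeded() reports a logically expired key as expired but leaves the deletion to the master, which is my reading of the 4.0 sources rather than something shown in the listings here. A spin cap is added only so that the demonstration terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not Redis code: a database holding at most one key. */
typedef struct {
    size_t nkeys;      /* keys currently stored */
    bool key_expired;  /* the key's TTL has already passed */
    bool is_replica;   /* stands in for server.masterhost != NULL */
} toy_db;

/* Returns true if the key counts as expired.
 * Assumption: only a master actually deletes it. */
static bool toy_expire_if_needed(toy_db *db) {
    if (!db->key_expired) return false;
    if (db->is_replica) return true;  /* reported expired, but kept */
    db->nkeys--;                      /* master removes the key */
    return true;
}

/* Mirrors the shape of the 4.0.9 loop. Returns the iteration on which
 * the loop exited, or -1 if spin_cap iterations pass without an exit. */
long toy_random_key_409(toy_db *db, long spin_cap) {
    assert(db != NULL);
    for (long i = 1; i <= spin_cap; i++) {
        if (db->nkeys == 0) return i;            /* "de == NULL" exit */
        if (toy_expire_if_needed(db)) continue;  /* search for another key */
        return i;                                /* live key found */
    }
    return -1;                                   /* would spin forever */
}
```

With one expired key, a master exits on the second iteration because the key has been deleted, while a replica never exits: the key stays in the dictionary, so the "de == NULL" exit is unreachable and every pick is rejected as expired.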
#define dictSize(d) ((d)->ht[0].used+(d)->ht[1].used) // dict.h:147

/* Return a random key, in form of a Redis object.
 * If there are no keys, NULL is returned.
 *
 * The function makes sure to return keys not already expired. */
robj *dbRandomKey(redisDb *db) { // db.c:235
    dictEntry *de;
    int maxtries = 100; // Maximum number of retries. This removes the infinite loop, but it only takes effect under two further preconditions.
    int allvolatile = dictSize(db->dict) == dictSize(db->expires);

    while(1) {
        sds key;
        robj *keyobj;

        de = dictGetRandomKey(db->dict);
        if (de == NULL) return NULL;

        key = dictGetKey(de);
        keyobj = createStringObject(key,sdslen(key));
        if (dictFind(db->expires,key)) {
            if (allvolatile && server.masterhost && --maxtries == 0) {
                /* If the DB is composed only of keys with an expire set,
                 * it could happen that all the keys are already logically
                 * expired in the slave, so the function cannot stop because
                 * expireIfNeeded() is false, nor it can stop because
                 * dictGetRandomKey() returns NULL (there are keys to return).
                 * To prevent the infinite loop we do some tries, but if there
                 * are the conditions for an infinite loop, eventually we
                 * return a key name that may be already expired. */
                return keyobj;
            }
            if (expireIfNeeded(db,keyobj)) {
                decrRefCount(keyobj);
                continue; /* search for another key. This expired. */
            }
        }
        return keyobj;
    }
}
Check whether the condition "if (allvolatile && server.masterhost && --maxtries == 0) {" would hold in the hung 4.0.9 process:
(gdb) p db->dict->ht[0].used
$11 = 1
(gdb) p db->dict->ht[1].used
$12 = 0
(gdb) p server.masterhost
$13 = 0x7f5da7221f41 "10.11.34.35"
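The database holds exactly one key and the node is a replica (server.masterhost is set). Since the loop can only repeat for a key that is found in db->expires, that single key must carry an expire, so allvolatile would hold as well. The if branch would therefore clearly be entered and the loop exited, which confirms that this is a bug and that version 5.0.4 has fixed it. Looking further back, the implementation in 4.0.11 is already identical to 5.0.4, and the fix might have landed in an even earlier version.
The release notes (RELEASENOTES) settle the question: the bug was fixed in 4.0.11:
antirez in commit ab145a9f:
 Fix infinite loop in dbRandomKey().
 1 file changed, 13 insertions(+)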
九九乘法表 ,): ,i+): print(i,'*',j,'=',i*j,end='\t') print() 水仙花数问题描述:100-999之间每个数的立方相加等于原数例如:153=1 ^ 3 + ...