Redis监控工具,命令和调优

1.图形化监控

因为要对Redis做性能测试，发现了GitHub上有个python写的RedisLive监控工具评价不错。结果鼓捣了半天，最后发现其主页中引用了Google的jsapi脚本，必须在线连接谷歌的服务，Stackoverflow上说把js脚本下载到本地也没法解决问题，坑爹！正要放弃时发现了一个从RedisLive fork出去的项目redis-monitor，应该是国人改的吧，去掉了对谷歌jsapi的依赖，并完善了多Redis实例的管理，最终终于看到了久违的曲线图。

首先要保证安装了python。之后下载下列python包安装。可以手动下载tar.gz解压后执行python setup.py install逐一安装，或直接用pip下载：

tornado：一个python的web框架
redis.py：python的redis客户端
python-dateutil
backports.ssl_match_hostname
argparse
setuptools
six

之后从GitHub上下载解压redis-monitor-master，修改src/redis_live.conf。必须配置一个单独的Redis实例存储监控数据，同时可以配置多个要监控的Redis实例。之后启动redis-monitor有些麻烦，需要启动两个前台进程和两个后台进程：

#in src/script/redis-monitor.sh add redis-monitor as a startup service

#start web with port 8888

$ python redis_live.py

# start info collector

$ python redis_monitor.py

#start daemon

$ python redis_live_daemon.py

$ python redis_monitor_daemon.py

2.命令行监控

前面可以看到，虽然图形化监控Redis比较美观、直接，但是安装起来比较麻烦。如果只是想简单看一下Redis的负载情况的话，完全可以用它提供的一些命令来完成。

2.1 吞吐量

Redis提供的INFO命令不仅能够查看实时的吞吐量(ops/sec)，还能看到一些有用的运行时信息。下面用grep过滤出一些比较重要的实时信息，比如已连接的和在阻塞的客户端、已用内存、拒绝连接、实时的tps和数据流量等：

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1 info | grep -e "connected_clients" -e "blocked_clients" -e "used_memory_human" -e "used_memory_peak_human" -e "rejected_connections" -e "evicted_keys" -e "instantaneous"

connected_clients:1

blocked_clients:0

used_memory_human:799.66K

used_memory_peak_human:852.35K

instantaneous_ops_per_sec:0

instantaneous_input_kbps:0.00

instantaneous_output_kbps:0.00

rejected_connections:0

evicted_keys:0

2.2 延迟

2.2.1 客户端PING

从客户端可以监控Redis的延迟，利用Redis提供的PING命令，不断PING服务端，记录服务端响应PONG的时间。下面开两个终端，一个监控延迟，一个监视服务端收到的命令：

[root@vm redis-3.0.3]# src/redis-cli --latency -h 127.0.0.1

min: 0, max: 1, avg: 0.08

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1

127.0.0.1:6379> monitor

OK

1439361594.867170 [0 127.0.0.1:59737] "PING"

1439361594.877413 [0 127.0.0.1:59737] "PING"

1439361594.887643 [0 127.0.0.1:59737] "PING"

1439361594.897858 [0 127.0.0.1:59737] "PING"

1439361594.908063 [0 127.0.0.1:59737] "PING"

1439361594.918277 [0 127.0.0.1:59737] "PING"

1439361594.928469 [0 127.0.0.1:59737] "PING"

1439361594.938693 [0 127.0.0.1:59737] "PING"

1439361594.948899 [0 127.0.0.1:59737] "PING"

1439361594.959110 [0 127.0.0.1:59737] "PING"

如果我们故意用DEBUG命令制造延迟，就能看到一些输出上的变化：

[root@vm redis-3.0.3]# src/redis-cli --latency -h 127.0.0.1

min: 0, max: 1995, avg: 1.60 (2361 samples)

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1

127.0.0.1:6379> debug sleep 1

OK

(1.00s)

127.0.0.1:6379> debug sleep .15

OK

127.0.0.1:6379> debug sleep .5

OK

(0.50s)

127.0.0.1:6379> debug sleep 2

OK

(2.00s)

2.2.2 服务端内部机制

服务端内部的延迟监控稍微麻烦一些，因为延迟记录的默认阈值是0。尽管空间和时间耗费很小，Redis为了高性能还是默认关闭了它。所以首先我们要开启它，设置一个合理的阈值，例如下面命令中设置的100ms：

127.0.0.1:6379> CONFIG SET latency-monitor-threshold 100

OK

因为Redis执行命令非常快，所以我们用DEBUG命令人为制造一些慢执行命令：

127.0.0.1:6379> debug sleep 2

OK

(2.00s)

127.0.0.1:6379> debug sleep .15

OK

127.0.0.1:6379> debug sleep .5

OK

下面就用LATENCY的各种子命令来查看延迟记录：

LATEST：四列分别表示事件名、最近延迟的Unix时间戳、最近的延迟、最大延迟。
HISTORY：延迟的时间序列。可用来产生图形化显示或报表。
GRAPH：以图形化的方式显示。最下面以竖行显示的是指延迟在多久以前发生。
RESET：清除延迟记录。

127.0.0.1:6379> latency latest

1) 1) "command"

   2) (integer) 1439358778

   3) (integer) 500

   4) (integer) 2000

127.0.0.1:6379> latency history command

1) 1) (integer) 1439358773

   2) (integer) 2000

2) 1) (integer) 1439358776

   2) (integer) 150

3) 1) (integer) 1439358778

   2) (integer) 500

127.0.0.1:6379> latency graph command

command - high 2000 ms, low 150 ms (all time high 2000 ms)

--------------------------------------------------------------------------------

#

|

|

|_#

666

mmm

在执行一条DEBUG命令会发现GRAPH图的变化，多出一条新的柱状线，下面的时间2s就是指延迟刚发生两秒钟：

127.0.0.1:6379> debug sleep 1.5

OK

(1.50s)

127.0.0.1:6379> latency graph command

command - high 2000 ms, low 150 ms (all time high 2000 ms)

--------------------------------------------------------------------------------

#

|  #

|  |

|_#|

2222

333s

mmm

还有一个有趣的子命令DOCTOR，它能列出一些指导建议，例如开启慢日志进一步追查问题原因，查看是否有大对象被踢出或过期，以及操作系统的配置建议等。

127.0.0.1:6379> latency doctor

Dave, I have observed latency spikes in this Redis instance. You don't mind talking about it, do you Dave?

1. command: 3 latency spikes (average 883ms, mean deviation 744ms, period 210.00 sec). Worst all time event 2000ms.

I have a few advices for you:

- Check your Slow Log to understand what are the commands you are running which are too slow to execute. Please check http://redis.io/commands/slowlog for more information.

- Deleting, expiring or evicting (because of maxmemory policy) large objects is a blocking operation. If you have very large objects that are often deleted, expired, or evicted, try to fragment those objects into multiple smaller objects.

- I detected a non zero amount of anonymous huge pages used by your process. This creates very serious latency events in different conditions, especially when Redis is persisting on disk. To disable THP support use the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled', make sure to also add it into /etc/rc.local so that the command will be executed again after a reboot. Note that even if you have already disabled THP, you still need to restart the Redis process to get rid of the huge pages already created.

2.2.3 度量延迟Baseline

延迟中的一部分是来自环境的，比如操作系统内核、虚拟化环境等等。Redis提供了让我们度量这一部分延迟基线(Baseline)的方法：

[root@vm redis-3.0.3]# src/redis-cli --intrinsic-latency 100 -h 127.0.0.1

Max latency so far: 2 microseconds.

Max latency so far: 3 microseconds.

Max latency so far: 26 microseconds.

Max latency so far: 37 microseconds.

Max latency so far: 1179 microseconds.

Max latency so far: 1623 microseconds.

Max latency so far: 1795 microseconds.

Max latency so far: 2142 microseconds.

35818026 total runs (avg latency: 2.7919 microseconds / 27918.90 nanoseconds per run).

Worst run took 767x longer than the average latency.

–intrinsic-latency后面是测试的时长(秒)，一般100秒足够了。

2.3 持续实时监控

Unix的WATCH命令是一个非常实用的工具，它可以实时监视任意命令的输出结果。比如上面我们提到的命令，稍加改造就能变成持续地实时监控工具：

[root@vm redis-3.0.3]# watch -n 1 -d "src/redis-cli -h 127.0.0.1 info | grep -e "connected_clients" -e "blocked_clients" -e "used_memory_human" -e "used_memory_peak_human" -e "rejected_connections" -e "evicted_keys" -e "instantaneous""

Every 1.0s: src/redis-cli -h 127.0.0.1 info | grep -e...  Wed Aug 12 14:30:40 2015

connected_clients:1

blocked_clients:0

used_memory_human:799.66K

used_memory_peak_human:852.35K

instantaneous_ops_per_sec:0

instantaneous_input_kbps:0.01

instantaneous_output_kbps:1.23

rejected_connections:0

evicted_keys:0

[root@vm redis-3.0.3]# watch -n 1 -d "src/redis-cli -h 127.0.0.1 latency graph command"

Every 1.0s: src/redis-cli -h 127.0.0.1 latency graph command                                                                                                               Wed Aug 12 14:33:25 2015

command - high 2000 ms, low 150 ms (all time high 2000 ms)

--------------------------------------------------------------------------------

#

|  #

|  |

|_#|

4441

0006

mmmm

2.4 慢操作日志

像SORT、LREM、SUNION等操作在大对象上会非常耗时，使用时要注意参照官方API上每个命令的算法复杂度。用前面介绍过的慢操作日志监控操作的执行时间。就像主流数据库提供的慢SQL日志一样，Redis也提供了记录慢操作的日志。注意这部分日志只会计算纯粹的操作耗时。

slowlog-log-slower-than设置慢操作的阈值，slowlog-max-len设置保存个数，因为慢操作日志与延迟记录一样，都是保存在内存中的：

127.0.0.1:6379> config set slowlog-log-slower-than 500

OK

127.0.0.1:6379> debug sleep 1

OK

(0.50s)

127.0.0.1:6379> debug sleep .6

OK

127.0.0.1:6379> slowlog get 10

1) 1) (integer) 2

   2) (integer) 1439369937

   3) (integer) 473178

   4) 1) "debug"

      2) "sleep"

      3) ".6"

2) 1) (integer) 1

   2) (integer) 1439369821

   3) (integer) 499357

   4) 1) "debug"

      2) "sleep"

      3) "1"

3) 1) (integer) 0

   2) (integer) 1439365058

   3) (integer) 417846

   4) 1) "debug"

      2) "sleep"

      3) "1"

输出的四列的含义分别是：记录的自增ID、命令执行时的时间戳、命令的执行耗时(ms)、命令的内容。注意上面的DEBUG命令并没有包含休眠时间，而只是命令的处理时间。

3.官方优化建议

3.1 网络延迟

客户端可以通过TCP/IP或Unix域Socket连接到Redis。通常在千兆网络环境中，TCP/IP网络延迟是200us(微秒)，Unix域Socket可以低到30us。关于Unix域Socket(Unix Domain Socket)还是比较常用的技术，具体请参考Nginx+PHP-FPM的域Socket配置方法。

什么是域Socket？

维基百科：“Unix domain socket 或者 IPCsocket 是一种终端，可以使同一台操作系统上的两个或多个进程进行数据通信。与管道相比，Unix domain sockets 既可以使用字节流数和数据队列，而管道通信则只能通过字节流。U**nix domain sockets的接口和Internet socket很像，但它不使用网络底层协议来通信。Unix domain socket的功能是POSIX操作系统里的一种组件。Unix domain sockets使用系统文件的地址来作为自己的身份。它可以被系统进程引用。所以两个进程可以同时打开一个Unix domain sockets来进行通信。不过这种通信方式是发生在系统内核里而不会在网络里传播**。”

网络方面我们能做的就是减少在网络往返时间RTT(Round-Trip Time)。官方提供了以下一些建议：

长连接：不要频繁连接/断开到服务器的连接，尽可能保持长连接(Jedis现在就是这样做的)。
域Socket：如果客户端与Redis服务端在同一台机器上的话，使用Unix域Socket。
多参数命令：相比管道，优先使用多参数命令，如mset/mget/hmset/hmget等。
管道化：其次使用管道减少RTT。
LUA脚本：对于有数据依赖而无法使用管道的命令，可以考虑在Redis服务端执行LUA脚本。

3.2 磁盘I/O

3.2.1 写磁盘

尽管Redis也是基于多路I/O复用的单线程机制，但是却没有像Nginx一样提供CPU Affinity的设置，避免fork出的子进程也跑在Redis主进程依附的CPU内核上，导致后台进程影响主进程。所以还是让操作系统自己去调度Redis主进程和后台进程吧。但反过来，如果不开启持久化机制的话，为Redis设置亲和性是否能进一步提升性能呢？

3.2.2 操作系统Swap

如果系统内存不足，可能会将Redis对应的某些页从内存swap到磁盘文件上。可以通过/proc文件夹中的smaps文件查看是否有数据页被swap。如果发现大量页被swap，则可以用vmstat和iostat进一步追查原因：

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1 info | grep process_id

process_id:24191

[root@vm redis-3.0.3]# cat /proc/24191/smaps | grep "Swap"

Swap:                  0 kB

Swap:                  0 kB

Swap:                  0 kB

Swap:                  0 kB

Swap:                  0 kB

            ...

Swap:                  0 kB

Swap:                  0 kB

Swap:                  0 kB

Swap:                  0 kB

3.3 其他因素

3.3.1 Fork子进程

写RDB文件和rewrite AOF文件都需要fork出一个后台进程，fork操作的主要消耗在于页表的拷贝，不同系统的耗时会有些差异。其中，Xen问题比较严重。

3.3.2 Transparent Huge Page

此外，如果Linux开启了THP(Transparent Huge Page)功能的话，会极大地影响延迟。

3.3.3 Key过期

Redis同时使用主动和被动两种方式剔除已经过期的Key：

被动：当客户端访问到Key时，发现已经过期，则剔除
主动：每100ms剔除一批Key，假如过期Key超过25%则反复执行

所以，要避免同一时间超过25%的Key过期导致的Redis阻塞，设置过期时间时可以稍微随机化一些。

4.最后一招：WatchDog

官方说法提供的最后一招(last resort)就是WatchDog。它能够将慢操作的整个函数执行栈打印到Redis日志中。因为它与前面介绍过的将记录保存在内存中的延迟和满操作记录不同，所以记得使用前要在redis.conf中配置logfile日志路径：

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1

127.0.0.1:6379> CONFIG SET watchdog-period 500

OK

127.0.0.1:6379> debug sleep 1

OK

[root@vm redis-3.0.3]# tailf redis.log

      `-._    `-.__.-'    _.-'

          `-._        _.-'

              `-.__.-'                                               

51091:M 12 Aug 15:36:53.337 # Server started, Redis version 3.0.3

51091:M 12 Aug 15:36:53.338 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.

51091:M 12 Aug 15:36:53.338 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.

51091:M 12 Aug 15:36:53.343 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.

51091:M 12 Aug 15:36:53.343 * DB loaded from disk: 0.000 seconds

51091:M 12 Aug 15:36:53.343 * The server is now ready to accept connections on port 6379

51091:signal-handler (1439365058)

--- WATCHDOG TIMER EXPIRED ---

src/redis-server 127.0.0.1:6379(logStackTrace+0x43)[0x450363]

/lib64/libpthread.so.0(__nanosleep+0x2d)[0x3c0740ef3d]

/lib64/libpthread.so.0[0x3c0740f710]

/lib64/libpthread.so.0[0x3c0740f710]

/lib64/libpthread.so.0(__nanosleep+0x2d)[0x3c0740ef3d]

src/redis-server 127.0.0.1:6379(debugCommand+0x58d)[0x45180d]

src/redis-server 127.0.0.1:6379(call+0x72)[0x4201b2]

src/redis-server 127.0.0.1:6379(processCommand+0x3e5)[0x4207d5]

src/redis-server 127.0.0.1:6379(processInputBuffer+0x4f)[0x42c66f]

src/redis-server 127.0.0.1:6379(readQueryFromClient+0xc2)[0x42c7b2]

src/redis-server 127.0.0.1:6379(aeProcessEvents+0x13c)[0x41a52c]

src/redis-server 127.0.0.1:6379(aeMain+0x2b)[0x41a7eb]

src/redis-server 127.0.0.1:6379(main+0x2cd)[0x423c8d]

/lib64/libc.so.6(__libc_start_main+0xfd)[0x3c0701ed5d]

src/redis-server 127.0.0.1:6379[0x419b49]

51091:signal-handler (51091:signal-handler (1439365058) ) --------

附：参考资料

不得不说，Redis的官方文档写得非常不错！从中能学到很多不只是Redis，还有系统方面的知识。前面推荐大家仔细阅读官方网站上的每个主题。