1. https://blog.mygraphql.com/zh/notes/low-tec/network/tcp-inspect/

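Preface

I am not a network expert. After years of troubleshooting network problems in production and test environments, I no longer wanted to muddle through, so I am writing down what I have learned. My understanding of the TCP stack implementation is limited, so treat this article as a reference only.

Why TCP connection health matters

At a minimum, TCP connection health covers:

  • TCP retransmission statistics, the bellwether of network quality
  • MTU/MSS and congestion window sizes, which are key indicators of bandwidth and throughput
  • Send/receive queue and buffer statistics at each layer

I covered this topic in Part 1 of this series ("From locating performance problems, to performance models, to TCP: why still learn TCP in the age of microservices and cloud native"), so I will not repeat it here.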

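How to inspect TCP connection health

Linux exposes TCP connection health metrics at two levels:

  • Host-wide statistics

    Aggregated network health metrics for the whole host (strictly speaking, for the whole network namespace or container). View them with nstat.

  • Per-connection statistics

    The kernel keeps statistics for every TCP connection. View them with ss.

This article covers only the per-connection statistics; the host-wide statistics are covered in a separate article.

The container era

If you know how containers work on Linux, you know that at the kernel level they are namespaces plus cgroups. The TCP health metrics described above are namespace aware: each network namespace keeps its own statistics. When working with containers, be clear about what is namespace aware and what is not.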

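The once-mysterious ss

Many people have used netstat. Because netstat performs poorly when there are many connections, it has gradually been replaced by ss. If you are curious how ss is implemented, jump to the section on internals near the end of this article.

Reference: https://www.net7.be/blog/article/network_activity_analysis_1_netstat.html

Even more mysterious: the undocumented metrics

Introduction to ss

ss is a tool for viewing detailed per-connection statistics. Example: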

    $ ss -taoipnm
    State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
    ESTAB 0 0 159.164.167.179:55124 149.139.16.235:9042 users:(("envoy",pid=81281,fd=50))
         skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d13) ts sack cubic wscale:9,7 rto:204 rtt:0.689/0.065 ato:40 mss:1448 pmtu:9000 rcvmss:610 advmss:8948 cwnd:10 bytes_sent:3639 bytes_retrans:229974096 bytes_acked:3640 bytes_received:18364 segs_out:319 segs_in:163 data_segs_out:159 data_segs_in:159 send 168.1Mbps lastsnd:16960 lastrcv:16960 lastack:16960 pacing_rate 336.2Mbps delivery_rate 72.4Mbps delivered:160 app_limited busy:84ms retrans:0/25813 rcv_rtt:1 rcv_space:62720 rcv_ssthresh:56588 minrtt:0.16

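See the manual for details: https://man7.org/linux/man-pages/man8/ss.8.html

Field descriptions

I am not a network expert. The descriptions below are what I have learned recently and may contain mistakes. Use them with care.

Recv-Q and Send-Q

  • When the socket is in the LISTEN state (e.g. ss -lnt)
    Recv-Q: current length of the accept queue, i.e. the number of TCP connections that have completed the three-way handshake and are waiting for the server to accept() them
    Send-Q: maximum length of the accept queue
  • When the socket is not in the LISTEN state (e.g. ss -nt)
    Recv-Q: bytes received but not yet read by the application
    Send-Q: bytes sent but not yet acknowledged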

Recv-Q

Established: The count of bytes not copied by the user program connected to this socket.

Listening: Since Kernel 2.6.18 this column contains the current syn backlog.

Send-Q

Established: The count of bytes not acknowledged by the remote host.

Listening: Since Kernel 2.6.18 this column contains the maximum size of the syn backlog.
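A minimal sketch of how the LISTEN-state meaning of Recv-Q and Send-Q could be used in practice: it scans the text of `ss -lnt` and reports listeners whose accept queue is at or above its limit. The function name and the sample output are hypothetical, and the column layout is assumed to match the header shown in the example above.

```python
def full_accept_queues(ss_lnt_output: str):
    """Return (local_address, recv_q, send_q) for listeners whose accept queue is full."""
    full = []
    for line in ss_lnt_output.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue  # skips the header line and non-listening sockets
        recv_q, send_q = int(fields[1]), int(fields[2])
        if send_q > 0 and recv_q >= send_q:
            full.append((fields[3], recv_q, send_q))
    return full


# Hypothetical sample, not captured from a real host.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      128    0.0.0.0:22         0.0.0.0:*
LISTEN 129    128    0.0.0.0:8080       0.0.0.0:*
"""

print(full_accept_queues(SAMPLE))
```

A listener that shows up here repeatedly is probably not calling accept() fast enough, so new connections may be dropped or delayed.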

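Basic information

  • ts whether the connection uses the TCP timestamp option.

    show string “ts” if the timestamp option is set

  • sack whether SACK is enabled on the connection.

    show string “sack” if the sack option is set

  • cubic name of the congestion control algorithm.

    congestion algorithm name

  • wscale:<snd_wscale>:<rcv_wscale> scale factors for the send and receive windows. The window field in the TCP header is only 16 bits wide (at most 65,535 bytes), a limit that dates from the original protocol design, when network and computer resources were scarce. On today's high-bandwidth networks a scale factor is needed to advertise larger windows.

    if window scale option is used, this field shows the send scale factor and receive scale factor.

  • rto the dynamically computed TCP retransmission timeout, in milliseconds.

    tcp re-transmission timeout value, the unit is millisecond.

  • rtt:<rtt>/<rttvar> RTT, the measured and estimated time for a packet to reach the peer and for the reply to come back. rtt is the average and rttvar is the mean deviation (not the median); both are in milliseconds.

    rtt is the average round trip time, rttvar is the mean deviation of rtt, their units are millisecond.

  • ato:<ato> delayed ACK timeout.

    ack timeout, unit is millisecond, used for delay ack mode.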
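To make the window scale factor concrete, here is a small sketch that computes the largest window a 16-bit window field can express after scaling, and extracts wscale, rto and rtt from an ss info line. The function names are hypothetical. The regular expressions assume the `wscale:9,7` and `rtt:0.689/0.065` layout seen in the sample output earlier in this article.

```python
import re


def max_window_bytes(scale: int) -> int:
    """Largest window expressible by a 16-bit window field shifted left by `scale`."""
    return 0xFFFF << scale


def parse_basic(info: str) -> dict:
    """Extract wscale, rto and rtt/rttvar from one ss info line."""
    out = {}
    m = re.search(r"\bwscale:(\d+),(\d+)", info)
    if m:
        out["snd_wscale"], out["rcv_wscale"] = int(m.group(1)), int(m.group(2))
    m = re.search(r"\brto:([\d.]+)", info)
    if m:
        out["rto_ms"] = float(m.group(1))
    # \b keeps this from matching rcv_rtt: or minrtt:
    m = re.search(r"\brtt:([\d.]+)/([\d.]+)", info)
    if m:
        out["rtt_ms"], out["rttvar_ms"] = float(m.group(1)), float(m.group(2))
    return out


line = "ts sack cubic wscale:9,7 rto:204 rtt:0.689/0.065 ato:40 rcv_rtt:1 minrtt:0.16"
basic = parse_basic(line)
print(basic)
print(max_window_bytes(basic["rcv_wscale"]))  # 65535 * 2**7
```

With a scale factor of 7 the advertised window can reach about 8 MB instead of 64 KB.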

Others:

    bytes_acked:<bytes_acked>
        bytes acked
    bytes_received:<bytes_received>
        bytes received
    segs_out:<segs_out>
        segments sent out
    segs_in:<segs_in>
        segments received
    send <send_bps>bps
        egress bps
    lastsnd:<lastsnd>
        how long time since the last packet sent, the unit is millisecond
    lastrcv:<lastrcv>
        how long time since the last packet received, the unit is millisecond
    lastack:<lastack>
        how long time since the last ack received, the unit is millisecond
    pacing_rate <pacing_rate>bps/<max_pacing_rate>bps
        the pacing rate and max pacing rate

Memory / TCP window / TCP buffer

    ESTAB 0 0 192.168.1.14:43674 192.168.1.17:1080 users:(("chrome",pid=3387,fd=66)) timer:(keepalive,27sec,0)
         skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d13) ts sack cubic wscale:7,7 rto:204 rtt:3.482/6.013 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:2317 bytes_acked:2318 bytes_received:2960 segs_out:36 segs_in:34 data_segs_out:8 data_segs_in:9 send 33268237bps lastsnd:200048 lastrcv:199596 lastack:17596 pacing_rate 66522144bps delivery_rate 31911840bps delivered:9 app_limited busy:48ms rcv_space:14480 rcv_ssthresh:64088 minrtt:0.408

skmem

https://man7.org/linux/man-pages/man8/ss.8.html

    skmem:(r<rmem_alloc>,rb<rcv_buf>,t<wmem_alloc>,tb<snd_buf>,
           f<fwd_alloc>,w<wmem_queued>,o<opt_mem>,
           bl<back_log>,d<sock_drop>)

    <rmem_alloc>
        the memory allocated for receiving packet
    <rcv_buf>
        the total memory can be allocated for receiving packet
    <wmem_alloc>
        the memory used for sending packet (which has been sent to layer 3)
    <snd_buf>
        the total memory can be allocated for sending packet
    <fwd_alloc>
        the memory allocated by the socket as cache, but not used for
        receiving/sending packet yet. If need memory to send/receive
        packet, the memory in this cache will be used before allocate
        additional memory.
    <wmem_queued>
        The memory allocated for sending packet (which has not been
        sent to layer 3)
    <opt_mem>
        The memory used for storing socket option, e.g., the key for
        TCP MD5 signature
    <back_log>
        The memory used for the sk backlog queue. On a process context,
        if the process is receiving packet, and a new packet is
        received, it will be put into the sk backlog queue, so it can
        be received by the process immediately
    <sock_drop>
        the number of packets dropped before they are de-multiplexed
        into the socket

https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency/#:~:text=is%20Linux%20autotuning.-,Linux%20autotuning,-Linux%20autotuning%20is

  • skmem_r

    is the actual amount of memory that is allocated, which includes not only user payload (Recv-Q) but also additional memory needed by Linux to process the packet (packet metadata). This is known within the kernel as sk_rmem_alloc.

    If the application consumes the data received by the kernel TCP layer promptly, this number stays close to 0.

    Note that there are other buffers associated with a socket, so skmem_r does not represent the total memory that a socket might have allocated.

  • skmem_rb

    is the maximum amount of memory that could be allocated by the socket for the receive buffer. This is higher than rcv_ssthresh to account for memory needed for packet processing that is not packet data. Autotuning can increase this value (up to tcp_rmem max) based on how fast the L7 application is able to read data from the socket and the RTT of the session. This is known within the kernel as sk_rcvbuf.
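As a rough illustration of the skmem layout described above, the sketch below turns the `skmem:(...)` group into a dictionary keyed by the man page names, so that values such as rmem_alloc and rcv_buf can be compared directly. The function name and key mapping are my own. The sample string is copied from the stopped-receiver example later in this article.

```python
import re

# Prefix used in the ss output -> field name used in the ss man page.
SKMEM_KEYS = {
    "r": "rmem_alloc", "rb": "rcv_buf", "t": "wmem_alloc", "tb": "snd_buf",
    "f": "fwd_alloc", "w": "wmem_queued", "o": "opt_mem",
    "bl": "back_log", "d": "sock_drop",
}


def parse_skmem(info: str) -> dict:
    """Parse the skmem:(...) group of an ss info line into named integers."""
    m = re.search(r"skmem:\(([^)]*)\)", info)
    if not m:
        return {}
    out = {}
    for item in m.group(1).split(","):
        km = re.fullmatch(r"([a-z]+)(\d+)", item)
        if km and km.group(1) in SKMEM_KEYS:
            out[SKMEM_KEYS[km.group(1)]] = int(km.group(2))
    return out


mem = parse_skmem("skmem:(r252544,rb250624,t0,tb87040,f1408,w0,o0,bl0,d6) ts sack cubic")
print(mem["rmem_alloc"], mem["rcv_buf"])
print(mem["rmem_alloc"] >= mem["rcv_buf"])  # receive memory has reached the buffer limit
```

When rmem_alloc stays close to rcv_buf, the application is probably not reading fast enough.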

rcv_space

    rcv_space:<rcv_space>
        a helper variable for TCP internal auto tuning socket receive buffer

https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency/#:~:text=is%20Linux%20autotuning.-,Linux%20autotuning,-Linux%20autotuning%20is

rcv_space is the high water mark of the rate of the local application reading from the receive buffer during any RTT. This is used internally within the kernel to adjust sk_rcvbuf.

http://darenmatthews.com/blog/?p=2106#:~:text=%E2%80%9D-,rcv_space,-is%20used%20in

rcv_space is used in TCP’s internal auto-tuning to grow socket buffers based on how much data the kernel estimates the sender can send. It will change over the life of any connection. It’s measured in bytes. You can see where the value is populated by reading the tcp_get_info() function in the kernel.

The value is not measuring the actual socket buffer size, which is what net.ipv4.tcp_rmem controls. You’d need to call getsockopt() within the application to check the buffer size. You can see current buffer usage with the Recv-Q and Send-Q fields of ss.
Note that if the buffer size is set with setsockopt(), the value returned with getsockopt() is always double the size requested to allow for overhead. This is described in man 7 socket.

rcv_ssthresh

https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency/#:~:text=is%20Linux%20autotuning.-,Linux%20autotuning,-Linux%20autotuning%20is

rcv_ssthresh is the window clamp, a.k.a. the maximum receive window size. This value is not known to the sender. The sender receives only the current window size, via the TCP header field. A closely-related field in the kernel, tp->window_clamp, is the maximum window size allowable based on the amount of available memory. rcv_ssthresh is the receiver-side slow-start threshold value.

The following example shows how the buffer sizes relate to the configuration:

    $ sudo sysctl -a | grep tcp
    net.ipv4.tcp_base_mss = 1024
    net.ipv4.tcp_keepalive_intvl = 75
    net.ipv4.tcp_keepalive_probes = 9
    net.ipv4.tcp_keepalive_time = 7200
    net.ipv4.tcp_max_syn_backlog = 4096
    net.ipv4.tcp_max_tw_buckets = 262144
    net.ipv4.tcp_mem = 766944 1022593 1533888    # unit: pages
    net.ipv4.tcp_moderate_rcvbuf = 1
    net.ipv4.tcp_retries1 = 3
    net.ipv4.tcp_retries2 = 15
    net.ipv4.tcp_rfc1337 = 0
    net.ipv4.tcp_rmem = 4096 131072 6291456
    net.ipv4.tcp_adv_win_scale = 1    # half of the receive buffer is used as the TCP window
    net.ipv4.tcp_syn_retries = 6
    net.ipv4.tcp_synack_retries = 5
    net.ipv4.tcp_timestamps = 1
    net.ipv4.tcp_window_scaling = 1
    net.ipv4.tcp_wmem = 4096 16384 4194304
    net.core.rmem_default = 212992
    net.core.rmem_max = 212992
    net.core.wmem_default = 212992
    net.core.wmem_max = 212992

    $ ss -taoipnm 'dst 100.225.237.27'
    ESTAB 0 0 192.168.1.14:57174 100.225.237.27:28101 users:(("ssh",pid=49183,fd=3)) timer:(keepalive,119min,0)
         skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:376 rtt:165.268/11.95 ato:40 mss:1440 pmtu:1500 rcvmss:1080 advmss:1448 cwnd:10 bytes_sent:5384 bytes_retrans:1440 bytes_acked:3945 bytes_received:3913 segs_out:24 segs_in:23 data_segs_out:12 data_segs_in:16 send 697050bps lastsnd:53864 lastrcv:53628 lastack:53704 pacing_rate 1394088bps delivery_rate 73144bps delivered:13 busy:1864ms retrans:0/1 dsack_dups:1 rcv_rtt:163 rcv_space:14480 rcv_ssthresh:64088 minrtt:157.486
    # Note: rb131072 = net.ipv4.tcp_rmem[1] = 131072

    ##### Stop the receiving application so that the kernel receive buffer fills up #####
    $ export PID=49183
    $ kill -STOP $PID
    $ ss -taoipnm 'dst 100.225.237.27'
    State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
    ESTAB 0 0 192.168.1.14:57174 100.225.237.27:28101 users:(("ssh",pid=49183,fd=3)) timer:(keepalive,115min,0)
         skmem:(r24448,rb131072,t0,tb87040,f4224,w0,o0,bl0,d4) ts sack cubic wscale:7,7 rto:376 rtt:174.381/20.448 ato:40 mss:1440 pmtu:1500 rcvmss:1440 advmss:1448 cwnd:10 bytes_sent:6456 bytes_retrans:1440 bytes_acked:5017 bytes_received:971285 segs_out:1152 segs_in:2519 data_segs_out:38 data_segs_in:2496 send 660622bps lastsnd:1456 lastrcv:296 lastack:24 pacing_rate 1321240bps delivery_rate 111296bps delivered:39 app_limited busy:6092ms retrans:0/1 dsack_dups:1 rcv_rtt:171.255 rcv_space:14876 rcv_ssthresh:64088 minrtt:157.126
    # Note: app_limited appears for the first time

    $ ss -taoipnm 'dst 100.225.237.27'
    State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
    ESTAB 67788 0 192.168.1.14:57174 100.225.237.27:28101 users:(("ssh",pid=49183,fd=3)) timer:(keepalive,115min,0)
         skmem:(r252544,rb250624,t0,tb87040,f1408,w0,o0,bl0,d6) ts sack cubic wscale:7,7 rto:376 rtt:173.666/18.175 ato:160 mss:1440 pmtu:1500 rcvmss:1440 advmss:1448 cwnd:10 bytes_sent:6600 bytes_retrans:1440 bytes_acked:5161 bytes_received:1292017 segs_out:1507 segs_in:3368 data_segs_out:42 data_segs_in:3340 send 663342bps lastsnd:9372 lastrcv:1636 lastack:1636 pacing_rate 1326680bps delivery_rate 111296bps delivered:43 app_limited busy:6784ms retrans:0/1 dsack_dups:1 rcv_rtt:169.162 rcv_space:14876 rcv_ssthresh:64088 minrtt:157.126
    # Note: r252544 and rb250624 are growing. Recv-Q = 67788: with the application
    # stopped, the unread data has filled the TCP window, so the window is about
    # 67788 bytes. With net.ipv4.tcp_adv_win_scale = 1, half of the receive buffer
    # is used as the TCP window, so the receive buffer is about 67788 * 2 = 135576 bytes.

    $ kill -CONT $PID
    $ ss -taoipnm 'dst 100.225.237.27'
    State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
    ESTAB 0 0 192.168.1.14:57174 100.225.237.27:28101 users:(("ssh",pid=49183,fd=3)) timer:(keepalive,105min,0)
         skmem:(r14720,rb6291456,t0,tb87040,f1664,w0,o0,bl0,d15) ts sack cubic wscale:7,7 rto:368 rtt:165.199/7.636 ato:40 mss:1440 pmtu:1500 rcvmss:1440 advmss:1448 cwnd:10 bytes_sent:7356 bytes_retrans:1440 bytes_acked:5917 bytes_received:2981085 segs_out:2571 segs_in:5573 data_segs_out:62 data_segs_in:5524 send 697341bps lastsnd:2024 lastrcv:280 lastack:68 pacing_rate 1394672bps delivery_rate 175992bps delivered:63 app_limited busy:9372ms retrans:0/1 dsack_dups:1 rcv_rtt:164.449 rcv_space:531360 rcv_ssthresh:1663344 minrtt:157.464
    # Note: rb6291456 = net.ipv4.tcp_rmem[2] = 6291456
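The relation between the receive buffer and the usable window in the example can be sketched as below. This mirrors my reading of the kernel helper tcp_win_from_space() as it exists in v5.4. Newer kernels compute this differently, so treat the formula as an assumption tied to that kernel generation. The Python function name is hypothetical.

```python
def win_from_space(space: int, adv_win_scale: int) -> int:
    """Bytes of a receive buffer usable as TCP window (v5.4-style formula)."""
    if adv_win_scale <= 0:
        # Zero or negative scale: window = space / 2^(-scale)
        return space >> (-adv_win_scale)
    # Positive scale: reserve space / 2^scale for per-packet overhead
    return space - (space >> adv_win_scale)


# Values from the sysctl output above: tcp_rmem = 4096 131072 6291456
print(win_from_space(131072, 1))   # default buffer, tcp_adv_win_scale = 1
print(win_from_space(6291456, 1))  # maximum buffer after autotuning
```

With tcp_adv_win_scale = 1 the default 131072-byte buffer yields a window of at most 65536 bytes, which agrees with rcv_ssthresh:64088 staying just below that value in the first samples.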

MTU/MSS

mss

The MSS that the connection currently uses to limit the size of the segments it sends (current effective sending MSS).

https://github.com/CumulusNetworks/iproute2/blob/6335c5ff67202cf5b39eb929e2a0a5bb133627ba/misc/ss.c#L2206

    s.mss = info->tcpi_snd_mss;

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp.c#L3258

    info->tcpi_snd_mss = tp->mss_cache;

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_output.c#L1576

    /*
       tp->mss_cache is current effective sending mss, including
       all tcp options except for SACKs. It is evaluated,
       taking into account current pmtu, but never exceeds
       tp->rx_opt.mss_clamp.
       ...
    */
    unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
    {
        ...
        tp->mss_cache = mss_now;
        return mss_now;
    }
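The kernel comment above says mss_cache is derived from the path MTU, is capped by the peer's MSS clamp, and accounts for TCP options other than SACK. A simplified sketch of that arithmetic, assuming IPv4 without IP options and the timestamp option as the only TCP option, looks like this. The function name and the peer_mss parameter are hypothetical stand-ins for tp->rx_opt.mss_clamp.

```python
IPV4_HEADER = 20           # bytes, no IP options
TCP_HEADER = 20            # bytes, fixed part only
TCP_TIMESTAMP_OPTION = 12  # 10 bytes of option plus 2 bytes of padding


def effective_mss(pmtu: int, peer_mss: int, timestamps: bool = True) -> int:
    """Approximate the sending MSS from the path MTU and the MSS the peer advertised."""
    mss = pmtu - IPV4_HEADER - TCP_HEADER
    mss = min(mss, peer_mss)          # never exceed the peer's clamp
    if timestamps:
        mss -= TCP_TIMESTAMP_OPTION   # options on every segment shrink the payload
    return mss


print(effective_mss(1500, 1460))  # typical Ethernet path with timestamps
print(effective_mss(9000, 1460))  # jumbo-frame path, peer advertises 1460
```

Both calls give 1448, which matches mss:1448 in the earlier ss samples with pmtu:1500 and with pmtu:9000.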

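advmss

The MSS option carried in the SYN (or SYN-ACK) segment that this host sent when the connection was established. Its purpose is to tell the peer, at connection setup, the largest segment this host can receive. (Advertised MSS by the host when the connection started, in the SYN packet.)

https://elixir.bootlin.com/linux/v5.4/source/include/linux/tcp.h#L217

pmtu

The path MTU to the peer, as found by Path MTU Discovery (Path MTU value).

A few things to note:

  • Linux caches the measured MTU for each peer IP in the route cache, so the same peer does not have to go through Path MTU Discovery again
  • Linux has two different Path MTU Discovery implementations
    • The traditional ICMP-based RFC 1191

      • However, many routers and NAT devices today do not handle ICMP correctly
    • Packetization Layer Path MTU Discovery (PLPMTUD, RFC 4821 and RFC 8899)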

https://github.com/shemminger/iproute2/blob/f8decf82af07591833f89004e9b72cc39c1b5c52/misc/ss.c#L3075

    s.pmtu = info->tcpi_pmtu;

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp.c#L3272

    info->tcpi_pmtu = icsk->icsk_pmtu_cookie;

https://elixir.bootlin.com/linux/v5.4/source/include/net/inet_connection_sock.h#L96

    /* @icsk_pmtu_cookie: Last pmtu seen by socket */
    struct inet_connection_sock {
        ...
        __u32 icsk_pmtu_cookie;

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_output.c#L1573

    unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
    {
        /* And store cached results */
        icsk->icsk_pmtu_cookie = pmtu;

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_input.c#L2587

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_ipv4.c#L362

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_timer.c#L161

rcvmss

Honestly, I have not fully understood this one. Some references:

MSS used for delayed ACK decisions.

https://elixir.bootlin.com/linux/v5.4/source/include/net/inet_connection_sock.h#L122

    __u16 rcv_mss; /* MSS used for delayed ACK decisions */

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_input.c#L502

    /* Initialize RCV_MSS value.
     * RCV_MSS is an our guess about MSS used by the peer.
     * We haven't any direct information about the MSS.
     * It's better to underestimate the RCV_MSS rather than overestimate.
     * Overestimations make us ACKing less frequently than needed.
     * Underestimations are more easy to detect and fix by tcp_measure_rcv_mss().
     */
    void tcp_initialize_rcv_mss(struct sock *sk)
    {
        const struct tcp_sock *tp = tcp_sk(sk);
        unsigned int hint = min_t(unsigned int, tp->advmss, tp->mss_cache);

        hint = min(hint, tp->rcv_wnd / 2);
        hint = min(hint, TCP_MSS_DEFAULT);
        hint = max(hint, TCP_MIN_MSS);

        inet_csk(sk)->icsk_ack.rcv_mss = hint;
    }
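The initialization above is easy to re-run by hand. The sketch below mirrors it in Python, assuming TCP_MSS_DEFAULT is 536 and TCP_MIN_MSS is 88 as defined in the kernel headers. The function name is hypothetical. It covers only the initial guess: as the kernel comment says, tcp_measure_rcv_mss() later adjusts the value from the segments actually received.

```python
TCP_MSS_DEFAULT = 536  # assumed value of the kernel constant
TCP_MIN_MSS = 88       # assumed value of the kernel constant


def initial_rcv_mss(advmss: int, mss_cache: int, rcv_wnd: int) -> int:
    """Initial guess of the peer's MSS, deliberately biased toward underestimation."""
    hint = min(advmss, mss_cache)
    hint = min(hint, rcv_wnd // 2)
    hint = min(hint, TCP_MSS_DEFAULT)
    hint = max(hint, TCP_MIN_MSS)
    return hint


print(initial_rcv_mss(advmss=1448, mss_cache=1448, rcv_wnd=64240))
print(initial_rcv_mss(advmss=1448, mss_cache=1448, rcv_wnd=100))
```

So a fresh connection starts with rcvmss at 536 or below, and the larger values seen in ss (such as rcvmss:1440) come from the later measurement.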

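Congestion control

cwnd

cwnd: congestion window size, counted in segments.

https://en.wikipedia.org/wiki/TCP_congestion_control#:~:text=set%20to%20a-,small%20multiple,-of%20the%20maximum

Congestion window size in bytes = cwnd * mss.

ssthresh

After the local TCP stack detects network congestion, it shrinks the congestion window (down to the minimum after a retransmission timeout) and then grows it quickly in slow start until it is back at ssthresh * mss bytes. Beyond that point growth slows down to congestion avoidance.

    ssthresh:<ssthresh>
        tcp congestion window slow start threshold

For how ssthresh is computed, see:

https://witestlab.poly.edu/blog/tcp-congestion-control-basics/#:~:text=Overview%20of%20TCP%20phases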
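The cwnd * mss relation also explains the `send` rate that ss prints: one congestion window of data per round trip. A minimal sketch under that assumption follows. The function names are hypothetical, and the inputs are taken from the two sample connections earlier in this article.

```python
def cwnd_bytes(cwnd: int, mss: int) -> int:
    """Congestion window expressed in bytes."""
    return cwnd * mss


def send_rate_bps(cwnd: int, mss: int, rtt_ms: float) -> float:
    """Upper bound on the sending rate: one congestion window per RTT, in bits per second."""
    return cwnd_bytes(cwnd, mss) * 8 / (rtt_ms / 1000.0)


print(cwnd_bytes(10, 1448))            # bytes in flight allowed by cwnd:10 mss:1448
print(send_rate_bps(10, 1448, 0.689))  # compare with "send 168.1Mbps"
print(send_rate_bps(10, 1448, 3.482))  # compare with "send 33268237bps"
```

Both results agree with the `send` values in the samples, which suggests that `send` is a derived estimate rather than a measured throughput.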

Retransmission

retrans

TCP retransmission statistics, in the format:

segments retransmitted and not yet acknowledged / total segments retransmitted over the whole connection.

https://unix.stackexchange.com/questions/542712/detailed-output-of-ss-command

(Retransmitted packets out) / (Total retransmits for entire connection)

add more TCP_INFO components

retrans:X/Y

    X: number of outstanding retransmit packets
    Y: total number of retransmits for the session

  • s.retrans_total

https://github.com/shemminger/iproute2/blob/f8decf82af07591833f89004e9b72cc39c1b5c52/misc/ss.c#L3068

    s.retrans_total = info->tcpi_total_retrans;

https://elixir.bootlin.com/linux/v5.19/source/include/uapi/linux/tcp.h#L232

    struct tcp_info {
        __u32 tcpi_retrans;
        __u32 tcpi_total_retrans;

https://elixir.bootlin.com/linux/v5.19/source/net/ipv4/tcp.c#L3791

    info->tcpi_total_retrans = tp->total_retrans;

https://elixir.bootlin.com/linux/v5.19/source/include/linux/tcp.h#L347

    struct tcp_sock {
        u32 total_retrans; /* Total retransmits for entire connection */

  • s.retrans

Segments retransmitted and not yet acknowledged.

https://github.com/shemminger/iproute2/blob/f8decf82af07591833f89004e9b72cc39c1b5c52/misc/ss.c#L3068

    s.retrans = info->tcpi_retrans;

https://elixir.bootlin.com/linux/v5.19/source/net/ipv4/tcp.c#L3774

    info->tcpi_retrans = tp->retrans_out;

https://elixir.bootlin.com/linux/v5.19/source/include/linux/tcp.h#L266

    struct tcp_sock {
        u32 retrans_out; /* Retransmitted packets out */

bytes_retrans

Total data bytes retransmitted.
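One way to turn these counters into a health signal is to compute the share of sent bytes that were retransmissions. The sketch below parses `retrans:X/Y`, `bytes_retrans` and `bytes_sent` from an ss info line. The function names are hypothetical, and I am assuming that bytes_sent includes retransmitted bytes, which is how I read the kernel's tcp_info comments.

```python
import re


def parse_retrans(info: str):
    """Return (outstanding, total) from retrans:X/Y, or None if the field is absent."""
    # \b prevents a match inside bytes_retrans:
    m = re.search(r"\bretrans:(\d+)/(\d+)", info)
    return (int(m.group(1)), int(m.group(2))) if m else None


def retrans_byte_ratio(info: str):
    """Fraction of sent bytes that were retransmissions, or None if not computable."""
    sent = re.search(r"\bbytes_sent:(\d+)", info)
    retrans = re.search(r"\bbytes_retrans:(\d+)", info)
    if not sent or not retrans or int(sent.group(1)) == 0:
        return None
    return int(retrans.group(1)) / int(sent.group(1))


line = "cwnd:10 bytes_sent:5384 bytes_retrans:1440 bytes_acked:3945 retrans:0/1 dsack_dups:1"
print(parse_retrans(line))
print(round(retrans_byte_ratio(line), 3))
```

ss omits these fields when their value is zero, so a missing field is reported as None rather than as an error.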

Timers

People who are new to TCP implementations may find it hard to imagine that, besides being driven by input and output events, TCP is also driven by many timers. ss can show these timers (the -o option).

    Show timer information. For TCP protocol, the output format is:

    timer:(<timer_name>,<expire_time>,<retrans>)

    <timer_name>
        the name of the timer, there are five kind of timer names:
        on : means one of these timers: TCP retrans timer, TCP early
             retrans timer and tail loss probe timer
        keepalive: tcp keep alive timer
        timewait: timewait stage timer
        persist: zero window probe timer
        unknown: none of the above timers
    <expire_time>
        how long time the timer will expire
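A small sketch of reading the timer field programmatically, assuming the `timer:(<timer_name>,<expire_time>,<retrans>)` layout quoted above. The function name is hypothetical. The expire time is kept as the string ss prints (for example `27sec` or `119min`) because I have not verified every unit ss may use.

```python
import re


def parse_timer(info: str):
    """Return a dict with the timer name, expire time string and retrans count, or None."""
    m = re.search(r"timer:\(([^,()]+),([^,()]+),(\d+)\)", info)
    if not m:
        return None
    return {"name": m.group(1), "expires": m.group(2), "retrans": int(m.group(3))}


sample = 'users:(("chrome",pid=3387,fd=66)) timer:(keepalive,27sec,0)'
print(parse_timer(sample))
```

A connection that repeatedly shows the `on` or `persist` timer with a growing retrans count is worth a closer look.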

Other

app_limited

https://unix.stackexchange.com/questions/542712/detailed-output-of-ss-command

limit TCP flows with application-limiting in request or responses. My understanding is that this is a boolean: if ss shows the app_limited flag, the application is not using all of the sending bandwidth TCP could provide, i.e. the connection still has headroom to send more.

    tcpi_delivery_rate: The most recent goodput, as measured by
    tcp_rate_gen(). If the socket is limited by the sending
    application (e.g., no data to send), it reports the highest
    measurement instead of the most recent. The unit is bytes per
    second (like other rate fields in tcp_info).

    tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
    was measured when the socket's throughput was limited by the
    sending application.

https://github.com/shemminger/iproute2/blob/f8decf82af07591833f89004e9b72cc39c1b5c52/misc/ss.c#L3138

    s.app_limited = info->tcpi_delivery_rate_app_limited;

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_rate.c#L182

    /* If a gap is detected between sends, mark the socket application-limited. */
    void tcp_rate_check_app_limited(struct sock *sk)
    {
        struct tcp_sock *tp = tcp_sk(sk);

        if (/* We have less than one packet to send. */
            tp->write_seq - tp->snd_nxt < tp->mss_cache &&
            /* Nothing in sending host's qdisc queues or NIC tx queue. */
            sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1) &&
            /* We are not limited by CWND. */
            tcp_packets_in_flight(tp) < tp->snd_cwnd &&
            /* All lost packets have been retransmitted. */
            tp->lost_out <= tp->retrans_out)
            tp->app_limited =
                (tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
    }
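The four conditions in tcp_rate_check_app_limited() can be restated as a plain predicate, which makes it easier to reason about when the flag is set. This is only an illustrative model: the parameter names are mine, sequence-number wraparound is ignored, and min_skb_truesize stands in for SKB_TRUESIZE(1), whose real value depends on the kernel build (the 576 used below is a made-up placeholder).

```python
def is_app_limited(unsent_bytes: int, mss_cache: int,
                   wmem_alloc: int, min_skb_truesize: int,
                   packets_in_flight: int, snd_cwnd: int,
                   lost_out: int, retrans_out: int) -> bool:
    """True when nothing but the application is holding the sender back."""
    return (unsent_bytes < mss_cache             # less than one packet left to send
            and wmem_alloc < min_skb_truesize    # nothing queued in qdisc or NIC
            and packets_in_flight < snd_cwnd     # not limited by the congestion window
            and lost_out <= retrans_out)         # all lost packets already retransmitted


idle = dict(unsent_bytes=0, mss_cache=1448, wmem_alloc=0, min_skb_truesize=576,
            packets_in_flight=0, snd_cwnd=10, lost_out=0, retrans_out=0)
print(is_app_limited(**idle))
print(is_app_limited(**{**idle, "unsent_bytes": 50000}))
```

The first call models an idle interactive session, the second a sender with plenty of queued data, which is therefore not application-limited.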

Special operations

Specifying a network namespace

Specify the network namespace file that ss should use, e.g. ss -N /proc/322/ns/net

    -N NSNAME, --net=NSNAME
        Switch to the specified network namespace name.

Killing a socket

Forcibly close TCP connections.

    -K, --kill
        Attempts to forcibly close sockets. This option displays
        sockets that are successfully closed and silently skips
        sockets that the kernel does not support closing. It
        supports IPv4 and IPv6 sockets only.

    sudo ss -K 'dport 22'

Watching connection-close events

    ss -ta -E
    State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
    UNCONN 0 0 10.0.2.15:40612 172.67.141.218:http

Filters

For example:

    ss -apu state unconnected 'sport = :1812'

Monitoring examples

Non-container example:

    # non-container version
    export JMETER_PID=38991          # PLEASE UPDATE
    export SS_FILTER="dst 1.1.1.1"   # PLEASE UPDATE, e.g. IP of the gateway to k8s
    export CAPTURE_SECONDS=60000     # capture for 1000 minutes
    sudo bash -c "
    end=\$((SECONDS+$CAPTURE_SECONDS))
    while [ \$SECONDS -lt \$end ]; do
        echo \$SECONDS/\$end
        # match the pid= field so that other numbers in the output do not match
        ss -taoipnm \"${SS_FILTER}\" | grep -A1 \"pid=$JMETER_PID,\"
        sleep 2
        date
    done
    " | tee /tmp/tcp_conn_info_${JMETER_PID}

Container example:

    export SVC_PID=12345             # PLEASE UPDATE: PID of any process in the target pod (placeholder value)
    export ENVOY_PID=$(sudo pgrep --ns $SVC_PID --nslist net envoy)
    export SS_FILTER="dst 1.1.1.1 or dst 2.2.2.2"   # PLEASE UPDATE, e.g. IP of the Oracle/Cassandra/Kafka/Redis
    export POD_NAME=$(sudo nsenter -t $ENVOY_PID -n -u -- hostname)
    ## capture connection info for 10 minutes
    export CAPTURE_SECONDS=600       # capture for 10 min
    sudo nsenter -t $ENVOY_PID -n -u -- bash -c "
    end=\$((SECONDS+$CAPTURE_SECONDS))
    while [ \$SECONDS -lt \$end ]; do
        echo \$SECONDS/\$end
        ss -taoipnm \"${SS_FILTER}\" | grep -A1 \"pid=$ENVOY_PID,\"
        sleep 1
        date
    done
    " | tee /tmp/tcp_conn_info_${POD_NAME}
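Once the capture file exists, the info lines have to be broken into fields before any metric can be tracked over time. The sketch below is one possible tokenizer, not part of the original scripts. The function name is hypothetical, and it assumes the token shapes seen in the samples in this article: `key:value` pairs, bare flags such as `ts` or `app_limited`, and the space-separated rates `send`, `pacing_rate` and `delivery_rate`.

```python
RATE_KEYS = ("send", "pacing_rate", "delivery_rate")  # printed as "<key> <value>"


def parse_ss_info(line: str):
    """Split one ss info line into (fields, flags)."""
    fields, flags = {}, set()
    tokens = line.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        key, sep, value = tok.partition(":")
        if sep:
            fields[key] = value                # e.g. rtt:3.482/6.013
        elif tok in RATE_KEYS and i + 1 < len(tokens):
            fields[tok] = tokens[i + 1]        # e.g. send 33268237bps
            i += 1
        else:
            flags.add(tok)                     # e.g. ts, sack, cubic, app_limited
        i += 1
    return fields, flags


line = ("ts sack cubic wscale:7,7 rto:204 rtt:3.482/6.013 mss:1448 cwnd:10 "
        "send 33268237bps pacing_rate 66522144bps delivered:9 app_limited retrans:0/1")
fields, flags = parse_ss_info(line)
print(fields["rtt"], fields["send"], sorted(flags))
```

Feeding each captured info line through this function gives one record per sample, which can then be written to CSV or plotted.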

How it works

Netlink

https://events.static.linuxfound.org/sites/events/files/slides/Exploration%20of%20Linux%20Container%20Network%20Monitoring%20and%20Visualization.pdf

https://man7.org/linux/man-pages/man7/netlink.7.html

    socket(AF_NETLINK, SOCK_RAW, NETLINK_INET_DIAG);
    /**
    NETLINK_SOCK_DIAG (since Linux 3.3)
        Query information about sockets of various protocol
        families from the kernel (see sock_diag(7)).
    **/

  • Fetch information about sockets - Used by ss (“another utility to investigate sockets”)

NETLINK_INET_DIAG

https://man7.org/linux/man-pages/man7/sock_diag.7.ht

idiag_ext

This is where ss gets its data, so it serves as documentation from another angle.

https://man7.org/linux/man-pages/man7/sock_diag.7.html#:~:text=or%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20IPPROTO_UDPLITE.-,idiag_ext,-This%20is%20a

    The fields of struct inet_diag_req_v2 are as follows:

    idiag_ext
        This is a set of flags defining what kind of extended
        information to report. Each requested kind of information
        is reported back as a netlink attribute as described below:

        INET_DIAG_TOS
            The payload associated with this attribute is a
            __u8 value which is the TOS of the socket.

        INET_DIAG_TCLASS
            The payload associated with this attribute is a
            __u8 value which is the TClass of the socket. IPv6
            sockets only. For LISTEN and CLOSE sockets, this
            is followed by INET_DIAG_SKV6ONLY attribute with
            associated __u8 payload value meaning whether the
            socket is IPv6-only or not.

        INET_DIAG_MEMINFO
            The payload associated with this attribute is
            represented in the following structure:

                struct inet_diag_meminfo {
                    __u32 idiag_rmem;
                    __u32 idiag_wmem;
                    __u32 idiag_fmem;
                    __u32 idiag_tmem;
                };

            The fields of this structure are as follows:

            idiag_rmem
                The amount of data in the receive queue.
            idiag_wmem
                The amount of data that is queued by TCP but
                not yet sent.
            idiag_fmem
                The amount of memory scheduled for future
                use (TCP only).
            idiag_tmem
                The amount of data in send queue.

        INET_DIAG_SKMEMINFO
            The payload associated with this attribute is an
            array of __u32 values described below in the
            subsection "Socket memory information".

        INET_DIAG_INFO
            The payload associated with this attribute is
            specific to the address family. For TCP sockets,
            it is an object of type struct tcp_info.

        INET_DIAG_CONG
            The payload associated with this attribute is a
            string that describes the congestion control
            algorithm used. For TCP sockets only.

    idiag_timer
        For TCP sockets, this field describes the type of timer
        that is currently active for the socket. It is set to one
        of the following constants:

            0 no timer is active
            1 a retransmit timer
            2 a keep-alive timer
            3 a TIME_WAIT timer
            4 a zero window probe timer

        For non-TCP sockets, this field is set to 0.

    idiag_retrans
        For idiag_timer values 1, 2, and 4, this field contains
        the number of retransmits. For other idiag_timer values,
        this field is set to 0.

    idiag_expires
        For TCP sockets that have an active timer, this field
        describes its expiration time in milliseconds. For other
        sockets, this field is set to 0.

    idiag_rqueue
        For listening sockets: the number of pending connections.
        For other sockets: the amount of data in the incoming queue.

    idiag_wqueue
        For listening sockets: the backlog length.
        For other sockets: the amount of memory available for sending.

    idiag_uid
        This is the socket owner UID.

    idiag_inode
        This is the socket inode number.

    Socket memory information
        The payload associated with UNIX_DIAG_MEMINFO and
        INET_DIAG_SKMEMINFO netlink attributes is an array of the
        following __u32 values:

        SK_MEMINFO_RMEM_ALLOC
            The amount of data in receive queue.
        SK_MEMINFO_RCVBUF
            The receive socket buffer as set by SO_RCVBUF.
        SK_MEMINFO_WMEM_ALLOC
            The amount of data in send queue.
        SK_MEMINFO_SNDBUF
            The send socket buffer as set by SO_SNDBUF.
        SK_MEMINFO_FWD_ALLOC
            The amount of memory scheduled for future use (TCP only).
        SK_MEMINFO_WMEM_QUEUED
            The amount of data queued by TCP, but not yet sent.
        SK_MEMINFO_OPTMEM
            The amount of memory allocated for the socket's service
            needs (e.g., socket filter).
        SK_MEMINFO_BACKLOG
            The amount of packets in the backlog (not yet processed).

Pay particular attention to INET_DIAG_INFO above and this part of its description:

For TCP sockets, it is an object of type struct tcp_info
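To show how idiag_ext selects the extended attributes, the sketch below builds (but does not send) a sock_diag dump request that asks for tcp_info and the skmem array of established IPv4 TCP sockets. The constants are copied from my reading of the Linux UAPI headers, and the struct layouts follow sock_diag(7). The function name is hypothetical. Actually sending the message needs a Linux AF_NETLINK socket opened with the NETLINK_SOCK_DIAG protocol and a parser for the reply, both of which are left out here.

```python
import struct

# Values assumed from the Linux UAPI headers.
SOCK_DIAG_BY_FAMILY = 20
NLM_F_REQUEST = 0x01
NLM_F_DUMP = 0x300
AF_INET = 2
IPPROTO_TCP = 6
INET_DIAG_INFO = 2        # reply carries struct tcp_info
INET_DIAG_SKMEMINFO = 7   # reply carries the skmem array
TCP_ESTABLISHED = 1


def build_inet_diag_request(ext_attrs, tcp_states, seq=1) -> bytes:
    """Build nlmsghdr + struct inet_diag_req_v2 for a socket dump."""
    idiag_ext = 0
    for attr in ext_attrs:
        idiag_ext |= 1 << (attr - 1)          # attribute N is requested with bit N-1
    idiag_states = 0
    for state in tcp_states:
        idiag_states |= 1 << state            # bitmask of TCP states to include
    # struct inet_diag_sockid: ports and addresses (network order), then
    # interface and cookie (host order). All zero because this is a dump.
    sockid = struct.pack("!HH4I4I", 0, 0, *([0] * 8)) + struct.pack("=I2I", 0, 0, 0)
    req = struct.pack("=BBBxI", AF_INET, IPPROTO_TCP, idiag_ext, idiag_states) + sockid
    header = struct.pack("=IHHII", 16 + len(req), SOCK_DIAG_BY_FAMILY,
                         NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return header + req


message = build_inet_diag_request([INET_DIAG_INFO, INET_DIAG_SKMEMINFO], [TCP_ESTABLISHED])
print(len(message), message[18])  # total length and the idiag_ext byte
```

The idiag_ext byte is the only thing that changes when more attributes are requested, for example adding INET_DIAG_CONG to also get the congestion control algorithm name.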

Netlink in depth

https://wiki.linuxfoundation.org/networking/generic_netlink_howto

https://medium.com/thg-tech-blog/on-linux-netlink-d7af1987f89d

References

https://djangocas.dev/blog/huge-improve-network-performance-by-change-tcp-congestion-control-to-bbr/

https://man7.org/linux/man-pages/man8/ss.8.html
