Intel 82599网卡异常挂死原因
前提背景:
生产环境上,服务器网络突然断链,ssh连接失败。
问题初步定位:
查找内核日志,得到网卡异常信息
Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 14 not cleared within the polling period
Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 15 not cleared within the polling period
Jan 24 11:52:43 localhost kernel: bonding: bond5: link status definitely down for interface eth0, disabling it
Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: detected SFP+: 5
Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 24 11:52:43 localhost kernel: bond5: link status definitely up for interface eth0, 10000 Mbps full duplex.
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Detected Tx Unit Hang
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: tx_buffer_info[next_to_clean]
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: tx hang 448 detected on queue 6, resetting adapter
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Reset adapter
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period
Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period
网卡PCI信息:
# lspci -vvv -s 84:00.0
84:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 16
Region 0: Memory at f7e20000 (64-bit, non-prefetchable) [disabled] [size=128K]
Region 2: I/O ports at f020 [disabled] [size=32]
Region 4: Memory at f7e44000 (64-bit, non-prefetchable) [disabled] [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] MSI-X: Enable- Count=64 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00002000
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #4, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <2us, L1 <32us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB
Capabilities: [e0] Vital Product Data
Unknown small resource type 06, will not decode more.
Capabilities: [100] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [140] Device Serial Number 98-f5-37-ff-ff-e3-64-73
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
VF offset: 384, stride: 2, Device ID: 10ed
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
Region 3: Memory at 0000000000000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Kernel driver in use: ixgbe
Kernel modules: ixgbe
网卡寄存器信息:
# ethtool -d eth0
0x042A4: LINKS (Link Status register) 0xFFFFFFFF
Link Status: up
Link Speed: 10G
0x05080: FCTRL (Filter Control register) 0xFFFFFFFF
Receive Flow Control Packets: enabled
Receive Priority Flow Control Packets: enabled
Discard Pause Frames: enabled
Pass MAC Control Frames: enabled
Broadcast Accept: enabled
Unicast Promiscuous: enabled
Multicast Promiscuous: enabled
Store Bad Packets: enabled
0x05088: VLNCTRL (VLAN Control register) 0xFFFFFFFF
VLAN Mode: enabled
VLAN Filter: enabled
0x02100: SRRCTL0 (Split and Replic Rx Control 0) 0xFFFFFFFF
Receive Buffer Size: 16KB
0x03D00: RMCS (Receive Music Control register) 0xFFFFFFFF
Transmit Flow Control: enabled
Priority Flow Control: enabled
0x04250: HLREG0 (Highlander Control 0 register) 0xFFFFFFFF
Transmit CRC: enabled
Receive CRC Strip: enabled
Jumbo Frames: enabled
Pad Short Frames: enabled
Loopback: enabled
0x00000: CTRL (Device Control) 0xFFFFFFFF
0x00008: STATUS (Device Status) 0xFFFFFFFF
0x00018: CTRL_EXT (Extended Device Control) 0xFFFFFFFF
0x00020: ESDP (Extended SDP Control) 0xFFFFFFFF
0x00028: EODSDP (Extended OD SDP Control) 0xFFFFFFFF
0x00200: LEDCTL (LED Control) 0xFFFFFFFF
........
0x01010: RDH00 (Receive Descriptor Head 00) 0xFFFFFFFF
0x01050: RDH01 (Receive Descriptor Head 01) 0xFFFFFFFF
0x01090: RDH02 (Receive Descriptor Head 02) 0xFFFFFFFF
0x010D0: RDH03 (Receive Descriptor Head 03) 0xFFFFFFFF
0x01110: RDH04 (Receive Descriptor Head 04) 0xFFFFFFFF
..........
0x01028: RXDCTL00 (Receive Descriptor Control 00) 0xFFFFFFFF
0x01068: RXDCTL01 (Receive Descriptor Control 01) 0xFFFFFFFF
0x010A8: RXDCTL02 (Receive Descriptor Control 02) 0xFFFFFFFF
........
0x06010: TDH00 (Transmit Descriptor Head 00) 0xFFFFFFFF
0x06050: TDH01 (Transmit Descriptor Head 01) 0xFFFFFFFF
0x06090: TDH02 (Transmit Descriptor Head 02) 0xFFFFFFFF
0x060D0: TDH03 (Transmit Descriptor Head 03) 0xFFFFFFFF
0x06110: TDH04 (Transmit Descriptor Head 04) 0xFFFFFFFF
0x06150: TDH05 (Transmit Descriptor Head 05) 0xFFFFFFFF
问题可能原因:
Bar0地址看起来没有问题,但寄存器全是0xffffffff了; 82599寄存器开始是正常的, 跑了一段时间(10小时)就 变成FFFF了;
可能pcie 接口接触问题。
Intel 82599网卡异常挂死原因的更多相关文章
- intel 82599网卡(ixgbe系列)术语表
Intel® 82599 10 GbE Controller Datasheet 15.0 Glossary and Acronyms 术语表 缩写 英文解释 中文解释 1 KB A value of ...
- 用strace处理程序异常挂死情况
1. 环境: ubuntu 系统 + strace + vim 2.编写挂死程序:(参考博客) #include <stdio.h> #include <sys/types.h> ...
- 关于PF_RING/Intel 82599/透明VPN的一些事
接近崩溃的边缘,今天这篇文章构思地点在医院,小小又生病了,宁可吊瓶不吃药,带了笔记本却无法上网,我什么都不能干,想了解一些东西,只能用3G,不敢 开热点,因为没人给我报销流量,本周末我只有一天时间,因 ...
- 大约PF_RING/Intel 82599/透明VPN一些事
接近崩溃的边缘,如今,在医院这篇文章地方的想法,小病,我宁愿不吃药瓶.一台笔记本电脑,但无法上网,我不称职.想知道的东西.唯一可用3G,不开的热点.由于没人给我报销流程.这个周末,我只有一天,由于下雨 ...
- 记一次 .NET 某上市工业智造 CPU+内存+挂死 三高分析
一:背景 1. 讲故事 上个月有位朋友加wx告知他的程序有挂死现象,询问如何进一步分析,截图如下: 看这位朋友还是有一定的分析基础,可能玩的少,缺乏一定的分析经验,当我简单分析之后,我发现这个dump ...
- 应用程序出现挂死,.NET Runtime at IP 791F7E06 (79140000) with exit code 80131506.
工具出现挂死问题 1.问题描述 工具出现挂死问题,巡检IIS发现以下异常日志 现网系统日志: 事件类型: 错误 事件来源: .NET Runtime 描述: Application: Di ...
- I2C 挂死,SDA一直为低问题分析【转】
转自:https://blog.csdn.net/winitz/article/details/72460775 版权声明:本文为博主原创文章,未经博主允许不得转载. https://blog.csd ...
- IIC挂死问题解决过程
0.环境:arm CPU 带有IIC控制器作为slave端,带有调试串口. 1.bug表现:IIC slave 在系统启动后概率挂死,导致master无法detect到slave. 猜测1:认为IIC ...
- MySQL 连接为什么挂死了?
摘要:本次分享的是一次关于 MySQL 高可用问题的定位过程,其中曲折颇多但问题本身却比较有些代表性,遂将其记录以供参考. 一.背景 近期由测试反馈的问题有点多,其中关于系统可靠性测试提出的问题令人感 ...
随机推荐
- Unity 3D中不得不说的yield协程与消息传递
1. 协程 在Unity 3D中,我们刚开始写脚本的时候肯定会遇到类似下面这样的需求:每隔3秒发射一个烟花.怪物死亡后20秒再复活之类的.刚开始的时候喜欢把这些东西都塞到Update里面去,就像下面这 ...
- java 身份证工具类
package com.app.wx.common.util; import org.apache.commons.lang3.StringUtils; import java.text.ParseE ...
- 第四节 Python基础之数据类型(集合)
在学习本节之前,我们先对数据类型做一个补充,也就是数据类型的分类: 按照可变和不可变来分: 可变:列表,字典 不可变:数字,字符串,元组 按照访问顺序来分: 顺序访问:字符串,列表,元组 映射的方式访 ...
- 使用Let's Encrypt搭建永久免费的HTTPS服务
1.概述1.1 HTTPS概述HTTPS即HTTP + TLS,TLS 是传输层加密协议,它的前身是 SSL 协议.我们知道HTTP协议是基于TCP的.简而言之HTTPS就是在TCP的基础上套一层TL ...
- Windows10上桌面共享
Windows自带的桌面共享软件 命令行输入: Msra.exe
- [蓝桥杯]PREV-22.历届试题_国王的烦恼
问题描述 C国由n个小岛组成,为了方便小岛之间联络,C国在小岛间建立了m座大桥,每座大桥连接两座小岛.两个小岛间可能存在多座桥连接.然而,由于海水冲刷,有一些大桥面临着不能使用的危险. 如果两个小岛间 ...
- LAB1 partV
partV 创建文档反向索引.word -> document 与 前面做的 单词统计类似,这个是单词与文档位置的映射关系. mapF 文档解析相同,返回信息不同而已. reduceF 返回归约 ...
- cmake中添加-fPIC编译选项方法
合并openjpeg/soxr/vidstab/snappy等多个cmake库时,为了解决下述问题: relocation R_X86_64_32 against `.text' can not be ...
- Docker安装使用battery historian
apt-get insatll docker.io battery historian ubuntu下使用 首先要确保是google浏览器,然后用命令行 google-chrome --proxy-s ...
- 使用docker加载已有镜像安装Hyperledger Fabric v1.1.0
背景 每次在新的服务器上安装Hyperledger Fabric网络时,通过fabric官方提供的脚本安装时,需要从网络上down下近10G的fabric相关镜像,这个过程是漫长及痛苦的,有时因网络问 ...