Greetings Intel Community,
I have several systems that panic, and so far the cause is unknown. A brief overview of the configuration:
OS: SLES 11 SP4
ixgbe driver (inbox): 3.19.1-k
Adapter: Each server has two dual-port X520-SR2 networking adapters installed (4x10G ports total, 2 physical adapters)
Configuration: all 4 interfaces are under bond0 in mode=802.3ad (LACP), connected to a managed switch with the matching configuration
Note #1: We've also seen the crashes on SLES 11 SP3 and on RHEL systems, with different kernels and different ixgbe driver versions. The configuration above is just the one currently experiencing the issue.
Note #2: Load and type of traffic don't seem to matter. The system can crash with or without load, and a crash has happened with just an SSH session open.
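In case it helps anyone compare setups, here is a minimal Python sketch for confirming the bond mode and per-port link state from /proc/net/bonding/bond0, the status file exposed by the Linux bonding driver. The function name parse_bonding is my own, and the parser only looks at the "Bonding Mode", "Slave Interface" and "MII Status" lines, so treat it as a rough check rather than a complete parser.

```python
import os


def parse_bonding(text: str) -> dict:
    """Extract the bond mode and each slave's MII status from bonding status text.

    parse_bonding is a hypothetical helper name, not part of any library.
    """
    info = {"mode": None, "slaves": {}}
    current = None  # name of the slave section currently being read
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and lines without a "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            info["mode"] = value
        elif key == "Slave Interface":
            current = value
            info["slaves"][current] = None
        elif key == "MII Status" and current is not None:
            # The bond-level "MII Status" line comes before any slave and is ignored.
            info["slaves"][current] = value
    return info


if __name__ == "__main__":
    path = "/proc/net/bonding/bond0"  # bond0 matches the configuration described above
    if os.path.exists(path):
        with open(path) as handle:
            print(parse_bonding(handle.read()))
    else:
        print(f"{path} not found (bonding driver not loaded or bond0 not configured)")
```

On a healthy setup I would expect the mode to mention 802.3ad and all four ports to be listed.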
Based on some crash dump information, it seems that the adapters have issues with "IOH timeouts", and we also see "CATERRs":
0x000A:IOH: D3_IOBAS_IOLIM_SSTS @ 0x00000181C = 0x0000000040006060
0x000A:IOH: D3_IOBAS_IOLIM_SSTS:SIGSYSTEMERROR <30> = 0x1
0x000A:IOH: D3_IOBAS_IOLIM_SSTS:IOBASEADDRLIMIT <15:12> = 0x6
0x000A:IOH: D3_IOBAS_IOLIM_SSTS:IOBASEADDR <7:4> = 0x6
0x0024 r002i23b02 IOH on IP93-5 sn RPM031 10 Stuck RH Tracker Entry - IOH (6 total on this node)
0x003E r002i23b15 IP93-5 sn RRB997 9 RH H0 detected TRB Timeout - Destination Node
0x0004 r002i01b02 SKT0 on IP93-5 sn RPX686 8 RH H0 detected TRB Timeout - Requester
0x0000 r002i01b00 IP93-5 sn RPJ019 7 RH H0 detected TRB Timeout - Destination Node
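For anyone who wants to double-check the decoded D3_IOBAS_IOLIM_SSTS fields in the dump, here is a small Python sketch that extracts them from the raw value. The bit positions are taken straight from the dump lines (<30>, <15:12>, <7:4>); the helper names extract_bits and decode are my own, and the sketch does not try to interpret what the fields mean.

```python
def extract_bits(value: int, high: int, low: int) -> int:
    """Return the bit field value[high:low], inclusive on both ends."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


# Raw register value as printed in the crash dump.
RAW = 0x0000000040006060

# Field name -> (high bit, low bit), copied from the dump output.
FIELDS = {
    "SIGSYSTEMERROR": (30, 30),
    "IOBASEADDRLIMIT": (15, 12),
    "IOBASEADDR": (7, 4),
}


def decode(value: int) -> dict:
    """Decode every field listed in FIELDS from the raw register value."""
    return {name: extract_bits(value, high, low) for name, (high, low) in FIELDS.items()}


if __name__ == "__main__":
    for name, field in decode(RAW).items():
        print(f"D3_IOBAS_IOLIM_SSTS:{name} = {field:#x}")
```

The results match the dump: SIGSYSTEMERROR is 0x1, and both IOBASEADDRLIMIT and IOBASEADDR are 0x6.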
What I'm hoping to get is a method to further debug what is happening with the driver during a system crash. Is there a way to build the ixgbe driver with debugging options that generate information useful for understanding the problem?
I'm also wondering if anyone has suggestions for kernel parameters that might help with these types of issues.
Thanks for any input.