1 Reply Latest reply on Mar 15, 2016 7:44 PM by wb_Intel

    Intel 82599ES 10-Gigabit Adapter Causing System Panic (SLES11SP4)

    KarlW

      Greetings Intel Community,

       

      I have several systems that experience a system panic with a currently unknown cause. The brief overview of the configuration is:

       

      OS: SLES 11 SP 4

      Kernel: 3.0.101-68-default

      ixgbe driver (inbox): 3.19.1-k

      Adapter: Each server has two dual-port X520-SR2 networking adapters installed (4x10G ports total, 2 physical adapters)

      Configuration: all 4 interfaces under bond0 in mode=802.3ad (lacp) to a managed switch with the matching configuration

      Note #1: We've also seen the crashes on SLES 11 SP3 and on RHEL systems, with different kernels and different ixgbe driver versions. This is just the current configuration that is experiencing the issue.

      Note #2: Load and traffic type don't seem to matter. The system can crash with or without load, and the crash has happened with just an SSH session open.

       

      Based on the crash dump information we have, the adapters appear to be involved in "IOH timeouts", and we also see "CATERRs".

       

      0x000A:IOH: D3_IOBAS_IOLIM_SSTS @ 0x00000181C = 0x0000000040006060

      0x000A:IOH: D3_IOBAS_IOLIM_SSTS:SIGSYSTEMERROR <30> = 0x1

      0x000A:IOH: D3_IOBAS_IOLIM_SSTS:IOBASEADDRLIMIT <15:12> = 0x6

      0x000A:IOH: D3_IOBAS_IOLIM_SSTS:IOBASEADDR <7:4> = 0x6
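
For anyone who wants to cross-check the decode above, here is a minimal Python sketch that pulls the same three fields out of the raw register value. The bit positions are taken from the dump itself; the helper name `extract_bits` and the `FIELDS` table are just my own illustration, not part of any Intel or vendor tool.

```python
def extract_bits(value: int, high: int, low: int) -> int:
    """Return bits high..low (inclusive) of value, right-aligned."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


# Raw register value copied from the dump above.
D3_IOBAS_IOLIM_SSTS = 0x0000000040006060

# Field positions as printed in the dump: name -> (high bit, low bit).
# Only the fields the dump reports are listed here.
FIELDS = {
    "SIGSYSTEMERROR": (30, 30),
    "IOBASEADDRLIMIT": (15, 12),
    "IOBASEADDR": (7, 4),
}

# Decode every listed field from the raw value.
decoded = {
    name: extract_bits(D3_IOBAS_IOLIM_SSTS, high, low)
    for name, (high, low) in FIELDS.items()
}

for name, field_value in decoded.items():
    print(f"{name} = {field_value:#x}")
```

The decoded values match the dump lines, so the only flag set in the upper half of the register is the system error bit (bit 30).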

       

      Also...

       

      0x0024 r002i23b02 IOH on IP93-5 sn RPM031 10 Stuck RH Tracker Entry - IOH (6 total on this node)

      0x003E r002i23b15 IP93-5 sn RRB997 9 RH H0 detected TRB Timeout - Destination Node

      0x0004 r002i01b02 SKT0 on IP93-5 sn RPX686 8 RH H0 detected TRB Timeout - Requester

      0x0000 r002i01b00 IP93-5 sn RPJ019 7 RH H0 detected TRB Timeout - Destination Node

       

       

      What I'm hoping to get is a method to further debug what is happening with the driver during a system crash. Is there a way to build the ixgbe driver with debugging options that generate information useful for understanding the problem?

       

      I'm also wondering if anyone has suggestions for kernel parameters that might help with these types of issues.

       

      Thanks for any input.