11 Replies Latest reply on May 30, 2017 8:55 AM by seh4nc

    Will X710 firmware update 4.53 to 5.05 address sporadic transmit queue timeout?

    seh4nc

      We have seen this "tx_timeout" / "hung_queue" error three times across two servers. Each time, packets stopped flowing for several seconds and then recovered:

       

      Apr 10 02:04:14 node39 kernel: WARNING: at net/sched/sch_generic.c:297 dev_watchdog+0x276/0x280()
      Apr 10 02:04:14 node39 kernel: NETDEV WATCHDOG: p2p1 (i40e): transmit queue 8 timed out
      ...
      Apr 10 02:04:14 node39 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE ------------ 3.10.0-514.6.1.el7.x86_64 #1
      Apr 10 02:04:14 node39 kernel: Hardware name: Dell Inc. PowerEdge R620/01W23F, BIOS 2.1.3 11/20/2013
      ...
      Apr 10 02:04:14 node39 kernel: i40e 0000:42:00.0 p2p1: tx_timeout: VSI_seid: 390, Q 8, NTC: 0x113, HWB: 0x116, NTU: 0x116, TAIL: 0x116, INT: 0x1
      Apr 10 02:04:14 node39 kernel: i40e 0000:42:00.0 p2p1: tx_timeout recovery level 1, hung_queue 8
      Apr 10 02:04:14 node39 kernel: i40e 0000:42:00.0 p2p1: adding 3c:fd:fe:9f:b7:48 vid=0
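
To keep track of how often these events recur, the "tx_timeout recovery level" lines can be tallied per interface and queue. The following is a minimal sketch, not a tested monitoring tool: the function name count_tx_timeouts is hypothetical, and the default log path /var/log/messages is an assumption about where CentOS 7 writes kernel messages on these nodes.

```shell
#!/bin/sh
# Hypothetical helper: tally i40e tx_timeout recovery events per
# interface and queue in a kernel log file.
# Usage: count_tx_timeouts [logfile]
# The default path is an assumption (typical for CentOS 7 syslog).
count_tx_timeouts() {
    logfile="${1:-/var/log/messages}"
    # Matches lines of the form:
    #   ... i40e 0000:42:00.0 p2p1: tx_timeout recovery level 1, hung_queue 8
    # and prints one line per pair: "<interface> queue <n>: <count>"
    awk '/tx_timeout recovery level/ {
             for (i = 1; i <= NF; i++) {
                 if ($i == "tx_timeout") iface = $(i - 1)  # field before is "p2p1:"
                 if ($i == "hung_queue") queue = $(i + 1)  # field after is the queue id
             }
             sub(/:$/, "", iface)                          # drop the trailing colon
             hits[iface " queue " queue]++
         }
         END { for (k in hits) print k ": " hits[k] }' "$logfile"
}
```

Calling count_tx_timeouts with no argument reads the assumed default log; passing a path (for example a saved copy of the excerpt above) reads that file instead. Comparing the tallies before and after a change gives a rough measure of whether the change helped.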
      

       

      This happened within the first 3 weeks of using Intel X710 dual-port adapters running firmware 4.53 (with supported Intel SFP+ modules), recently installed in a cluster of two-year-old Dell R620s running CentOS 7.3:

       

      node39:/# lspci -vv | grep -A 1 10GbE
      pcilib: sysfs_read_vpd: read failed: Input/output error
      05:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
              Subsystem: Intel Corporation Ethernet Converged Network Adapter X710-2
      --
      05:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
              Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
      --
      42:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
              Subsystem: Intel Corporation Ethernet Converged Network Adapter X710-2
      --
      42:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
              Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
      node39:/usr/local/bin# ethtool -i p2p1
      driver: i40e
      version: 1.5.10-k
      firmware-version: 4.53 0x8000206e 0.0.0
      expansion-rom-version: 
      bus-info: 0000:42:00.0
      supports-statistics: yes
      supports-test: yes
      supports-eeprom-access: yes
      supports-register-dump: yes
      supports-priv-flags: yes
      

       

      We have used X710s without issue in a few other servers, but those are HP OEM cards running firmware 4.60:

       

      node93:/# lspci -vv |grep -A 1 10GbE
      04:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
              Subsystem: Hewlett-Packard Company HP Ethernet 10Gb 2-port 562FLR-SFP+ Adapter
      --
      04:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
              Subsystem: Hewlett-Packard Company Ethernet 10Gb 562SFP+ Adapter
      --
      05:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
              Subsystem: Hewlett-Packard Company HP Ethernet 10Gb 2-port 562SFP+ Adapter
      --
      05:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
              Subsystem: Hewlett-Packard Company Ethernet 10Gb 562SFP+ Adapter
      node93:/# ethtool -i ens2f0
      driver: i40e
      version: 1.5.10-k
      firmware-version: 4.60 0x80001f47 1.3072.0
      expansion-rom-version: 
      bus-info: 0000:05:00.0
      supports-statistics: yes
      supports-test: yes
      supports-eeprom-access: yes
      supports-register-dump: yes
      supports-priv-flags: yes
      

       

      I have downloaded nvmupdate64e and updated a spare Dell to firmware 5.05, so if this is the correct solution, I have confirmed the procedure. However, threads such as "Intel X710 vs VMWare ESX: crash and reboot" give me pause: a crash and reboot would certainly be worse than a 10-20 second transmit hang.

       

      My questions are:

       

      1. Has anyone else experienced these tx_timeout / hung_queue issues?
      2. Is it a known issue? If so, is it an issue with the firmware, with the i40e driver, or with something else such as TSO/GSO (which are currently on, but I could turn them off)?
      3. If it is a firmware issue, was it corrected between versions 4.53 and 4.60, and is it recommended to flash production machines to 5.05 or to some other version? I could not find a detailed change list.
      4. Is there a way (such as generating high data rates with iperf) to make the sporadic issue occur reproducibly, so that I can demonstrate whether an attempted solution has been successful?
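
For the TSO/GSO part of question 2, one way to test the hypothesis is to switch both offloads off on a single interface and watch whether the timeouts recur. The sketch below assumes the standard ethtool -K offload syntax with the short names tso and gso; the function name set_seg_offload and the DRY_RUN variable are hypothetical conveniences, not part of ethtool. Whether disabling these offloads actually affects the hang is unknown.

```shell
#!/bin/sh
# Hypothetical helper: turn TSO and GSO on or off for one interface.
# Usage: set_seg_offload <interface> on|off
# With DRY_RUN=1 the ethtool command is printed instead of executed,
# so the change can be reviewed before touching a production node.
set_seg_offload() {
    dev="$1"
    state="$2"
    case "$state" in
        on|off) ;;
        *) echo "usage: set_seg_offload <interface> on|off" >&2
           return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ethtool -K $dev tso $state gso $state"
    else
        # Requires root; the setting does not persist across reboots.
        ethtool -K "$dev" tso "$state" gso "$state"
    fi
}
```

For example, set_seg_offload p2p1 off would disable both offloads on the interface that logged the timeout, and ethtool -k p2p1 (lower-case k) lists the resulting offload settings. Changing one interface at a time keeps a comparison point on the same host.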

       

      Thanks in advance!