14 Replies Latest reply on Nov 14, 2013 8:53 AM by ricbartm

    Issue with "Detected Tx Unit Hang" dropping network connections

    MatthewPike

      Hello,

       

      We are having an issue with our NICs hitting a Tx Unit Hang, after which the adapter does not reset correctly. The error messages below are printed to the console at a rapid rate and all networking stops. Connecting via IPMI, I've found that "service network restart" doesn't resolve the issue, but the following steps do: service network stop; rmmod ixgbe; modprobe ixgbe; service network start. Everything then goes back to normal for some number of hours (or in some cases days) until it happens again. If anyone has any insight or history with this issue I'd love any input, and I'm happy to provide more details where needed.
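
For anyone who needs the same workaround, the recovery sequence described above could be wrapped in a small script so it can be run in one step from an IPMI console. This is only a sketch: the function name `reload_ixgbe` and the `DRY_RUN` switch are made up for illustration, and it assumes a RHEL/CentOS 6 style `service network` init script.

```shell
#!/bin/sh
# Sketch: reload the ixgbe module to recover from a Tx Unit Hang.
# DRY_RUN=1 only prints the commands (hypothetical switch for this example).
reload_ixgbe() {
    for cmd in "service network stop" \
               "rmmod ixgbe" \
               "modprobe ixgbe" \
               "service network start"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$cmd"
        else
            # Stop at the first failing step so a half-done reload is visible.
            $cmd || return 1
        fi
    done
}
```

Running `DRY_RUN=1 reload_ixgbe` lists the four steps without touching the network; calling `reload_ixgbe` as root executes them in order.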

       

      Thanks,

      Matthew

       

      The details:

      kernel: 2.6.32-358.6.2.el6

      Intel driver versions tested: 3.9.15-k (CentOS stock), 3.17.3 (latest version)

      Adaptor: Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)

                     Subsystem: Intel Corporation Ethernet Server Adapter X520-2

       

      The error messages from /var/log/messages (and dmesg):

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: Detected Tx Unit Hang

      [kern.err] [kernel: .]: Tx Queue             <2>

      [kern.err] [kernel: .]: TDH, TDT             <0>, <1a>

      [kern.err] [kernel: .]: next_to_use          <1a>

      [kern.err] [kernel: .]: next_to_clean        <0>

      [kern.err] [kernel: .]: tx_buffer_info[next_to_clean]

      [kern.err] [kernel: .]: time_stamp           <101fd8552>

      [kern.err] [kernel: .]: jiffies              <101fd8d43>

      [kern.info] [kernel: .]: ixgbe 0000:08:00.1: eth3: tx hang 301 detected on queue 2, resetting adapter

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: Reset adapter

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 4 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 5 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 6 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 7 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 8 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 9 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 10 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 11 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: master disable timed out

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 4 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 5 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 6 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 7 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 8 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 9 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 10 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 11 not cleared within the polling period

      [kern.info] [kernel: .]: ixgbe 0000:08:00.1: eth3: detected SFP+: 4

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: Reset adapter

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 4 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 5 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 6 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 7 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 8 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 9 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 10 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 11 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: master disable timed out

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 4 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 5 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 6 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 7 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 8 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 9 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 10 not cleared within the polling period

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: RXDCTL.ENABLE on Rx queue 11 not cleared within the polling period

      [kern.info] [kernel: .]: ixgbe 0000:08:00.1: eth3: detected SFP+: 4

      [kern.info] [kernel: .]: ixgbe 0000:08:00.1: eth3: NIC Link is Up 10 Gbps, Flow Control: RX/TX

      [kern.err] [kernel: .]: ixgbe 0000:08:00.1: eth3: Detected Tx Unit Hang

      [kern.err] [kernel: .]: Tx Queue             <2>

      [kern.err] [kernel: .]: TDH, TDT             <0>, <2>

      [kern.err] [kernel: .]: next_to_use          <2>

      [kern.err] [kernel: .]: next_to_clean        <0>

      [kern.err] [kernel: .]: tx_buffer_info[next_to_clean]

      [kern.err] [kernel: .]: time_stamp           <101fd91c6>

      [kern.err] [kernel: .]: jiffies              <101fd9257>

      [kern.info] [kernel: .]: ixgbe 0000:08:00.1: eth3: tx hang 303 detected on queue 2, resetting adapter
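
One way to read the dumps above: as far as the ixgbe source suggests, `time_stamp` is the jiffies value stored when the buffer at `next_to_clean` was queued, and `jiffies` is the tick count at the time of the report, so the difference is how long that descriptor has been pending. A minimal sketch of the arithmetic follows; the helper name `jiffies_delta` is hypothetical, and converting ticks to seconds depends on the kernel's `CONFIG_HZ`, which is not stated in the thread.

```shell
#!/bin/sh
# Difference between two hexadecimal jiffies values, in ticks.
# $1: time_stamp, $2: jiffies, both without the 0x prefix, as the driver prints them.
jiffies_delta() {
    echo $(( 0x$2 - 0x$1 ))
}

jiffies_delta 101fd8552 101fd8d43   # first dump: prints 2033
jiffies_delta 101fd91c6 101fd9257   # second dump: prints 145
```

If the kernel were built with HZ=1000 (an assumption, check `CONFIG_HZ` in your kernel config), these would correspond to roughly 2.0 s and 0.15 s.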

        • 1. Re: Issue with "Detected Tx Unit Hang" dropping network connections
          domby

          We see the same issue with two X520 NICs under 3.11.1-1.el6xen.x86_64 kernels.

           

          ixgbe 0000:09:00.0 eth0: initiating reset due to tx timeout

          ixgbe 0000:09:00.0 eth0: Reset adapter

          ixgbe 0000:09:00.0 eth0: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period

          ixgbe 0000:09:00.0 eth0: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period

          ixgbe 0000:09:00.0 eth0: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period

          ixgbe 0000:09:00.0 eth0: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period

          ixgbe 0000:09:00.0 eth0: RXDCTL.ENABLE on Rx queue 4 not cleared within the polling period

          ixgbe 0000:09:00.0 eth0: RXDCTL.ENABLE on Rx queue 5 not cleared within the polling period

          br2: port 1(eth0) entered disabled state

          ixgbe 0000:09:00.0: master disable timed out

          ixgbe 0000:09:00.0 eth0: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period

          ixgbe 0000:09:00.0 eth0: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period

          ixgbe 0000:09:00.0 eth0: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period

          ixgbe 0000:09:00.0 eth0: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period

          ixgbe 0000:09:00.0 eth0: RXDCTL.ENABLE on Rx queue 4 not cleared within the polling period

          ixgbe 0000:09:00.0 eth0: RXDCTL.ENABLE on Rx queue 5 not cleared within the polling period

          ixgbe 0000:09:00.0 eth0: detected SFP+: 3

          • 2. Re: Issue with "Detected Tx Unit Hang" dropping network connections
            ricbartm

            Hi all,

             

            We are suffering the same issues. Our scenario is the following:

            • Linux 3.10.11
            • ixgbe version 3.18.7
            • 2 x 10 core CPU with HT enabled (40 core total)
            • Dual-port Intel card (82599EB 10-Gigabit)
            • We use compatible Direct-Attach cables
            • We do interface bonding using LACP passive mode [1]
            • We do VLAN tagging over the bonding interface.
            • We see the "Tx Unit Hang" on both interfaces, at random. We are unable to reproduce it on demand.
            • We are running 40 queues (RSS) per interface spread across the cores. NIC1 queues are spread across cores within the same CPU. NIC2 queues are spread across cores within the other CPU. See [2].
            • Bonding interface transfer rate is 1.20 Gbps. Packet rate is 260k packets/s.
            • The user-space daemon HAProxy listens for external connections and forwards them to internal servers. The servers process the requests and reply to HAProxy.
            • HAProxy is not bound to any specific NUMA node, so it may be assigned to any (0-1).

             

            The server acts as a load balancer. Its responsibilities are:

            • Running a Layer 7 user-space daemon (HAProxy). HTTP requests and responses are usually copied into user space.
            • LVS load balancing in DSR (Direct Server Return) mode.

             

            We have experienced this issue with different ixgbe kernel modules and with different kernels. There is a similar bug [3] on SourceForge, but it was closed due to inactivity.

             

            Regards,

             



            [1] We are testing this server with VLAN tagging over the physical interface, without using kernel bonding at all, in order to rule out bonding as the cause.


            [2] We used to receive "ixgbe: Invalid Receive-Side Scaling (RSS) specified (40),  using default." errors, but the number of queues was indeed 40 (amount of cores). We are testing with 16 queues per interface.


            [3]

            Intel Ethernet Drivers and Utilities / Bugs / #7 Tx hang on 82599

            • 3. Re: Issue with "Detected Tx Unit Hang" dropping network connections
              MatthewPike

              We've recently deployed the latest (3.18.7-1) version of the driver and the problem persists. I've got a case open with Intel and I'm still waiting to hear back from my latest update to the ticket.

               

              Matthew

              • 4. Re: Issue with "Detected Tx Unit Hang" dropping network connections
                ricbartm

                Hello Matthew,

                 

                We also have an open case with Intel through our partner (Supermicro). I'll post any feedback or news they give us in this thread.

                • 5. Re: Issue with "Detected Tx Unit Hang" dropping network connections
                  MatthewPike

                  Thanks! I really appreciate the post and follow up. This issue is a pain. :-)

                  • 6. Re: Issue with "Detected Tx Unit Hang" dropping network connections
                    MatthewPike

                    So I emailed Intel in regards to an update on my request for escalation. I heard back there was no update, then got an email 2 days later saying my case was closed (the date being the same day I asked for an update). Gonna re-open and try again. Have you had any better luck or any insights?

                     

                    Matthew

                    • 7. Re: Issue with "Detected Tx Unit Hang" dropping network connections
                      ricbartm

                      Hello Matthew,

                       

                      We handled this issue through Supermicro, the partner we bought the 10G card from. The case was initially handled by support in Russia, with Supermicro as intermediary, but given the complexity of the issue and the amount of data and explanations we needed to exchange, we managed to get an engineer from the R&D team in Israel.

                       

                      At the moment we are still suffering the issue. My ongoing tests show the same results. I tested:

                      1. Reducing the number of RSS queues per interface from 40 to 16

                      2. Using raw interfaces with VLAN tagging rather than VLAN tagging over the 802.3ad LACP bonding.

                       

                      Today we'll test disabling LRO on the interfaces using kernel module load parameters. The Intel engineer suggested this because the LRO counters shown by command [1] were not increasing on any of the interfaces. IMHO this happens for two reasons:

                      1. HAProxy splice is not working as expected (see [2])

                      2. The driver README has some documentation that may explain it (see [3])

                       

                      I have to admit that there are many warnings regarding LRO and IP routing/bridging in the README, but we are not doing that. We accept the connection, process it, open a new connection to the backend, and then send the response over the connection previously opened by the client.

                       

                      Regards,

                       

                       

                      [1]

                      # ethtool -S eth4 | grep lro

                           lro_aggregated: 0

                           lro_flushed: 0
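
The check in [1] could be scripted so it is easy to repeat on each interface. This is a sketch that assumes the counter names `lro_aggregated` and `lro_flushed` shown above; the helper name `lro_counters_active` is hypothetical. It reads `ethtool -S` output on stdin, so it can also be tried against saved output.

```shell
#!/bin/sh
# Exit 0 if any lro_* counter in `ethtool -S` output (read from stdin) is non-zero,
# exit 1 if all of them are zero or none are present.
lro_counters_active() {
    awk -F': *' '
        $1 ~ /lro_/ && ($2 + 0) > 0 { active = 1 }
        END { exit (active ? 0 : 1) }
    '
}

# Example on a live system (the interface name is an assumption):
#   ethtool -S eth4 | lro_counters_active && echo "LRO is aggregating packets"
```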

                       

                      [2]

                      splice (system call) - Wikipedia, the free encyclopedia

                      HAProxy - The Reliable, High Performance TCP/HTTP Load Balancer

                       

                      [3]

                       

                       

                      Hardware Receive Side Coalescing (HW RSC)

                      -----------------------------------------

                      82599 and X540-based adapters support HW RSC, which can merge multiple frames

                      from the same IPv4 TCP/IP flow into a single structure that can span one or

                      more descriptors. It works similarly to Software Large Receive Offload

                      technique. By default HW RSC is enabled and SW LRO cannot be used for 82599-

                      or X540-based adapters unless HW RSC is disabled.

                       

                       

                      IXGBE_NO_HW_RSC is a compile time flag. The user can enable it at compile time

                      to remove support for HW RSC from the driver. The flag is used by adding

                      CFLAGS_EXTRA="-DIXGBE_NO_HW_RSC" to the make file when it is being compiled.

                      make CFLAGS_EXTRA="-DIXGBE_NO_HW_RSC" install

                      • 8. Re: Issue with "Detected Tx Unit Hang" dropping network connections
                        r00twayne

                        I am seeing the same issue ::

                         

                        # uname -r

                        3.11.6-1.el6xen.x86_64

                         

                        03:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)

                        03:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)

                         

                        # ethtool -i eth0

                        driver: ixgbe

                        version: 3.13.10-k

                        firmware-version: 0x80000208

                        bus-info: 0000:03:00.0

                        supports-statistics: yes

                        supports-test: yes

                        supports-eeprom-access: yes

                        supports-register-dump: yes

                        supports-priv-flags: no

                         

                         

                        # ethtool -k eth0

                        Features for eth0:

                        rx-checksumming: on

                        tx-checksumming: on

                                tx-checksum-ipv4: on

                                tx-checksum-ip-generic: off [fixed]

                                tx-checksum-ipv6: on

                                tx-checksum-fcoe-crc: on [fixed]

                                tx-checksum-sctp: on

                        scatter-gather: on

                                tx-scatter-gather: on

                                tx-scatter-gather-fraglist: off [fixed]

                        tcp-segmentation-offload: on

                                tx-tcp-segmentation: on

                                tx-tcp-ecn-segmentation: off [fixed]

                                tx-tcp6-segmentation: on

                        udp-fragmentation-offload: off [fixed]

                        generic-segmentation-offload: on

                        generic-receive-offload: off

                        large-receive-offload: off

                        rx-vlan-offload: on

                        tx-vlan-offload: on

                        ntuple-filters: off

                        receive-hashing: on

                        highdma: on [fixed]

                        rx-vlan-filter: on

                        vlan-challenged: off [fixed]

                        tx-lockless: off [fixed]

                        netns-local: off [fixed]

                        tx-gso-robust: off [fixed]

                        tx-fcoe-segmentation: on [fixed]

                        tx-gre-segmentation: off [fixed]

                        tx-udp_tnl-segmentation: off [fixed]

                        tx-mpls-segmentation: off [fixed]

                        fcoe-mtu: off [fixed]

                        tx-nocache-copy: on

                        loopback: off [fixed]

                        rx-fcs: off [fixed]

                        rx-all: off

                        tx-vlan-stag-hw-insert: off [fixed]

                        rx-vlan-stag-hw-parse: off [fixed]

                        rx-vlan-stag-filter: off [fixed]

                         

                        # cat /etc/redhat-release

                        CentOS release 6.4 (Final)

                         

                         

                        ixgbe 0000:03:00.0 eth0: tx hang 1291 detected on queue 1, resetting adapter

                        ixgbe 0000:03:00.0 eth0: initiating reset due to tx timeout

                        ixgbe 0000:03:00.0 eth0: Detected Tx Unit Hang

                          Tx Queue             <2>

                          TDH, TDT             <0>, <2>

                          next_to_use          <2>

                          next_to_clean        <0>

                        tx_buffer_info[next_to_clean]

                          time_stamp           <1049a1850>

                          jiffies              <1049a250c>

                        ixgbe 0000:03:00.0 eth0: tx hang 1291 detected on queue 2, resetting adapter

                        ixgbe 0000:03:00.0 eth0: initiating reset due to tx timeout

                        ixgbe 0000:03:00.0 eth0: Reset adapter

                        ixgbe 0000:03:00.0 eth0: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period

                        ixgbe 0000:03:00.0 eth0: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period

                        ixgbe 0000:03:00.0 eth0: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period

                        ixgbe 0000:03:00.0 eth0: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period

                        bonding: bond0: link status definitely down for interface eth0, disabling it

                        ixgbe 0000:03:00.0: master disable timed out

                        ixgbe 0000:03:00.0 eth0: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period

                        ixgbe 0000:03:00.0 eth0: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period

                        ixgbe 0000:03:00.0 eth0: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period

                        ixgbe 0000:03:00.0 eth0: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period

                        ixgbe 0000:03:00.0 eth0: detected SFP+: 65535

                        ixgbe 0000:03:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX

                        bonding: bond0: link status definitely up for interface eth0, 10000 Mbps full duplex.

                        ixgbe 0000:03:00.1 eth1: Detected Tx Unit Hang
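
To compare how often each port hangs, the "Detected Tx Unit Hang" lines can be counted per interface. The sketch below handles both log layouts seen in this thread (with and without a colon after the PCI address); the function name `count_tx_hangs` is hypothetical and it assumes interfaces are named `ethN`.

```shell
#!/bin/sh
# Count "Detected Tx Unit Hang" events per interface from kernel log text on stdin.
# Prints one "<interface> <count>" line per interface, sorted by name.
count_tx_hangs() {
    awk '/Detected Tx Unit Hang/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^eth[0-9]+:?$/) {
                     sub(/:$/, "", $i)   # drop the trailing colon, if any
                     n[$i]++
                 }
         }
         END { for (ifc in n) print ifc, n[ifc] }' | sort
}

# Example: count_tx_hangs < /var/log/messages
```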

                        • 9. Re: Issue with "Detected Tx Unit Hang" dropping network connections
                          r00twayne

                          Has anybody had any updates on this issue or found a resolution?

                          • 10. Re: Issue with "Detected Tx Unit Hang" dropping network connections
                            ricbartm

                            So far we have been told to try:

                            • Disable GRO
                            • Disable LRO

                             

                            I have to admit that, given the way these settings are changed (LRO through kernel module parameters, GRO through ethtool), one server (LRO disabled) was rebooted and the other (GRO disabled) was not. The GRO-disabled server suffered the Unit Hang issue a day afterwards, as usual. The LRO-disabled server, on the other hand, has been stable for 6 days and 21 hours, to be precise. In our experience that is normal, and we think it's too early to draw a conclusion from these tests.
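
For reference, GRO can be switched off at runtime with `ethtool -K`, which is why no reboot was needed on that server. A small sketch for doing it on several interfaces follows; the function name `disable_gro` and the `APPLY` switch are hypothetical, and the interface names are assumptions. Settings made this way do not persist across a reboot.

```shell
#!/bin/sh
# Print the ethtool command that turns GRO off for each interface given,
# or run it when APPLY=1 is set (requires root and the ethtool binary).
disable_gro() {
    for ifc in "$@"; do
        if [ "${APPLY:-0}" = "1" ]; then
            ethtool -K "$ifc" gro off
        else
            echo "ethtool -K $ifc gro off"
        fi
    done
}

# Example: APPLY=1 disable_gro eth0 eth1
```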

                             

                            We think (not verified) that system uptime is somehow related to the issue, which would explain why all goes well during the first week and then the servers start failing several times a week. We don't know yet whether it is related to the number of specific events that occur or to the amount of data transferred.

                             

                            I'll keep posting here any update about this issue.

                             

                            Regards,

                            • 11. Re: Issue with "Detected Tx Unit Hang" dropping network connections
                              pmorgan.sa

                              I just bumped into it on a system running RH6.4 and using the bundled 3.15.9-k driver.

                               

                              I fixed it by going to: https://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=14687

                              Downloaded the latest driver, built it from source and installed it, then ran dracut -f to reload the new module into the initrd.  It works fine now, no more TX Hangs every few seconds.

                               

                              Hope that helps,

                              pete

                              • 12. Re: Issue with "Detected Tx Unit Hang" dropping network connections
                                r00twayne

                                Updating to the latest driver did not help for us. The issue was seen again around 4 days later.

                                 

                                CentOS release 6.4 (Final)

                                 

                                Kernel Version: 3.11.7-1.el6xen.x86_64

                                 

                                # ethtool -i eth0

                                driver: ixgbe

                                version: 3.18.7

                                firmware-version: 0x80000208

                                bus-info: 0000:03:00.0

                                supports-statistics: yes

                                supports-test: yes

                                supports-eeprom-access: yes

                                supports-register-dump: yes

                                supports-priv-flags: no

                                 

                                 

                                # ethtool -k eth0

                                Features for eth0:

                                rx-checksumming: on

                                tx-checksumming: on

                                        tx-checksum-ipv4: on

                                        tx-checksum-ip-generic: off [fixed]

                                        tx-checksum-ipv6: on

                                        tx-checksum-fcoe-crc: on [fixed]

                                        tx-checksum-sctp: on

                                scatter-gather: on

                                        tx-scatter-gather: on

                                        tx-scatter-gather-fraglist: off [fixed]

                                tcp-segmentation-offload: on

                                        tx-tcp-segmentation: on

                                        tx-tcp-ecn-segmentation: off [fixed]

                                        tx-tcp6-segmentation: on

                                udp-fragmentation-offload: off [fixed]

                                generic-segmentation-offload: on

                                generic-receive-offload: off

                                large-receive-offload: off

                                rx-vlan-offload: on

                                tx-vlan-offload: on

                                ntuple-filters: off

                                receive-hashing: on

                                highdma: on [fixed]

                                rx-vlan-filter: on [fixed]

                                vlan-challenged: off [fixed]

                                tx-lockless: off [fixed]

                                netns-local: off [fixed]

                                tx-gso-robust: off [fixed]

                                tx-fcoe-segmentation: on [fixed]

                                tx-gre-segmentation: off [fixed]

                                tx-udp_tnl-segmentation: off [fixed]

                                tx-mpls-segmentation: off [fixed]

                                fcoe-mtu: off [fixed]

                                tx-nocache-copy: on

                                loopback: off [fixed]

                                rx-fcs: off [fixed]

                                rx-all: off [fixed]

                                tx-vlan-stag-hw-insert: off [fixed]

                                rx-vlan-stag-hw-parse: off [fixed]

                                rx-vlan-stag-filter: off [fixed]

                                • 13. Re: Issue with "Detected Tx Unit Hang" dropping network connections
                                  MatthewPike

                                  We've been running the latest version, 3.18.7-1, almost since the day it came out and still see the issue. We are seeing it less frequently, but it's still happening. I've stopped hearing back from Intel support via my actual ticket so I'm not holding out too much hope on that front.

                                  • 14. Re: Issue with "Detected Tx Unit Hang" dropping network connections
                                    ricbartm

                                    Hi,

                                     

                                    We finally solved the issue by disabling LRO. Run your own tests and see the results. We have been running for more than 15 days without issues, whereas without this configuration the NICs started to hang after 7 days, several times a day.

                                     

                                    In order to disable LRO (Debian):

                                    • Put "options ixgbe LRO=0,0" into /etc/modprobe.d/ixgbe.conf
                                    • Don't forget to execute "update-initramfs -u"

                                     

                                    Reboot the server and make sure LRO is disabled:

                                    • ethtool -k eth0 | grep large-receive-offload
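
The steps above could be sketched as two small shell helpers. This assumes the Debian paths from this post and a driver build that accepts the `LRO` module parameter, as the 3.18.7 module here does; the function names `write_lro_conf` and `lro_state` are hypothetical, and the target path can be overridden so the helper can be tried without touching /etc.

```shell
#!/bin/sh
# Write the module option that disables LRO on both ports.
# $1: target file (defaults to the path used in this post).
write_lro_conf() {
    conf=${1:-/etc/modprobe.d/ixgbe.conf}
    echo "options ixgbe LRO=0,0" > "$conf"
}

# Read `ethtool -k` output on stdin and print the large-receive-offload state.
lro_state() {
    awk -F': *' '$1 ~ /large-receive-offload/ { split($2, s, " "); print s[1] }'
}

# On the real system (as root), followed by a reboot:
#   write_lro_conf && update-initramfs -u
# After the reboot, this should print "off":
#   ethtool -k eth0 | lro_state
```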

                                     

                                    Keep in mind these instructions work using kernel 3.10.5 and kernel module 3.18.7.

                                     

                                    If you are worried that disabling LRO will affect performance: we haven't seen any degradation (i.e. no higher CPU usage) with our traffic pattern (Layer 7 HTTP load balancers). If you look at the README inside the latest driver package downloaded from SourceForge or Intel, you will find a feature named RSC (Receive Side Coalescing), which does the same thing at a lower level: it works out which packets belong to the same TCP flow and aggregates them in order to reduce the number of packets. There is also an Intel paper on it from 2006 (search for "Receive Side Coalescing srihari.makineni AT intel.com").

                                     

                                    Regards,