4 Replies Latest reply on Feb 11, 2016 12:23 AM by John_obn

    i40e XL710 hang up - tx_timeout hung_queue - ubuntu

    John_obn

      Hello,

      We have set up a machine running Ubuntu 14.04.3 (fully updated) as a border router:

      Linux hellnat 3.19.0-47-generic #53~14.04.1-Ubuntu SMP Mon Jan 18 16:09:14 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

      CPU: 2 x E5-2690v3 with hyperthreading enabled (48 logical "cores" in total as seen by the OS)

      NIC: Intel XL710 quad port; every "channel" of every p1p* interface is bound to its own core

      The machine acts as a border router, so it runs BGP. We use p1p1 and p1p3 to connect to internal routers, and p1p2 and p1p3 to connect to the uplinks.

      Traffic suddenly stopped, and this was NOT during peak hours.

      [Attached graphs: zabb1.png, zabb2.png]

      After rebooting the machine (via IPMI), I found the following lines in the syslog file:

      Jan 31 02:33:33 hellnat kernel: [220504.793680] ------------[ cut here ]------------

      Jan 31 02:33:33 hellnat kernel: [220504.793701] WARNING: CPU: 45 PID: 0 at /build/linux-lts-vivid-Yt59dr/linux-lts-vivid-3.19.0/net/sched/sch_generic.c:303 dev_watchdog+0x24f/0x260()

      Jan 31 02:33:33 hellnat kernel: [220504.793705] NETDEV WATCHDOG: p1p1 (i40e): transmit queue 8 timed out

      Jan 31 02:33:33 hellnat kernel: [220504.793707] Modules linked in: nf_conntrack_netlink nfnetlink xt_tcpudp xt_multiport iptable_filter xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw ast ttm joydev intel_rapl iosf_mbi drm_kms_helper x86_pkg_temp_thermal intel_powerclamp drm syscopyarea sysfillrect sysimgblt coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel ipmi_ssif aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich mei_me sb_edac edac_core mei ipmi_si 8250_fintek ipmi_msghandler lp wmi acpi_pad parport ioatdma mac_hid shpchp nf_conntrack_ftp acpi_power_meter nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc tcp_htcp hid_generic i40e(OE) igb vxlan ip6_udp_tunnel i2c_algo_bit udp_tunnel usbhid dca uas configfs ahci ptp usb_storage hid megaraid_sas libahci pps_core

      Jan 31 02:33:33 hellnat kernel: [220504.793817] CPU: 45 PID: 0 Comm: swapper/45 Tainted: G           OE  3.19.0-47-generic #53~14.04.1-Ubuntu

      Jan 31 02:33:33 hellnat kernel: [220504.793820] Hardware name: Supermicro SYS-6018R-WTR/X10DRW-i, BIOS 1.1 08/13/2015

      Jan 31 02:33:33 hellnat kernel: [220504.793822]  ffffffff81b3fcc0 ffff88105f4a3d58 ffffffff817afcd5 0000000000000000

      Jan 31 02:33:33 hellnat kernel: [220504.793827]  ffff88105f4a3da8 ffff88105f4a3d98 ffffffff81074dea 0000000000000286

      Jan 31 02:33:33 hellnat kernel: [220504.793830]  0000000000000008 ffff88105b65a000 0000000000000040 ffff88105748cf40

      Jan 31 02:33:33 hellnat kernel: [220504.793835] Call Trace:

      Jan 31 02:33:33 hellnat kernel: [220504.793837]  <IRQ>  [<ffffffff817afcd5>] dump_stack+0x45/0x57

      Jan 31 02:33:33 hellnat kernel: [220504.793857]  [<ffffffff81074dea>] warn_slowpath_common+0x8a/0xc0

      Jan 31 02:33:33 hellnat kernel: [220504.793860]  [<ffffffff81074e66>] warn_slowpath_fmt+0x46/0x50

      Jan 31 02:33:33 hellnat kernel: [220504.793869]  [<ffffffff816cd69f>] dev_watchdog+0x24f/0x260

      Jan 31 02:33:33 hellnat kernel: [220504.793874]  [<ffffffff816cd450>] ? dev_graft_qdisc+0x80/0x80

      Jan 31 02:33:33 hellnat kernel: [220504.793879]  [<ffffffff810dac79>] call_timer_fn+0x39/0x110

      Jan 31 02:33:33 hellnat kernel: [220504.793883]  [<ffffffff816cd450>] ? dev_graft_qdisc+0x80/0x80

      Jan 31 02:33:33 hellnat kernel: [220504.793888]  [<ffffffff810dc440>] run_timer_softirq+0x220/0x320

      Jan 31 02:33:33 hellnat kernel: [220504.793898]  [<ffffffff8104a403>] ? lapic_next_deadline+0x33/0x40

      Jan 31 02:33:33 hellnat kernel: [220504.793905]  [<ffffffff81078f44>] __do_softirq+0xe4/0x270

      Jan 31 02:33:33 hellnat kernel: [220504.793909]  [<ffffffff8107930d>] irq_exit+0x9d/0xb0

      Jan 31 02:33:33 hellnat kernel: [220504.793916]  [<ffffffff817ba78a>] smp_apic_timer_interrupt+0x4a/0x60

      Jan 31 02:33:33 hellnat kernel: [220504.793924]  [<ffffffff817b87bd>] apic_timer_interrupt+0x6d/0x80

      Jan 31 02:33:33 hellnat kernel: [220504.793926]  <EOI>  [<ffffffff81650510>] ? cpuidle_enter_state+0x70/0x170

      Jan 31 02:33:33 hellnat kernel: [220504.793938]  [<ffffffff816504fd>] ? cpuidle_enter_state+0x5d/0x170

      Jan 31 02:33:33 hellnat kernel: [220504.793943]  [<ffffffff816506c7>] cpuidle_enter+0x17/0x20

      Jan 31 02:33:33 hellnat kernel: [220504.793949]  [<ffffffff810b54d4>] cpu_startup_entry+0x334/0x3d0

      Jan 31 02:33:33 hellnat kernel: [220504.793955]  [<ffffffff810e9e53>] ? clockevents_register_device+0xe3/0x140

      Jan 31 02:33:33 hellnat kernel: [220504.793960]  [<ffffffff81048bb7>] start_secondary+0x197/0x1c0

      Jan 31 02:33:33 hellnat kernel: [220504.793963] ---[ end trace 43e1a051ade0289e ]---

      Jan 31 02:33:33 hellnat kernel: [220504.793973] i40e 0000:81:00.0 p1p1: tx_timeout: VSI_seid: 399, Q 8, NTC: 0xd36, HWB: 0xa1, NTU: 0xa1, TAIL: 0xa1, INT: 0x0

      Jan 31 02:33:33 hellnat kernel: [220504.793976] i40e 0000:81:00.0 p1p1: tx_timeout recovery level 1, hung_queue 8

      Jan 31 02:33:43 hellnat watchquagga[2972]: zebra state -> unresponsive : no response yet to ping sent 10 seconds ago

      Jan 31 02:33:49 hellnat watchquagga[2972]: bgpd state -> unresponsive : no response yet to ping sent 10 seconds ago

      Jan 31 02:33:50 hellnat kernel: [220521.908228] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [kworker/13:1:536]

      Jan 31 02:33:50 hellnat kernel: [220521.908306] Modules linked in: nf_conntrack_netlink nfnetlink xt_tcpudp xt_multiport iptable_filter xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw ast ttm joydev intel_rapl iosf_mbi drm_kms_helper x86_pkg_temp_thermal intel_powerclamp drm syscopyarea sysfillrect sysimgblt coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel ipmi_ssif aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich mei_me sb_edac edac_core mei ipmi_si 8250_fintek ipmi_msghandler lp wmi acpi_pad parport ioatdma mac_hid shpchp nf_conntrack_ftp acpi_power_meter nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc tcp_htcp hid_generic i40e(OE) igb vxlan ip6_udp_tunnel i2c_algo_bit udp_tunnel usbhid dca uas configfs ahci ptp usb_storage hid megaraid_sas libahci pps_core

      Jan 31 02:33:50 hellnat kernel: [220521.908396] CPU: 13 PID: 536 Comm: kworker/13:1 Tainted: G        W  OE  3.19.0-47-generic #53~14.04.1-Ubuntu

      Jan 31 02:33:50 hellnat kernel: [220521.908399] Hardware name: Supermicro SYS-6018R-WTR/X10DRW-i, BIOS 1.1 08/13/2015

      Jan 31 02:33:50 hellnat kernel: [220521.908408] Workqueue: events inet_frag_worker

       

       

      The key lines, I think, are:

      Jan 31 02:33:33 hellnat kernel: [220504.793705] NETDEV WATCHDOG: p1p1 (i40e): transmit queue 8 timed out

      Jan 31 02:33:33 hellnat kernel: [220504.793973] i40e 0000:81:00.0 p1p1: tx_timeout: VSI_seid: 399, Q 8, NTC: 0xd36, HWB: 0xa1, NTU: 0xa1, TAIL: 0xa1, INT: 0x0

      Jan 31 02:33:33 hellnat kernel: [220504.793976] i40e 0000:81:00.0 p1p1: tx_timeout recovery level 1, hung_queue 8
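
To pull these values out of a larger syslog file, a small parser can help. The following is only a minimal sketch: it assumes the i40e tx_timeout message has exactly the format shown above, and the names `TX_TIMEOUT_RE` and `parse_tx_timeout` are made up for this illustration, not part of the driver or any tool. The field meanings given in the comments (next-to-clean, head write-back, next-to-use, tail register) are my reading of the abbreviations and should be checked against the driver source.

```python
import re

# Matches the i40e "tx_timeout" line as it appears in the log above.
# Assumption: the message layout does not differ between driver versions.
TX_TIMEOUT_RE = re.compile(
    r"(?P<iface>\S+): tx_timeout: VSI_seid: (?P<vsi_seid>\d+), Q (?P<queue>\d+), "
    r"NTC: (?P<ntc>0x[0-9a-fA-F]+), HWB: (?P<hwb>0x[0-9a-fA-F]+), "
    r"NTU: (?P<ntu>0x[0-9a-fA-F]+), TAIL: (?P<tail>0x[0-9a-fA-F]+), "
    r"INT: (?P<intr>0x[0-9a-fA-F]+)"
)


def parse_tx_timeout(line):
    """Return the tx_timeout fields as a dict, or None if the line does not match."""
    match = TX_TIMEOUT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    result = {
        "iface": fields["iface"],
        "vsi_seid": int(fields["vsi_seid"]),
        "queue": int(fields["queue"]),
    }
    # NTC: next descriptor to clean, HWB: head write-back value,
    # NTU: next descriptor to use, TAIL: tail register, INT: interrupt state
    # (interpretation assumed, see the note above).
    for key in ("ntc", "hwb", "ntu", "tail", "intr"):
        result[key] = int(fields[key], 16)
    return result


if __name__ == "__main__":
    sample = (
        "Jan 31 02:33:33 hellnat kernel: [220504.793973] i40e 0000:81:00.0 p1p1: "
        "tx_timeout: VSI_seid: 399, Q 8, NTC: 0xd36, HWB: 0xa1, NTU: 0xa1, "
        "TAIL: 0xa1, INT: 0x0"
    )
    print(parse_tx_timeout(sample))
```

In the line from my log, HWB, NTU and TAIL are all 0xa1 while NTC is 0xd36, so the clean index is far from the other three at the moment of the timeout.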

       

      We can see that TX queue 8 hung. What can cause this? I suspect a problem with the network adapter or its driver. Could you explain what happened and how to fix it? This is a serious problem for us, because all of our traffic passes through this machine.

      Some information from ethtool:

      # ethtool -i p1p1

      driver: i40e

      version: 1.3.49

      firmware-version: 4.53 0x80001da6 0.0.0

      bus-info: 0000:81:00.0

      supports-statistics: yes

      supports-test: yes

      supports-eeprom-access: yes

      supports-register-dump: yes

      supports-priv-flags: yes

       

      # ethtool -c p1p1

      Coalesce parameters for p1p1:

      Adaptive RX: off  TX: off

      stats-block-usecs: 0

      sample-interval: 0

      pkt-rate-low: 0

      pkt-rate-high: 0

      rx-usecs: 800

      rx-frames: 0

      rx-usecs-irq: 0

      rx-frames-irq: 256

      tx-usecs: 600

      tx-frames: 0

      tx-usecs-irq: 0

      tx-frames-irq: 256

      rx-usecs-low: 0

      rx-frame-low: 0

      tx-usecs-low: 0

      tx-frame-low: 0

      rx-usecs-high: 0

      rx-frame-high: 0

      tx-usecs-high: 0

      tx-frame-high: 0

       

      # ethtool -k p1p1

      Features for p1p1:

      rx-checksumming: on

      tx-checksumming: on

        tx-checksum-ipv4: on

        tx-checksum-ip-generic: off [fixed]

        tx-checksum-ipv6: on

        tx-checksum-fcoe-crc: off [fixed]

        tx-checksum-sctp: on

      scatter-gather: on

        tx-scatter-gather: on

        tx-scatter-gather-fraglist: off [fixed]

      tcp-segmentation-offload: off

        tx-tcp-segmentation: off

        tx-tcp-ecn-segmentation: off

        tx-tcp6-segmentation: off

      udp-fragmentation-offload: off [fixed]

      generic-segmentation-offload: off

      generic-receive-offload: off

      large-receive-offload: off [fixed]

      rx-vlan-offload: on

      tx-vlan-offload: on

      ntuple-filters: on

      receive-hashing: on

      highdma: on

      rx-vlan-filter: on

      vlan-challenged: off [fixed]

      tx-lockless: off [fixed]

      netns-local: off [fixed]

      tx-gso-robust: off [fixed]

      tx-fcoe-segmentation: off [fixed]

      tx-gre-segmentation: off [fixed]

      tx-ipip-segmentation: off [fixed]

      tx-sit-segmentation: off [fixed]

      tx-udp_tnl-segmentation: on

      fcoe-mtu: off [fixed]

      tx-nocache-copy: off

      loopback: off [fixed]

      rx-fcs: off [fixed]

      rx-all: off [fixed]

      tx-vlan-stag-hw-insert: off [fixed]

      rx-vlan-stag-hw-parse: off [fixed]

      rx-vlan-stag-filter: off [fixed]

      l2-fwd-offload: off [fixed]

      busy-poll: off [fixed]

       

       

      If you need more information, feel free to ask.

      Thank you in advance.

       

      Regards,

      Evgeny