6 Replies Latest reply on Mar 3, 2012 8:52 AM by alex@zadarastorage.com

    82599EB: DRHD & DMAR faults, followed by Detected Tx Unit Hang

    alex@zadarastorage.com

      Hello everybody,

      We're running the Ubuntu Natty 2.6.38-13.53 kernel with the ixgbe 3.2.9-k2 and ixgbevf 1.0.19-k0 drivers. We use 82599EB dual-port NICs. Each port spawns 10 VFs, which are then attached to virtual machines under KVM.

      We frequently experience network failures, which start like this:

       

      Dec 22 14:41:07 ccmaster kernel: [190048.835136] DRHD: handling fault status reg 2
      Dec 22 14:41:07 ccmaster kernel: [190048.864523] DMAR:[DMA Read] Request device [03:11.7] fault addr 79634000
      Dec 22 14:41:07 ccmaster kernel: [190048.864525] DMAR:[fault reason 06] PTE Read access is not set
      Dec 22 14:41:07 ccmaster kernel: [190049.014923] DRHD: handling fault status reg 102
      Dec 22 14:41:07 ccmaster kernel: [190049.044511] DMAR:[DMA Read] Request device [03:11.7] fault addr 79634000
      Dec 22 14:41:07 ccmaster kernel: [190049.044513] DMAR:[fault reason 06] PTE Read access is not set
      Dec 22 14:41:08 ccmaster kernel: [190050.355215] DRHD: handling fault status reg 202
      Dec 22 14:41:08 ccmaster kernel: [190050.385040] DMAR:[DMA Read] Request device [03:11.7] fault addr 77a92000
      Dec 22 14:41:08 ccmaster kernel: [190050.385041] DMAR:[fault reason 06] PTE Read access is not set
      Dec 22 14:41:09 ccmaster kernel: [190051.007798] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
      Dec 22 14:41:09 ccmaster kernel: [190051.043515] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
      Dec 22 14:41:09 ccmaster kernel: [190051.471541] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
      Dec 22 14:41:09 ccmaster kernel: [190051.510908] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
      Dec 22 14:41:10 ccmaster kernel: [190051.885971] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
      Dec 22 14:41:10 ccmaster kernel: [190051.923664] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
      Dec 22 14:41:10 ccmaster kernel: [190051.925334] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
      Dec 22 14:41:10 ccmaster kernel: [190051.964411] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
      Dec 22 14:41:10 ccmaster kernel: [190052.195640] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 4
      Dec 22 14:41:10 ccmaster kernel: [190052.235159] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 4
      Dec 22 14:41:11 ccmaster kernel: [190053.001909] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
      Dec 22 14:41:11 ccmaster kernel: [190053.040401] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
      Dec 22 14:41:12 ccmaster kernel: [190053.882821] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
      Dec 22 14:41:12 ccmaster kernel: [190053.920700] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
      Dec 22 14:41:12 ccmaster kernel: [190053.922305] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
      Dec 22 14:41:12 ccmaster kernel: [190053.960197] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
      Dec 22 14:41:13 ccmaster kernel: [190054.612941] ixgbe 0000:03:00.1: eth103: Detected Tx Unit Hang
      Dec 22 14:41:13 ccmaster kernel: [190054.612943]   Tx Queue             <0>
      Dec 22 14:41:13 ccmaster kernel: [190054.612944]   TDH, TDT             <100>, <122>
      Dec 22 14:41:13 ccmaster kernel: [190054.612944]   next_to_use          <122>
      Dec 22 14:41:13 ccmaster kernel: [190054.612945]   next_to_clean        <102>
      Dec 22 14:41:13 ccmaster kernel: [190054.612946] tx_buffer_info[next_to_clean]
      Dec 22 14:41:13 ccmaster kernel: [190054.612946]   time_stamp           <10121fa2d>
      Dec 22 14:41:13 ccmaster kernel: [190054.612947]   jiffies              <10121fc55>
      Dec 22 14:41:13 ccmaster kernel: [190054.838626] ixgbe 0000:03:00.1: eth103: tx hang 1 detected on queue 0, resetting adapter
      Dec 22 14:41:13 ccmaster kernel: [190054.838782] ixgbe 0000:03:00.1: eth103: Reset adapter
      Dec 22 14:41:13 ccmaster kernel: [190054.866337] ixgbe 0000:03:00.1: eth103: RXDCTL.ENABLE on Rx queue 20 not cleared within the polling period
      Dec 22 14:41:13 ccmaster kernel: [190055.083995] br103: port 1(eth103) entering forwarding state
      Dec 22 14:41:13 ccmaster kernel: [190055.232255] ixgbe 0000:03:00.1: master disable timed out
      Dec 22 14:41:15 ccmaster kernel: [190057.418550] ixgbe 0000:03:00.1: eth103: NIC Link is Up 10 Gbps, Flow Control: RX/TX
      Dec 22 14:41:15 ccmaster kernel: [190057.420402] br103: port 1(eth103) entering forwarding state
      Dec 22 14:41:15 ccmaster kernel: [190057.420405] br103: port 1(eth103) entering forwarding state
      Dec 22 14:41:15 ccmaster kernel: [190057.451889] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
      Dec 22 14:41:15 ccmaster kernel: [190057.491834] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
      Dec 22 14:41:15 ccmaster kernel: [190057.538455] ixgbe 0000:03:00.1: eth103: NIC Link is Down
      Dec 22 14:41:16 ccmaster kernel: [190058.001181] DRHD: handling fault status reg 302
      Dec 22 14:41:16 ccmaster kernel: [190058.029084] DMAR:[DMA Read] Request device [03:11.7] fault addr 79634000
      Dec 22 14:41:16 ccmaster kernel: [190058.029086] DMAR:[fault reason 06] PTE Read access is not set

       

      Dec 22 14:41:16 ccmaster kernel: [190058.338892] DRHD: handling fault status reg 402
      Dec 22 14:41:16 ccmaster kernel: [190058.367063] DMAR:[DMA Read] Request device [03:11.7] fault addr 77a92000
      Dec 22 14:41:16 ccmaster kernel: [190058.367064] DMAR:[fault reason 06] PTE Read access is not set
      Dec 22 14:41:16 ccmaster kernel: [190058.508637] br103: port 1(eth103) entering forwarding state
      Dec 22 14:41:17 ccmaster kernel: [190058.874750] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
      Dec 22 14:41:17 ccmaster kernel: [190058.874797] ixgbe 0000:03:00.1: eth103: NIC Link is Up 10 Gbps, Flow Control: RX/TX
      Dec 22 14:41:17 ccmaster kernel: [190058.876606] br103: port 1(eth103) entering forwarding state
      Dec 22 14:41:17 ccmaster kernel: [190058.876609] br103: port 1(eth103) entering forwarding state
      Dec 22 14:41:17 ccmaster kernel: [190058.912721] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
      Dec 22 14:41:17 ccmaster kernel: [190058.914695] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
      Dec 22 14:41:17 ccmaster kernel: [190058.952633] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
      Dec 22 14:41:17 ccmaster kernel: [190058.981264] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
      Dec 22 14:41:17 ccmaster kernel: [190059.021207] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
      Dec 22 14:41:17 ccmaster kernel: [190059.184576] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 4
      Dec 22 14:41:17 ccmaster kernel: [190059.224515] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 4
      Dec 22 14:41:17 ccmaster kernel: [190059.449368] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
      Dec 22 14:41:17 ccmaster kernel: [190059.488744] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
      Dec 22 14:41:19 ccmaster kernel: [190060.872227] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
      Dec 22 14:41:19 ccmaster kernel: [190060.911541] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
      Dec 22 14:41:19 ccmaster kernel: [190060.912297] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
      Dec 22 14:41:19 ccmaster kernel: [190060.949461] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
      Dec 22 14:41:19 ccmaster kernel: [190060.987209] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
      Dec 22 14:41:19 ccmaster kernel: [190061.018071] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
      Dec 22 14:41:19 ccmaster kernel: [190061.446453] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
      Dec 22 14:41:19 ccmaster kernel: [190061.485732] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
      Dec 22 14:41:21 ccmaster kernel: [190062.872209] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
      Dec 22 14:41:21 ccmaster kernel: [190062.908506] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
      Dec 22 14:41:21 ccmaster kernel: [190062.914417] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
      Dec 22 14:41:21 ccmaster kernel: [190062.946422] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
      Dec 22 14:41:21 ccmaster kernel: [190062.983951] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
      Dec 22 14:41:21 ccmaster kernel: [190063.014967] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
      Dec 22 14:41:21 ccmaster kernel: [190063.179077] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 4
      Dec 22 14:41:21 ccmaster kernel: [190063.218310] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 4
      Dec 22 14:41:21 ccmaster kernel: [190063.447964] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
      Dec 22 14:41:21 ccmaster kernel: [190063.482670] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
      Dec 22 14:41:23 ccmaster kernel: [190064.871351] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
      Dec 22 14:41:23 ccmaster kernel: [190064.905438] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
      Dec 22 14:41:23 ccmaster kernel: [190064.911111] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
      Dec 22 14:41:23 ccmaster kernel: [190064.943364] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
      Dec 22 14:41:23 ccmaster kernel: [190064.980803] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
      Dec 22 14:41:23 ccmaster kernel: [190065.011891] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
      Dec 22 14:41:23 ccmaster kernel: [190065.445244] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
      Dec 22 14:41:23 ccmaster kernel: [190065.479603] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
      Dec 22 14:41:24 ccmaster kernel: [190065.626069] ixgbe 0000:03:00.1: eth103: Detected Tx Unit Hang
      Dec 22 14:41:24 ccmaster kernel: [190065.626071]   Tx Queue             <0>
      Dec 22 14:41:24 ccmaster kernel: [190065.626072]   TDH, TDT             <6>, <31>
      Dec 22 14:41:24 ccmaster kernel: [190065.626073]   next_to_use          <31>
      Dec 22 14:41:24 ccmaster kernel: [190065.626074]   next_to_clean        <7>
      Dec 22 14:41:24 ccmaster kernel: [190065.626075] tx_buffer_info[next_to_clean]
      Dec 22 14:41:24 ccmaster kernel: [190065.626076]   time_stamp           <10121fdb0>
      Dec 22 14:41:24 ccmaster kernel: [190065.626077]   jiffies              <1012200a4>
      Dec 22 14:41:24 ccmaster kernel: [190065.833461] ixgbe 0000:03:00.1: eth103: tx hang 2 detected on queue 0, resetting adapter
      Dec 22 14:41:24 ccmaster kernel: [190065.833546] ixgbe 0000:03:00.1: eth103: Reset adapter
      Dec 22 14:41:24 ccmaster kernel: [190065.857635] ixgbe 0000:03:00.1: eth103: RXDCTL.ENABLE on Rx queue 20 not cleared within the polling period

       

      Dec 22 14:41:26 ccmaster kernel: [190068.401926] scst[2044] scst_init_session[6287]: Using security group "iqn.2011-04.com.zadarastorage:288:vc-0" for initiator "iqn.2011-04.com.zadarastorage:288:vc-0"
      Dec 22 14:41:26 ccmaster kernel: [190068.402184] scst[29174] scst_cmd_thread[4294]: Processing thread zdr-0-22_0 (PID 29174) started
      Dec 22 14:41:26 ccmaster kernel: [190068.435081] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
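
      To relate the faulting device [03:11.7] in the DMAR lines to a VF number in the ixgbe lines, here is a rough sketch of the SR-IOV routing ID arithmetic. The First VF Offset (0x80) and VF Stride (2) used below are assumptions, not values taken from this log; they should be checked against the SR-IOV capability that `lspci -vvv` reports for the PF. The function names are only illustrative.

```python
def parse_bdf(bdf):
    """Split a 'bus:dev.fn' string (hex fields) into three integers."""
    bus, rest = bdf.split(":")
    dev, fn = rest.split(".")
    return int(bus, 16), int(dev, 16), int(fn, 16)


def routing_id(bdf):
    """PCIe routing ID: 8-bit bus, 5-bit device, 3-bit function."""
    bus, dev, fn = parse_bdf(bdf)
    return (bus << 8) | (dev << 3) | fn


def vf_index(pf_bdf, vf_bdf, first_vf_offset=0x80, vf_stride=2):
    """Return the 0-based VF index of vf_bdf under pf_bdf, or None.

    first_vf_offset and vf_stride are ASSUMED defaults; read the real
    values from the PF's SR-IOV capability before trusting the result.
    """
    delta = routing_id(vf_bdf) - routing_id(pf_bdf) - first_vf_offset
    if delta < 0 or delta % vf_stride:
        return None  # this address does not belong to the given PF
    return delta // vf_stride


if __name__ == "__main__":
    for pf in ("03:00.0", "03:00.1"):
        print(pf, "->", vf_index(pf, "03:11.7"))
```

      Under those assumed values, 03:11.7 does not belong to 03:00.0 and comes out as VF 7 on 03:00.1, which is one of the VFs sending reset messages above. If the real offset or stride differs, the mapping changes accordingly.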

       

      Can anybody please advise what is going wrong here, and what else we can do to debug this problem?
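
      In case it helps with debugging, here is a small sketch for summarizing a log like the one above: it counts DMAR faults per device and address, counts VF reset messages per VF, and reports the difference between `jiffies` and `time_stamp` for each Tx hang dump. The regular expressions are written only against the line formats shown in this post, and `summarize` is a hypothetical helper name, so other kernel or driver versions may need adjustments.

```python
import re
from collections import Counter

FAULT_RE = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device \[([0-9a-f:.]+)\] "
    r"fault addr ([0-9a-f]+)"
)
RESET_RE = re.compile(r"VF Reset msg received from vf (\d+)")
HEX_FIELD_RE = re.compile(r"\b(time_stamp|jiffies)\s+<([0-9a-f]+)>")


def summarize(lines):
    """Return (faults, resets, stalls) for an iterable of log lines."""
    faults = Counter()   # (device, fault address) -> count
    resets = Counter()   # VF index -> count
    stalls = []          # jiffies - time_stamp for each Tx hang dump
    stamp = None
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            faults[(m.group(2), m.group(3))] += 1
            continue
        m = RESET_RE.search(line)
        if m:
            resets[int(m.group(1))] += 1
            continue
        m = HEX_FIELD_RE.search(line)
        if m:
            value = int(m.group(2), 16)
            if m.group(1) == "time_stamp":
                stamp = value
            elif stamp is not None:
                stalls.append(value - stamp)
                stamp = None
    return faults, resets, stalls


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
DMAR:[DMA Read] Request device [03:11.7] fault addr 79634000
DMAR:[DMA Read] Request device [03:11.7] fault addr 79634000
DMAR:[DMA Read] Request device [03:11.7] fault addr 77a92000
ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
  time_stamp           <10121fa2d>
  jiffies              <10121fc55>
""".splitlines()

if __name__ == "__main__":
    faults, resets, stalls = summarize(SAMPLE)
    print(faults, resets, stalls)
```

      The same function can be fed an open log file instead of the sample lines. The stall values are in jiffies; converting them to seconds requires the kernel's HZ setting, which is not shown in this log.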

       

      Thanks,

        Alex.