7 Replies Latest reply on Mar 31, 2015 11:39 PM by Sandy_Intel

    i350-T4 Windows Server 2012 R2 VMQ blue screens during live migration

    jayee

      i350-T4 NIC

      Windows Server 2012 R2

      All 4 i350 ports configured as a Windows LBFO Team (switch independent / dynamic load balancing)

      Converged networking (Hyper-V vSwitch bound to the LBFO team, with vNICs configured on the vSwitch for host OS operations: Management, Cluster/CSV, and Live Migration)

      VLAN tagging in use on VMs and vNICs, except the vNIC used for management, which is 'native'

      VMQ enabled on all i350 ports

      SR-IOV disabled on all i350 ports

      Server 2012 R2 Hyper-V cluster

      Fully patched with update rollups and hotfixes currently available

      Drivers 19.3 (latest from the Intel website)

       

      In the above configuration, the destination server blue screens during live migration. I can sometimes get one live migration to work, but a second attempt to live migrate a different VM to the same destination host causes the host to blue screen.

       

      I can reproduce this issue very easily on any host in the cluster. They all show the same behaviour.

       

      If I disable VMQ, the issue stops.

       

      Also, we don't see this issue with the same hardware and the same configuration on Server 2012 (non-R2), though I note that the NIC driver is different there (e1r63x64.sys on 2012 as opposed to e1r64x64.sys on 2012 R2).

       

      Crash dump analysis always shows the faulting driver as e1r64x64.sys:

       

      BugCheck 1E, {ffffffffc0000005, fffff802be6a2550, ffffd000575b3b58, ffffd000575b3360}

      *** ERROR: Module load completed but symbols could not be loaded for e1r64x64.sys
      Probably caused by : e1r64x64.sys ( e1r64x64+280e7 )

      Followup: MachineOwner
      ---------

      18: kd> !analyze -v
      *******************************************************************************
      *                                                                             *
      *                        Bugcheck Analysis                                    *
      *                                                                             *
      *******************************************************************************

      KMODE_EXCEPTION_NOT_HANDLED (1e)
      This is a very common bugcheck.  Usually the exception address pinpoints
      the driver/function that caused the problem.  Always note this address
      as well as the link date of the driver/image that contains this address.
      Arguments:
      Arg1: ffffffffc0000005, The exception code that was not handled
      Arg2: fffff802be6a2550, The address that the exception occurred at
      Arg3: ffffd000575b3b58, Parameter 0 of the exception
      Arg4: ffffd000575b3360, Parameter 1 of the exception

      Debugging Details:
      ------------------


      WRITE_ADDRESS: unable to get nt!MmNonPagedPoolStart
      unable to get nt!MmSizeOfNonPagedPoolInBytes
      ffffd000575b3360

      EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

      FAULTING_IP:
      nt!ExQueryDepthSList+0
      fffff802`be6a2550 8b01            mov     eax,dword ptr [rcx]

      EXCEPTION_PARAMETER1:  ffffd000575b3b58

      EXCEPTION_PARAMETER2:  ffffd000575b3360

      BUGCHECK_STR:  0x1E_c0000005

      DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

      PROCESS_NAME:  System

      CURRENT_IRQL:  0

      ANALYSIS_VERSION: 6.3.9600.17237 (debuggers(dbg).140716-0327) amd64fre

      EXCEPTION_RECORD:  0000000000000001 -- (.exr 0x1)
      Cannot read Exception record @ 0000000000000001

      TRAP_FRAME:  ffffe800b6200000 -- (.trap 0xffffe800b6200000)
      Unable to read trap frame at ffffe800`b6200000

      LAST_CONTROL_TRANSFER:  from fffff802be7efefb to fffff802be768ca0

      STACK_TEXT: 
      ffffd000`575b2b38 fffff802`be7efefb : 00000000`0000001e ffffffff`c0000005 fffff802`be6a2550 ffffd000`575b3b58 : nt!KeBugCheckEx
      ffffd000`575b2b40 fffff802`be779846 : 00000000`00000000 fffff800`35d0c991 ffffe800`b1172d02 ffffd000`575b2e29 : nt!KiFatalFilter+0x1f
      ffffd000`575b2b80 fffff802`be757d56 : 00000000`00000000 fffff802`be6e19a6 ffffe000`516d3f90 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x696
      ffffd000`575b2bc0 fffff802`be7701ed : 00000000`00000000 ffffd000`575b2d60 ffffd000`575b3b58 ffffd000`575b2d60 : nt!_C_specific_handler+0x86
      ffffd000`575b2c30 fffff802`be6fd3a5 : 00000000`00000001 fffff802`be615000 ffffd000`575b3b00 fffff800`00000000 : nt!RtlpExecuteHandlerForException+0xd
      ffffd000`575b2c60 fffff802`be6fc25f : ffffd000`575b3b58 ffffd000`575b3860 ffffd000`575b3b58 ffffe800`b12ee480 : nt!RtlDispatchException+0x1a5
      ffffd000`575b3330 fffff802`be7748c2 : 00000000`00000001 fffffa80`1b6de000 ffffe800`b6200000 00000000`00000000 : nt!KiDispatchException+0x61f
      ffffd000`575b3a20 fffff802`be772dfe : 00000000`00000011 00000000`00000002 00000000`00000001 fffff802`be8a929a : nt!KiExceptionDispatch+0xc2
      ffffd000`575b3c00 fffff802`be6a2550 : fffff800`35d04875 ffffe800`b0f3c870 ffffd000`575b3e00 ffffe000`517cd000 : nt!KiGeneralProtectionFault+0xfe
      ffffd000`575b3d98 fffff800`35d04875 : ffffe800`b0f3c870 ffffd000`575b3e00 ffffe000`517cd000 00000000`00000000 : nt!ExQueryDepthSList
      ffffd000`575b3da0 fffff800`372520e7 : ffffe000`517ce540 ffffe000`517cd000 ffffe800`b1496c60 00000000`00000000 : NDIS!NdisFreeNetBufferList+0xb5
      ffffd000`575b3e20 fffff800`372528a9 : ffffe000`517ce540 ffffe000`517cd000 00000000`00000001 00000000`00000000 : e1r64x64+0x280e7
      ffffd000`575b3e50 fffff800`37252c00 : ffffe000`517ce540 00000000`00000001 00000000`00000000 ffffe000`517cd000 : e1r64x64+0x288a9
      ffffd000`575b3e90 fffff800`37264a9d : ffffe000`517cd000 ffffe000`00000001 ffffe000`00000001 ffff0001`00000001 : e1r64x64+0x28c00
      ffffd000`575b3ec0 fffff800`37261c7b : 00000000`00000000 ffffd000`575469a0 ffffe000`517cd000 00000000`00000000 : e1r64x64+0x3aa9d
      ffffd000`575b3f00 fffff800`3725a909 : 00000000`00000002 00000000`00000000 ffffe000`517cd000 ffffd000`575469a0 : e1r64x64+0x37c7b
      ffffd000`575b3f50 fffff800`3725b02b : ffffe800`b528cde0 fffff800`35d04671 ffffd000`575b40f0 ffffe000`51105ad0 : e1r64x64+0x30909
      ffffd000`575b3fc0 fffff800`35d8f0fa : ffffe800`b5b87868 ffffe800`b5b87858 ffffe800`b5b87854 ffffe800`b0d501a0 : e1r64x64+0x3102b
      ffffd000`575b4030 fffff800`35d033a3 : ffffe800`b0d501a0 ffffd000`575b40e9 ffffe800`b5b87820 00000000`00000011 : NDIS!ndisMInvokeOidRequest+0x4e
      ffffd000`575b4070 fffff800`35d04324 : 00000000`00000000 ffffe800`b0d501a0 ffffe800`b5b87868 00000000`00000000 : NDIS!ndisMDoOidRequest+0x39b
      ffffd000`575b4150 fffff800`35d0475e : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : NDIS!ndisQueueOidRequest+0x4c4
      ffffd000`575b42f0 fffff800`3679719e : ffffe800`b147b8c0 00000000`00010224 ffffe800`b147b8c0 ffffe000`52bf4010 : NDIS!NdisFOidRequest+0xc2
      ffffd000`575b43b0 fffff800`35d038de : ffffe800`b5b87820 ffffe000`51105ad0 00000000`00000000 ffffe000`52bea010 : wfplwfs!LwfLowerOidRequest+0x6e
      ffffd000`575b43e0 fffff802`be6e19a6 : ffffd000`575b46d0 ffffd000`575af000 00000000`00000000 00000000`00000000 : NDIS!ndisFDoOidRequestInternal+0x2ee
      ffffd000`575b44e0 fffff800`35d04131 : fffff800`35d035f0 ffffe000`52bea010 ffffe800`b1a0b400 00000000`00000000 : nt!KeExpandKernelStackAndCalloutInternal+0xe6
      ffffd000`575b45d0 fffff800`35d03d27 : 00000000`00000102 ffffd000`53203200 00000000`00000000 ffffd000`575467d0 : NDIS!ndisQueueOidRequest+0x2d1
      ffffd000`575b4770 fffff800`372ea204 : 00000000`00000120 ffffe000`516d4000 00000000`00000120 ffffe000`52bf5000 : NDIS!ndisMOidRequest+0x193
      ffffd000`575b4880 fffff800`372e858d : ffffe000`5200ff00 ffffd000`00000001 ffffe000`52bf5020 ffffe800`b5b87820 : NdisImPlatform!implatDoOidRequestOnAdapter+0x22c
      ffffd000`575b4900 fffff800`372ea32c : ffffe800`b1ae3880 fffff802`be6546c9 ffffe000`52bf5000 00000000`00000000 : NdisImPlatform!implatOidRequestInternal+0x1fd
      ffffd000`575b4ac0 fffff802`be650f4a : ffffe800`b1b54ca0 ffffe000`52c10050 ffffe000`52c10050 fffff800`6977444e : NdisImPlatform!implatOidRequestWorkItem+0x24
      ffffd000`575b4af0 fffff802`be651a2b : fffff800`362ed330 fffff802`be650ed4 ffffd000`575b4bd0 ffffe800`b1b54ca0 : nt!IopProcessWorkItem+0x76
      ffffd000`575b4b50 fffff802`be6ee514 : 00000000`00000000 ffffe800`b1ae3880 ffffe800`b1ae3880 ffffe000`50832900 : nt!ExpWorkerThread+0x293
      ffffd000`575b4c00 fffff802`be76f2c6 : ffffd000`55503180 ffffe800`b1ae3880 ffffd000`5550f7c0 00000014`00000006 : nt!PspSystemThreadStartup+0x58
      ffffd000`575b4c60 00000000`00000000 : ffffd000`575b5000 ffffd000`575af000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16


      STACK_COMMAND:  kb

      FOLLOWUP_IP:
      e1r64x64+280e7
      fffff800`372520e7 813ddfd0030001000500 cmp dword ptr [e1r64x64+0x651d0 (fffff800`3728f1d0)],50001h

      SYMBOL_STACK_INDEX:  b

      SYMBOL_NAME:  e1r64x64+280e7

      FOLLOWUP_NAME:  MachineOwner

      MODULE_NAME: e1r64x64

      IMAGE_NAME:  e1r64x64.sys

      DEBUG_FLR_IMAGE_TIMESTAMP:  531f9173

      FAILURE_BUCKET_ID:  0x1E_c0000005_e1r64x64+280e7

      BUCKET_ID:  0x1E_c0000005_e1r64x64+280e7

      ANALYSIS_SOURCE:  KM

      FAILURE_ID_HASH_STRING:  km:0x1e_c0000005_e1r64x64+280e7

      FAILURE_ID_HASH:  {6d380028-1764-7d25-d8c5-05559a475808}
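
      For anyone reading the dump, here is a minimal Python sketch that decodes a few of the raw values above. The constants are copied from the dump; the lookup tables and helper names (ntstatus, link_date) are my own for illustration and only cover the codes seen here. It assumes DEBUG_FLR_IMAGE_TIMESTAMP is the usual PE link timestamp in seconds since the Unix epoch.

```python
from datetime import datetime, timezone

# Values copied from the dump above.
BUGCHECK_CODE = 0x1E
ARG1 = 0xFFFFFFFFC0000005      # exception code, sign-extended to 64 bits
IMAGE_TIMESTAMP = 0x531F9173   # DEBUG_FLR_IMAGE_TIMESTAMP for e1r64x64.sys

# Illustrative lookup tables, limited to the codes in this dump.
BUGCHECK_NAMES = {0x1E: "KMODE_EXCEPTION_NOT_HANDLED"}
NTSTATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def ntstatus(arg):
    """Strip the 64-bit sign extension to get the 32-bit NTSTATUS."""
    return arg & 0xFFFFFFFF


def link_date(timestamp):
    """Interpret a PE image timestamp as a UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


status = ntstatus(ARG1)
built = link_date(IMAGE_TIMESTAMP)

print(f"Bugcheck 0x{BUGCHECK_CODE:X}: {BUGCHECK_NAMES[BUGCHECK_CODE]}")
print(f"Exception 0x{status:08X}: {NTSTATUS_NAMES[status]}")
print(f"e1r64x64.sys link date: {built:%Y-%m-%d} (UTC)")
```

      If the assumption about the timestamp holds, the faulting e1r64x64.sys image was linked in March 2014, which is worth quoting when comparing against other driver packages.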

       

      So it seems that this Intel driver has issues with VMQ.

      VMQ is notorious for buggy NIC vendor drivers on Server 2012 R2.
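
      To show why I read the stack as pointing at the driver, here is a rough Python sketch of the reasoning: walk the frames from the top and stop at the first one that is not in a core OS module. The frame symbols and the address are copied from the STACK_TEXT and FOLLOWUP_IP sections of the dump; the OS_MODULES set and the helper names are my own assumptions, and the real !analyze logic is certainly more involved than this.

```python
# Symbols copied from STACK_TEXT in the dump, top of stack first.
FRAMES = [
    "nt!KeBugCheckEx",
    "nt!KiFatalFilter+0x1f",
    "nt! ?? ::FNODOBFM::`string'+0x696",
    "nt!_C_specific_handler+0x86",
    "nt!RtlpExecuteHandlerForException+0xd",
    "nt!RtlDispatchException+0x1a5",
    "nt!KiDispatchException+0x61f",
    "nt!KiExceptionDispatch+0xc2",
    "nt!KiGeneralProtectionFault+0xfe",
    "nt!ExQueryDepthSList",
    "NDIS!NdisFreeNetBufferList+0xb5",
    "e1r64x64+0x280e7",
    "e1r64x64+0x288a9",
]

# Assumption: treat the kernel and NDIS as "OS" modules and skip them.
OS_MODULES = {"nt", "NDIS"}


def module_of(symbol):
    """Return the module name: the text before '!' or, without symbols, before '+'."""
    head = symbol.split("!", 1)[0]
    return head.split("+", 1)[0]


def first_third_party_frame(frames):
    """Return (index, symbol) of the first frame outside OS_MODULES, or None."""
    for index, symbol in enumerate(frames):
        if module_of(symbol) not in OS_MODULES:
            return index, symbol
    return None


index, symbol = first_third_party_frame(FRAMES)
print(f"frame 0x{index:x}: {symbol}")

# FOLLOWUP_IP in the dump is fffff800`372520e7 for e1r64x64+280e7,
# so the driver's load address follows by subtraction.
FOLLOWUP_IP = 0xFFFFF800372520E7
OFFSET = 0x280E7
driver_base = FOLLOWUP_IP - OFFSET
print(f"e1r64x64 base: 0x{driver_base:016x}")
```

      The index this finds is the same one the debugger reports as SYMBOL_STACK_INDEX (b), and the frames below it show the call arriving as an OID request through NdisImPlatform, i.e. via the LBFO team.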

       

      Disabling VMQ is not an option for us in production. We need it to work.

      Can anyone please confirm that this issue exists with the latest 19.3 driver on Server 2012 R2?

      Any idea when it will be fixed?

       

      I'm shocked that such an awful bug would exist 12 months after the launch of Server 2012 R2, in the latest Intel drivers, for a technology that MS and Intel co-developed.

      I would expect this kind of thing from Broadcom; I wouldn't expect it from Intel. That's why we buy Intel.

      Perhaps we made a mistake there...

       

      Help and comments appreciated.