0 Replies Latest reply on Jun 26, 2013 9:54 PM by JimJob

    NetEffect 10Gb Server 2012 SMBDirect BSOD under load (0x0D1)

    JimJob

      I've built a proof-of-concept environment for Server 2012 with Scale-Out File Servers, SMBDirect (over iWARP), and Hyper-V 2012 nodes.

       

      Essentially, I have 4 Scale-Out File Servers that host the Fibre Channel CSVs (with CSV caching enabled at 20% of RAM) and share that storage out through Continuously Available shares over SMB 3.0 with RDMA/SMBDirect.

       

      Each file server (4) and each Hyper-V server (6) has a single 10Gb RDMA adapter (the Hyper-V servers also use dedicated X520-DA2 NICs for VM networking). The file server and Hyper-V server RDMA adapters are on the same L2 VLAN on a common Cisco Nexus 5K switch.

       

      Everything worked well until I reached about 250 concurrent VMs. From that point, a file server node would periodically BSOD (0xD1, DRIVER_IRQL_NOT_LESS_OR_EQUAL, in smbdirect.sys), although the file cluster handled these failures gracefully.

       

      As I increased load further, Hyper-V servers started failing with the same error.

       

      Beyond a certain load, each Hyper-V failure caused its VMs to fail over to the remaining nodes, which pushed their load up and made them BSOD in turn (a cascade that also reached the file servers).
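
      To put rough numbers on that cascade, here is a minimal Python sketch of average VM density as Hyper-V hosts drop out. It assumes VMs are spread evenly across the surviving hosts, which was not measured; the function names are hypothetical, and only the VM and host counts come from this environment.

```python
from typing import List, Tuple


def vms_per_host(total_vms: int, hosts: int) -> float:
    """Average VMs per host, assuming an even spread (an assumption, not a measurement)."""
    if hosts <= 0:
        raise ValueError("hosts must be positive")
    return total_vms / hosts


def cascade_profile(total_vms: int, hosts: int) -> List[Tuple[int, float]]:
    """Return (surviving_hosts, average VMs per host) as hosts fail one at a time."""
    return [(n, vms_per_host(total_vms, n)) for n in range(hosts, 0, -1)]


if __name__ == "__main__":
    # 250 concurrent VMs on 6 Hyper-V hosts, the point where BSODs first appeared.
    for surviving, load in cascade_profile(250, 6):
        print(f"{surviving} hosts -> {load:.1f} VMs each")
```

      With 250 VMs on 6 hosts, losing a single node raises the average on the survivors from about 42 to 50 VMs per host, so every failure adds load to exactly the nodes that are already close to the point where the BSODs appeared.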

       

      I was able to stabilize the environment by disabling NetworkDirect in the adapter properties (essentially turning off RDMA), and have since taken the workload to over 535 running VMs.

       

      While I understand that the crash dump doesn't point directly at the N2E63x64.sys driver, these errors are typically driver-related. I am using the "latest" drivers (v1.185.11.11, 10/19/2012), and the issue only appears under load. I am fully patch compliant and have installed all recommended Server 2012 and Hyper-V cluster hotfixes outlined in KB2784261.

       

      File servers are HP380G6 servers (2x L5630, 48GB RAM), and Hyper-V servers are HP585G7 servers (4x AMD 6172, 256GB RAM), all with the latest BIOS and drivers from HP.

       

      Has anyone else seen similar behavior?  And most importantly... how do we fix it?

       

      Thanks!