8 Replies Latest reply on Feb 15, 2018 5:36 PM by Intel Corporation

    Performance of P2P DMA PCIe packets routed by CPU?




      I would like to know whether P2P DMA packets routed by the CPU on the PCIe bus have lower bandwidth than device-to-RAM transfers.



      When I run a sequential read workload (reading SSD data blocks) using peer-to-peer DMA from three Intel DC P4600 SSDs (striped with md-raid0) to an NVIDIA Tesla P40, throughput is worse (7.1GB/s) than the theoretical value (9.6GB/s).

      On the other hand, the same 3x Intel DC P4600 SSD configuration recorded 9.5GB/s when we tried SSD-to-RAM DMA with the same kernel driver.


      The GPU's device memory is mapped to the PCI BAR1 region using NVIDIA GPUDirect RDMA. These pages therefore have physical addresses in the host system, so we can use them as the destination address of NVMe READ commands.

      I wrote a Linux kernel driver that mediates direct data transfers between NVMe SSDs and either the GPU or host RAM.

      It constructs NVMe READ commands to read particular SSD blocks and store them at the specified destination address (which may be GPU device memory), then enqueues the commands into the submission queue of the inbox nvme driver.
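      For reference, the flow I described can be sketched roughly as below. This is a simplified, non-buildable sketch, not my actual driver: the my_-prefixed names are illustrative, the PRP list handling is omitted, and the exact struct fields vary by kernel version. It uses the NVIDIA GPUDirect RDMA kernel API (nv-p2p.h) and the NVMe command layout from <linux/nvme.h>.

      ```c
      /* Hypothetical sketch of SSD-to-GPU P2P DMA setup; not buildable as-is. */
      #include <linux/nvme.h>
      #include "nv-p2p.h"            /* shipped with the NVIDIA driver source */

      static void my_free_callback(void *data) { /* release on GPU teardown */ }

      static int my_ssd2gpu_read(u64 gpu_vaddr, u32 nsid,
                                 u64 slba, u16 nblocks)
      {
          struct nvidia_p2p_page_table *ptab;
          struct nvme_command cmd;
          int rc;

          /* Pin the GPU device memory; the BAR1-mapped pages come back
           * with host physical addresses usable as DMA targets. */
          rc = nvidia_p2p_get_pages(0, 0, gpu_vaddr, nblocks * 512,
                                    &ptab, my_free_callback, NULL);
          if (rc)
              return rc;

          memset(&cmd, 0, sizeof(cmd));
          cmd.rw.opcode = nvme_cmd_read;
          cmd.rw.nsid   = cpu_to_le32(nsid);
          cmd.rw.slba   = cpu_to_le64(slba);
          cmd.rw.length = cpu_to_le16(nblocks - 1);   /* 0-based count */
          /* Point PRP1 at the first pinned GPU page; a PRP list would
           * be needed for transfers spanning multiple pages. */
          cmd.rw.dptr.prp1 =
              cpu_to_le64(ptab->pages[0]->physical_address);

          /* ...submit cmd through the inbox nvme driver's queues... */
          return 0;
      }
      ```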


      Our Linux kernel module itself is likely not the bottleneck, because it performs SSD-to-GPU P2P DMA at 6.3GB/s in the dual-SSD configuration. Each individual SSD performs at 3.2GB/s, equivalent to the catalog spec of the DC P4600.
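      To make the scaling behavior explicit, here is the simple arithmetic behind the numbers above (a sketch; 3.2GB/s is the measured single-SSD rate):

      ```python
      # Per-SSD sequential read rate (measured; matches DC P4600 catalog spec)
      per_ssd = 3.2  # GB/s

      # Ideal striped (md-raid0) throughput for 2 and 3 drives
      ideal_2 = 2 * per_ssd   # 6.4 GB/s
      ideal_3 = 3 * per_ssd   # 9.6 GB/s

      # Measured SSD-to-GPU P2P DMA throughput
      measured_2 = 6.3
      measured_3 = 7.1

      # Scaling efficiency: near-linear with 2 drives,
      # but clearly capped with 3 drives
      eff_2 = measured_2 / ideal_2   # ~0.98
      eff_3 = measured_3 / ideal_3   # ~0.74
      ```

      In other words, two drives scale almost perfectly, while the third drive adds only 0.8GB/s, which is why I suspect a routing limit rather than the driver.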



      1. Does the Xeon E5-2650 v4 (Broadwell-EP) processor have a hardware limitation that caps peer-to-peer DMA packet routing below the PCIe specification?
      2. If Broadwell-EP Xeon has such a limitation on P2P DMA routing, is it improved in Skylake-SP (Xeon Scalable)?


      ...and I would welcome any suggestions for other things to check beyond the CPU's routing capability.


      Best regards,


      * SSD-to-RAM works as expected (9.5GB/s)



      * SSD-to-GPU is slower than expected (7.1GB/s)