1 Reply Latest reply on Aug 6, 2013 12:00 PM by sylvia_intel

    Local CPU may degrade Remote CPU performance on Packet Receiving

    KayZzz

      I have a server with two Intel Xeon E5-2620 CPUs (Sandy Bridge) and a dual-port 10 Gbps 82599 NIC, which I use for high-performance computing. From the PCI affinity, I see that the 10G NIC is attached to CPU1. I launched several packet-receiving threads to run experiments; each thread receives packets, parses the IP/UDP headers, and copies them into a buffer. The driver I use for the 10G NIC is IOEngine (PacketShader/Packet-IO-Engine on GitHub).
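
      For anyone who wants to reproduce the PCI affinity check, here is a minimal Python sketch, assuming a Linux host that exposes the usual sysfs files. The interface name `eth2` in the usage note is hypothetical, and the `sysfs_root` parameter is my own addition so the functions can be pointed at a test directory.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a sysfs CPU list such as '0-5,12-17' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def nic_numa_node(iface, sysfs_root="/sys"):
    """Return the NUMA node reported for the NIC's PCI device (-1 if unknown)."""
    path = Path(sysfs_root) / "class" / "net" / iface / "device" / "numa_node"
    return int(path.read_text().strip())


def node_cpus(node, sysfs_root="/sys"):
    """Return the logical CPU ids that belong to the given NUMA node."""
    path = Path(sysfs_root) / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    return parse_cpulist(path.read_text())
```

      Calling `node_cpus(nic_numa_node("eth2"))` would then list the logical CPUs that are local to the NIC, which are the ones I refer to as CPU1 here.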

       

      Q1: An idle CPU1 degrades packet-receiving performance on CPU0

       

      1.1) If 1, 2, or 4 threads are bound to CPU0, the overall throughput of all threads is about 2.6-3.2 Gbps.

      1.2) If 2 threads are bound to CPU1, the overall throughput is 16.X Gbps.

      1.3) If 4 threads are bound to CPU1, the overall throughput is 19.X Gbps (close to the maximum for 2 x 10G ports).
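
      To make "bound to a CPU" concrete, here is a rough Python sketch of pinning a worker thread to a set of logical CPUs on Linux. It is only an illustration: my actual receiving threads are not written in Python, and `run_pinned` is a hypothetical helper name.

```python
import os
import threading


def run_pinned(cpus, worker, *args):
    """Run worker(*args) in a new thread restricted to the given CPU ids."""
    result = {}

    def target():
        # On Linux, pid 0 means "the calling thread", so only this
        # thread is restricted, not the whole process.
        os.sched_setaffinity(0, cpus)
        result["affinity"] = os.sched_getaffinity(0)
        result["value"] = worker(*args)

    t = threading.Thread(target=target)
    t.start()
    t.join()
    return result
```

      Passing the logical CPU ids of one socket as `cpus` keeps the thread on that socket for its whole lifetime.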

       

      Since CPU0 is not directly connected to the NIC, it seems that the maximum receive rate on CPU0 is 2.6-3.2 Gbps. However, I found that if some compute-intensive processes run on CPU1, the packet-receiving threads on CPU0 speed up to 15.X Gbps with 2 threads and 19.X Gbps with 4 threads.

       

      Is this due to power management? If CPU1 is idle, does it run in a power-saving mode? Even if it does, how can CPU1 influence the performance of CPU0? Is there something I don't know about QPI?
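
      One thing that might help narrow this down is looking at the frequency-scaling governor on each core. The sketch below assumes a Linux kernel that exposes cpufreq under sysfs. It only reports the governor, so it would not reveal idle (C-state) behaviour, and `sysfs_root` is again a parameter I added for testing.

```python
from pathlib import Path


def cpu_governors(sysfs_root="/sys"):
    """Map each cpuN directory to its cpufreq scaling governor, where exposed."""
    base = Path(sysfs_root) / "devices" / "system" / "cpu"
    governors = {}
    # cpu[0-9]* matches cpu0, cpu12, ... but not cpufreq or cpuidle.
    for gov_file in sorted(base.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name
        governors[cpu_name] = gov_file.read_text().strip()
    return governors
```

      If the cores on CPU1 report a power-saving governor while idle, that would at least be consistent with the power-management guess, though it would not prove it.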

       

      Q2: An overloaded CPU1 degrades packet-receiving performance on CPU0

       

      2.1) If 1 packet-receiving thread runs on CPU0 and 1 runs on CPU1, the overall throughput is 10 Gbps. Each thread achieves almost the same rate: 5.X Gbps.

      2.2) If 2 packet-receiving threads run on CPU0 and 2 run on CPU1, the overall throughput is 13 Gbps. Each thread again achieves almost the same rate, 3.X Gbps, which is lower than in 2.1, 1.2, and 1.3.
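
      The per-thread figures follow from dividing the aggregate rate evenly across threads. A tiny Python check, using only the totals reported in 2.1 and 2.2 and assuming an even split:

```python
def per_thread_gbps(total_gbps, n_threads):
    """Average receive rate per thread, assuming an even split."""
    return total_gbps / n_threads


# (aggregate Gbps, total number of receiving threads) from cases 2.1 and 2.2
cases = {"2.1 (1+1 threads)": (10, 2), "2.2 (2+2 threads)": (13, 4)}
for name, (total, n) in cases.items():
    print(f"{name}: {per_thread_gbps(total, n):.2f} Gbps per thread")
```

      This gives 5.00 Gbps per thread for 2.1 and 3.25 Gbps per thread for 2.2, matching the 5.X and 3.X figures.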

       

      In short, when receiving threads run on both CPU0 and CPU1, none of the threads reaches its maximum throughput, and all of them perform at almost the same rate.

       

      I think there is a lot I don't know about NUMA and QPI. Can anyone help explain this? Thanks.