5 Replies Latest reply on Dec 8, 2016 4:57 PM by Intel Corporation

    Low crypto performance with QAT 8950

    YinXu

      Hello all,

           I've installed a DH8950 on CentOS 6.5 with kernel version 3.12.8, and I'm getting low performance from this card. Here are my test results:

      1. Running cpa_sample_code in kernel space reaches 50G throughput (as printed by the code), which matches Intel's stated figure.

      2. Running cpa_sample_code in user space reports 9T throughput (also printed by the code), which is clearly unrealistic; I don't believe it.

      3. Based on 1 and 2, I assume that the code running in kernel space prints reasonably correct information, so I ran the tests below in kernel space.

      4. I'm mainly interested in the symmetric crypto function of this hardware accelerator, so all my tests and results concern that function.

      5. I changed the loop count in the main function of cpa_sample_code so that the code sends only 2 submissions to QAT, and then prints the CPU cycles elapsed from performing the operation until both responses have arrived. The result is about 100,000 cycles, so each submission costs about 50,000 cycles.

      6. Building on 5, I changed the number of submissions to 20. This still costs about 100,000 cycles, so each submission costs about 5,000 cycles.

      7. As the number of submissions increases, the average CPU cycle cost decreases. When sending 600,000 submissions to QAT, the average cost per submission drops to 500 cycles.

      8. I also wrote test code based on the ipsec sample code and optimized it according to the performance guide. It runs in async mode: after calling the cpaPerformOp function, it loops polling the instance until the QAT response arrives (the callback function is called). The CPU cycles from calling cpaPerformOp until the response arrives can get down to 15,000.
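
      One possible way to read the numbers in 5 through 7 is a fixed cost plus a small per-submission cost, so the fixed part is spread over more submissions as the count grows. The sketch below is only an illustration of that arithmetic: the model itself, the function name, and the two constants (100,000 fixed cycles, 500 cycles per submission, taken from the measurements above) are assumptions, not something the sample code or the QAT documentation states.

```python
def avg_cycles_per_submission(n, fixed_cycles=100_000, per_op_cycles=500):
    """Average cycles per submission, ASSUMING total = fixed + n * per_op.

    fixed_cycles and per_op_cycles are rough guesses read off the
    measurements in the post, not values reported by QAT itself.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return fixed_cycles / n + per_op_cycles


# Evaluate the assumed model at the three submission counts from the tests.
for n in (2, 20, 600_000):
    print(n, round(avg_cycles_per_submission(n)))
```

      Under these assumptions the model gives about 50,500, 5,500 and 500 cycles per submission, which is close to the measured 50,000, 5,000 and 500, but it does not explain where the fixed cost comes from.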

           Here are my questions:

      1. What is the relationship between the CPU cycles and the number of submissions, and why?

      2. My application needs to encrypt/decrypt packets one by one, which means I have to wait for the response from QAT before I can send the next submission. According to the test results, this costs at least 15,000 CPU cycles per packet, which is intolerable. Is there any way to cut this cost down to 3,000 cycles?
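
      To put the 3,000-cycle target in perspective, here is a rough back-of-the-envelope sketch. It assumes the 15,000-cycle round trip stays constant no matter how many requests are outstanding, and it ignores the CPU cost of submitting and polling; both are assumptions, and the function names are made up for illustration. It only applies if the application could keep several packets in flight instead of strictly one at a time.

```python
import math


def cycles_per_packet(round_trip_cycles, in_flight):
    """Amortized cycles per packet with `in_flight` requests outstanding.

    ASSUMPTION: the round-trip latency does not grow with queue depth.
    """
    if round_trip_cycles <= 0 or in_flight <= 0:
        raise ValueError("arguments must be positive")
    return round_trip_cycles / in_flight


def min_in_flight(round_trip_cycles, target_cycles_per_packet):
    """Smallest number of outstanding requests that meets the target."""
    if round_trip_cycles <= 0 or target_cycles_per_packet <= 0:
        raise ValueError("arguments must be positive")
    return math.ceil(round_trip_cycles / target_cycles_per_packet)


depth = min_in_flight(15_000, 3_000)
print(depth, cycles_per_packet(15_000, depth))
```

      With the measured 15,000 cycles and a 3,000-cycle target, this gives a depth of 5 outstanding requests. With strictly one packet at a time, the same arithmetic leaves the full round trip as the per-packet cost.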

       

      Thanks sincerely!