Performance is why you buy 10 Gigabit and 1 Gigabit products.  But performance doesn’t always just happen.  When should you make changes?  There is no magic formula for it.  Let’s look at the problem as a system, from the network to the application.
     The first level of buffering is at the switch.  Using flow control, or better yet Data Center Bridging with its end point to end point flow control, will help keep down the number of dropped packets by making sure the network is aware of the resource level of the system.  With flow control you should only get what you have room for.  Once a packet gets to the system, its first buffer is our First In, First Out (FIFO) storage.  If the bus is busy, or things in general are delaying the processing of packets in the host system, the FIFO can store both TX and RX packets.  The descriptors and host memory buffers are the next queuing place.  Next is the driver interface, since almost every driver these days is a miniport driver, and above that is the protocol stack.  Today that is mostly TCP/IP.  Finally things end up in the application.  This is a slightly simplified description; it leaves out components like our intermediate driver (ANS), which can do its own queuing but shouldn’t be holding any resource long enough to influence our discussion.
     To tune the application’s network performance, you need to tune all the layers in between.  For some O/Ses that is easy; for some, not so much.  Most modern operating systems are dynamic and will try to keep up, but some settings, like Maximum TCP Buffer, can be static and may need some manual tweaking.
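As one illustration of a static buffer setting that may need manual tweaking, here is a minimal Python sketch that asks the operating system for a larger per-socket receive buffer and reports what was actually granted.  It uses the standard SO_RCVBUF socket option; the helper name set_receive_buffer and the 1 MiB request are assumptions chosen for the example, not recommended values, and the system-wide maximum is configured elsewhere in an O/S-specific way.

```python
import socket


def set_receive_buffer(sock, requested_bytes):
    """Request a larger receive buffer and return (before, after) sizes.

    The OS is free to clamp or adjust the request, so always read the
    value back instead of assuming the request was honored.
    """
    before = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    after = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return before, after


if __name__ == "__main__":
    # Hypothetical request size; pick one based on your own measurements.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        before, after = set_receive_buffer(s, 1 << 20)
        print(f"SO_RCVBUF before={before} after={after}")
```

If the value read back is smaller than the request, the operating system's own limit is the thing to look at next.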
     If you have a large number of packets incoming and they are being processed slowly, you might want to turn up the number of buffers.  This will increase the entry buffers, but the real problem is why the packets are being processed slowly above it.  In the older Windows days you could set some parameters to make it allocate more TCP/IP and NET buffers, but that seems to have gone away.  If you know some good ones, post them in the comments.
     Having said that more can be better, here is why you shouldn’t always just put in more.  There is a cache of descriptors on chip that limits how many of the bigger list can be used at any given moment.  While the whole list is 2048, the hardware will only fetch and write back descriptors in bunches of 64.  This is mostly for cache-line alignment and internal architecture reasons, but it can make the 2048 get consumed in fits and starts instead of in a linear flow.  What this means is that while 2048 buffers might be ready, only 64 will be consumed before the hardware has to write those back and get new ones.  It isn’t always exactly 64, since it would be wasteful to wait for that many in a low traffic environment.  But it does mean that while 64 are being processed, the other 1984 are sitting idle.  Idle is of course relative, since at 1G speed all 2048 descriptors can be read, used and put back in less than a millisecond.  But the point still stands.  If the O/S can service and return the buffer to the driver in under a millisecond, you don’t need more buffers, since the data is moving fast enough.  If the buffers go to an upper layer like the stack or the application and sit, then more buffers will just treat a symptom and not the problem.  You are better off digging for the root cause than just slapping on more buffers.  The stack is an important part of the equation, so check its statistics and errors to see if your network is underperforming because of slowness there.
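The batch arithmetic above can be sketched in a few lines of Python.  This is only a model of the numbers quoted in the text (a ring of 2048 descriptors consumed 64 at a time); the function name descriptor_batches is made up for the example, and real hardware may fetch fewer than a full batch under light traffic.

```python
def descriptor_batches(ring_size=2048, batch_size=64):
    """Return (number_of_batches, descriptors_idle_while_one_batch_is_in_flight)."""
    full_batches, remainder = divmod(ring_size, batch_size)
    # A partial final batch still costs one more fetch/write-back cycle.
    batches = full_batches + (1 if remainder else 0)
    # While one batch is being worked on, the rest of the ring just waits.
    idle = ring_size - min(batch_size, ring_size)
    return batches, idle


if __name__ == "__main__":
    batches, idle = descriptor_batches()
    print(f"{batches} batches, {idle} descriptors idle per batch")
```

With the defaults this gives 32 batches and 1984 idle descriptors, which is why growing the ring adds waiting room rather than processing speed.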
     Other than more descriptors, what can I do?  Interrupt Throttle Rate can help.  There will be a separate article on that.  Make sure the data is going to the core that is going to be doing the work.  Use RSS and MSI-X to make sure that you’re not moving your data several times.  If you’re sending your traffic to a core that is saturated, consider moving some of what is running on that core to another core.  Process affinity is pretty easy to use and can make sure you’re keeping all those cores working evenly.
     You might also consider updating the O/S.  This is not always a very attractive option, but modern O/Ses are very aware of the loads that a network can bring, and the vendors listen to our suggestions like never before.  We saw major improvements moving from one O/S to a newer version from the same vendor.  I won’t name names since we all have our off days, but since the driver for our stuff didn’t change, it clearly pointed at the cause.  The application can also be a good source of tuning, so scour the app’s support site for tuning ideas.
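As a small illustration of process affinity, here is a Python sketch that pins a process to a single core.  It assumes Linux, where the standard library exposes os.sched_setaffinity; other operating systems have their own tools and APIs for the same job.  The helper name pin_to_cpu is hypothetical, and which core to choose depends on where your traffic is being steered.

```python
import os


def pin_to_cpu(cpu, pid=0):
    """Restrict a process (pid 0 means the caller) to one CPU and return the new set.

    Linux-only: os.sched_setaffinity is not available on every platform.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is not available here")
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently use
    target = min(allowed)               # arbitrary choice for the demo
    print("pinned to", pin_to_cpu(target))
    os.sched_setaffinity(0, allowed)    # restore the original mask
```

The same idea works in reverse: move the noisy neighbor off the core that is handling your receive queue instead of moving the network work.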
Let’s look at each bottleneck one at a time.  There will be hints on things to do and questions that need to be answered.
1.    Packet creation.  In the perfect performance model, zero clocks are spent on creating the packets.  Pre-existing packets will be faster than packets that require the CPU to touch them in memory, or worse yet, to move the data from ring 3 (user mode) to ring 0 (kernel mode).  Minimize this to maximize performance.  Understanding whether the data is static (created once) or dynamic (created at use) tells you how much creation time to expect.
2.    Maximize the bus utilization.  Even though there is enough bandwidth on the bus for one port doing bidirectional traffic, some things take up bandwidth and can cost performance.  Statistics are slow registers and there are a lot of them.  When trying to maximize performance, don’t access any statistics.  Like the statistics registers, there are other “slow” registers.  When maximizing performance, leave those out of it too; the tail registers should be the only registers touched.  Registers like VLAN, RAR, MCAST, TxCW, RxCW and all PHY registers are slow enough through the internal logic to cost packets.
3.    Data locality.  Make sure the data (packet buffers and descriptors) lives on the same CPU as any work being done on it.  If work has to be scheduled between processors it will slow things down a great deal.  Even single misses can have an impact.  This is one of the best things you can do to improve your network performance, even though most of the work isn’t in the network area.  If you can only do one thing to your network, do this first, then the rest of the paper.
4.    Time.  Full line rate at 10 Gigabit with 64 byte packets is a packet roughly every 67 nanoseconds.  A delay of a microsecond will hold up 15 packets, for a loss of roughly 10K bits on the wire.  Break down the CPU time, analyze the bus time, and watch the inter-packet gaps coming out of the port.  If there is a gap of more than the IPG time, then that is slowing performance.  If you’re getting 9.2Gbps, that works out to roughly 3747 “missing” usecs.  By reviewing the bus data you should be able to work up the stack to find the trouble.
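The timing budget in the last item can be worked out with a short Python sketch.  It assumes each frame also occupies 8 bytes of preamble and 12 bytes of minimum inter-packet gap on the wire, which is an assumption about how the per-packet figure is derived; the function names are made up for the example.

```python
# Per-frame wire overhead: 8 bytes preamble/SFD + 12 bytes minimum inter-packet gap.
WIRE_OVERHEAD_BYTES = 20


def packet_time_ns(frame_bytes, line_rate_bps):
    """Time one frame occupies the wire, in nanoseconds."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return bits_on_wire / line_rate_bps * 1e9


def packets_held_up(stall_ns, frame_bytes, line_rate_bps):
    """How many back-to-back frames fit into a stall of the given length."""
    return stall_ns / packet_time_ns(frame_bytes, line_rate_bps)


if __name__ == "__main__":
    per_packet = packet_time_ns(64, 10e9)
    stalled = packets_held_up(1000, 64, 10e9)
    print(f"{per_packet:.1f} ns per packet, {stalled:.1f} packets per microsecond of delay")
```

Under these assumptions a 64 byte frame at 10 Gigabit takes about 67.2 ns, so a one microsecond stall costs just under 15 packets.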


Performance isn’t just a case of making a call to the vendor to make the magic happen.  Every part of the chain needs to be analyzed to improve the overall performance of the system.


Thanks for using Intel® Ethernet.