
Permanent Storage

Posted by dougb Nov 25, 2009

     Words mean things.  And sometimes multiple words mean the same thing.  In our land we use NVM, EEPROM and Flash interchangeably at times.  This can be confusing, so this post is a primer on the what, the why and the how of the storage on our adapters.

     First let’s break up the acronym soup.  NVM is Non-Volatile Memory: memory that survives having power removed for a long time.  EEPROM is Electrically Erasable Programmable Read-Only Memory.  Flash is based on the Fowler–Nordheim tunneling effect.  In "ye olde" days, an EEPROM and a Flash were very different inside and out.  Nowadays only the size seems to make a difference; both use the same hardware principle.  EEPROMs can be erased and written one word at a time.  Flash is erased a sector at a time (or the whole chip at once), but can be written one word at a time.

     A quick aside on erasing:  Flash has an unusual feature.  You can change a 1 bit to a 0 bit with a write command, but you can only change a 0 back to a 1 by erasing the sector or the whole part.  This makes a big difference, since having to erase before writing adds to the programming time.  EEPROM parts can take a bit from 0 to 1 with a single write command; the software does not need to execute an erase command before writing an EEPROM word.  One reason the largest Flash parts are bigger than the largest EEPROM parts is that an EEPROM has an extra transistor on each bit cell, allowing each word to be changed in either direction.  Flash parts share this transistor across a sector and therefore must be erased in sector blocks.  It’s a little backwards to most people that "blank" is all 1s, but that's the way the electronics work.  Back to our show!
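
The erase/write rule above can be sketched as a toy model.  Nothing here is tied to any real part; the sector size and values are made up for illustration:

```python
# Toy model of Flash bit behavior: a write can only clear bits (1 -> 0);
# only a whole-sector erase can set bits back to 1.
SECTOR_SIZE = 4096  # bytes; hypothetical sector size

def erase(sector):
    """Erase returns the whole sector to the 'blank' all-1s state."""
    return bytearray([0xFF] * len(sector))

def flash_write(sector, offset, value):
    """A Flash write is effectively an AND: a 1 bit can become a 0,
    but a 0 bit stays 0 until the sector is erased."""
    sector[offset] &= value

sector = erase(bytearray(SECTOR_SIZE))
flash_write(sector, 0, 0xA5)   # 0xFF & 0xA5 -> 0xA5, lands as written
flash_write(sector, 0, 0x5A)   # 0xA5 & 0x5A -> 0x00, not 0x5A!
# To store 0x5A now, the whole sector must be erased back to 0xFF first.
```

This is why Flash programming software always erases before rewriting a sector, and why that erase step adds to the programming time.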

     As you probably see, an EEPROM is an NVM, as is a Flash.  So why use the less precise term NVM over EEPROM?  Again, history points the way.  Back before the rise of the Wired for Management spec and its inclusion in WHQL (Windows* Hardware Qualification Labs) certifications, Flash was rarely included on an implementation.  But the silicon still needed somewhere to store things like the MAC address, Wake on LAN settings and other data, so we had to put storage on the card.  We elected to use EEPROMs.  Small, simple, cheap and able to hold enough data for our needs, they were a perfect match.  Then, as the need for pre-boot technologies (like PXE, which is a whole 'nother post) started to rise, WFM added the requirement that every card bring its option ROM with it.  At the time only a Flash was big enough, and that led to the dual approach: both a Flash and an EEPROM on the card.  Flash forward (pardon the pun) a decade and EEPROMs are almost the same size as the Flash of that earlier period.  At this point our team elected to use just one part, but segment it virtually, so that part of it functions in the role of the EEPROM and part functions as the Flash.  Now we had products in which you could use either a Flash or an EEPROM in this role, and this is when we started calling either one NVM.  And since terms seem to leak backwards, some people apply it to EEPROM and Flash as separate items.  We have products that use a Flash in the role of the EEPROM, and they call it NVM as well.  When somebody says NVM, just think storage for configuration data and option ROMs.  Be sure to check the datasheet and other documentation to see which storage family, EEPROM or Flash, is appropriate for your design.  Then you can call it NVM.


That was a bunch of stuff at once, so let’s end it on a review:

1)  NVM is EEPROM and/or Flash

2)  NVM/EEPROM/Flash will have device configuration information in it and is required for normal operation

3)  Thanks for using Intel(R) Ethernet products.


Moore's Law Effect on Networking

Posted by dougb Nov 20, 2009

Moore's Law is a well-known part of Intel's history.  Lesser known is how it impacts the networking world.  Many of the features of modern Ethernet offload workloads from the processor(s).  The effect of Moore's Law is that the CPU gets faster at doing those workloads at the same rate that the Ethernet controller does.  At Intel, we learned that lesson with our PRO/100 Smart adapter, and it really came home with our PRO/100 Intelligent adapter.  These are older 10/100 megabit adapters, and they offloaded IPX workloads.  The cards had processors on board to assemble IPX data streams and move the data into the memory of the host system.  The PRO/100 Smart came out just as the Intel486™ to Pentium® processor transition was happening.  The 486, and this new thing called the PCI bus, really couldn't handle 100Mb workloads.  The Smart moved datagram assembly into the coprocessor and things were good (on the early Pentiums at least).  However, as the same cards went into faster buses and faster systems, the net improvement went down.  We upgraded the PRO/100 Smart into the PRO/100 Intelligent with a faster coprocessor and Ethernet controllers (not to mention almost half the size), but the results weren't as good as we had hoped.  The Pentium II processor was out, and the PCI chipsets had gained significant performance since their introduction.  Even with the coming Gigabit transition, we could see the writing on the wall.  The power of the coprocessor would never outpace the gains in processor power from Moore's Law.  The data offload would work for one generation, then be made worthless by the next processor.  We looked into making a third-generation offload card for the Gigabit generation, but we figured out that within six months of launch, another Intel CPU would come out and erase any performance advantage the offloads created.

Software was already a problem, since the operating system likes to control things like the network interface.  When data movement is offloaded from the domain of the O/S, it can take a lot of work with the O/S vendor.  With our IPX offloads, that just meant working with Novell*.  They were very willing partners.  But not all vendors and O/S teams are like that.  Because the code that runs on the coprocessor is closed source, it can have very limited acceptance in open source engagements.  The code on the coprocessor is also subject to defects that can be harder to fix in the field than a driver issue.  Nobody really likes having to get new firmware.  Not to mention what could happen if an exploit were in the offload code.  With the O/S in charge, the risk of exploits can be limited, since the admin can close off ports or hand-patch the exploit.  In the coprocessor model, you are at the mercy of the adapter vendor, and the only mitigation is to either turn off the coprocessor (which may not always be an option) or remove or shut down the card.  Not something our customer support people were very happy about having to recommend.

     The return on investment calculation was easy when we looked at it.  Moore's Law would reduce our coprocessor's effectiveness almost by the time the card shipped.  The software model was easy on paper, but the realities of O/S vendors, exploit risks and field upgrades made it very complex.  No matter how many times we shook the Magic 8 Ball, all signs pointed to No.

So our IPX offload engine card family ended before the third generation was launched.  We put our efforts into dataflow efficiency and stateless offloads that would provide value no matter the CPU's abilities.  This has provided value to our customers no matter what processor goes into the system.  Fighting Moore's Law is like fighting the tide.  Instead, we took the strategy of riding the tide, using stateless offloads to reduce latency.  Today, our strategy has paid off; we have systems that can saturate multiple bidirectional 10 Gigabit links.


Time to wrap up our history lesson:

1)  Moore's Law means the processor you buy tomorrow will most likely outperform the offload you buy today, without allowing enough time to recoup the extra cost of a coprocessor.

2)  What looks good on paper is often proved less than effective by the real world.

3)  Thanks for using Intel Ethernet products.


     At Intel, being a network software engineer in the Wired LAN Access Division does not mean that you just write device drivers for our networking silicon. Being a network software engineer at Intel means that you also become what I like to call a "part-time electrical engineer".


     When a new silicon project starts up, a small team of network software engineers is assigned to the project.   Each team member gets their network software (device drivers, diagnostic software, etc.) ready to be executed on the new silicon by updating their code to meet the new chip specification.   Soon afterwards, the team gets some FPGA (Field-Programmable Gate Array) PCIe boards, programmed with RTL (Register-Transfer Level) code from our silicon design team.  The FPGA’s job is to execute the RTL code on the PCIe bus to simulate how the MAC (Media Access Control) silicon should operate in the real chip (albeit much slower than the real silicon, since it's a simulation). 


     The software engineering team then executes their network software on the FPGA, looking for bugs by running specific tests and performing silicon validation on the FPGA.  Once this is done, our design team fixes and debugs the issues we found in the FPGA and then completes the design.  At this point, a few last-minute RTL fixes could come our way that we need to help validate on the FPGA.  The silicon is then "taped out" and heads to the Fab to become a "real" piece of silicon.


     Once the silicon is out of the Fab, the software engineering team starts up a new silicon validation effort on the A0 silicon.  This time we not only look for obvious issues in the silicon by monitoring what our software is doing to the actual MAC silicon itself (what I like to call "bit-level debugging"),  but we also execute all sorts of crazy tests on the silicon, all in an attempt to make the MAC silicon fail.   We also look for regression bugs that might have crept in from the previous MAC silicon generation into the current chip we are working on. 


     If there are new silicon features on the MAC, the network software engineering team writes special code and devotes extra tests and time to making sure the new features work properly and do not make the silicon do something it should not.  If issues are found and it is deemed that the MAC silicon cannot ship, the silicon issues are addressed by our designers, a new stepping of the chip is created, and the whole silicon validation (SV) process starts again. 


     In short, being a network software engineer at Intel means that we not only write the software that makes our silicon connect you to the world, but we also validate the silicon and the features that help make that network connection faster and more efficient.  Thus your neighborhood-friendly network software engineer does not just write software, but also validates that the silicon is working as it is supposed to.

     It has been said there are lies, darn lies and statistics.  Well, here in the Wired Ethernet world, we tend to frown on that saying.  The statistics can be downright useful in figuring out problems in either your software or your network.  Today I'll look at using the stats to maximize the performance of your implementation when it comes to dropped packets.  And a kitchen sink will show us the way.  Hope you brought your towel.


     The Intel 1 Gigabit products have two sets of stats that are useful in this regard.  First is the Receive No Buffer Count, or RNBC.  It will increment when a frame has been successfully loaded into the FIFO but can't get out to host memory, where the buffers are, because there are no free buffers to put it into.

Second is the Missed Packets Count, or MPC. This is the count of frames that were discarded because there was no room in the MAC FIFO to store them before they were DMA'ed out to host memory.  You will typically see RNBC growing before you see MPC grow.  But, and this is a key point, you don't need an event that increments RNBC before MPC can increment.
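
On Linux you can watch counters like these from user space with `ethtool -S`.  Here is a minimal sketch; note that the counter names exposed vary by driver, so "rx_no_buffer_count" and "rx_missed_errors" are assumptions here — check what your driver actually reports:

```python
# Poll `ethtool -S <iface>` and report growth in the RNBC/MPC-style
# counters. Counter names are driver-specific; the two used below are
# assumptions for illustration.
import re
import subprocess
import time

STAT_LINE = re.compile(r"^\s*(\S+):\s*(\d+)$", re.M)

def parse_stats(text):
    """Turn `ethtool -S` output into a dict of counter name -> value."""
    return {m.group(1): int(m.group(2)) for m in STAT_LINE.finditer(text)}

def read_stats(iface):
    return parse_stats(subprocess.check_output(["ethtool", "-S", iface], text=True))

def watch(iface, names=("rx_no_buffer_count", "rx_missed_errors"), interval=5.0):
    """Growth in the second counter (MPC) means frames are being dropped."""
    prev = read_stats(iface)
    while True:
        time.sleep(interval)
        cur = read_stats(iface)
        for name in names:
            delta = cur.get(name, 0) - prev.get(name, 0)
            if delta:
                print(f"{name} grew by {delta} in the last {interval}s")
        prev = cur
```

Watching the deltas over an interval, rather than the raw totals, is what tells you whether a problem is happening now or happened once at boot.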


     First, a primer on the MAC architecture that Intel Wired Networking uses.  Coming in from the physical layer, either Copper or SerDes (or other), the packet is stored in the MAC FIFO.  It gets processed to get there, and it gets processed some more before going to the DMA block for the actual trip out to host memory.  If there is a buffer available, the DMA block sends it on its way.  The descriptor associated with the buffer is updated and the world is good.  Well, good enough.  In the RNBC case, the frame is happily in the FIFO, but without a buffer to head home to, it has nowhere to go.  In the MPC case, the poor frame can't even get into the FIFO and is dropped off the wire.  MPC is also sometimes called an overrun, because the FIFO is overrun with data.   An underrun is a TX error, so that's out of bounds for this talk.  Plus they are pretty rare these days.


     As you can tell, RNBC is not too bad, but it points to bigger problems.  MPC is pretty bad, because you are dropping frames.  So how can you have MPC without RNBC?  Imagine, if you will, an interconnect bus that is slow.  Very slow.  Like a 33 MHz PCI bus.  Now attach that to a full-line-rate 1 Gigabit 64-byte-packet data stream.  At one descriptor per packet, that's about 1.4 million descriptors per second.  In this case the software is very fast, faster than the bus, so the number of available descriptors is always kept at a level that keeps buffers available to the hardware for DMA.  But because the bus is so slow, data backs up into the FIFO.  Now, that is what the FIFO is for.  By buffering the packet, it tries to give the packet the best chance of making it into host memory alive.  In our slow case, the buffering isn't enough and the FIFO fills up.  It is draining slower than it's filling, just like a slow-draining kitchen sink.  Eventually it overflows and makes a big mess.  Thank goodness things like TCP/IP will tell the applications data has been dropped; but if you're using a lossy protocol like UDP, it's just too bad, your frame is lost to the ether.  If you need to keep track but need to use UDP, you'll need to monitor the MPC count and decide what you want to do when it goes up.
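
The 1.4-million figure checks out as a back-of-envelope calculation.  Counting the preamble/SFD and inter-frame gap that each frame also costs on the wire, it comes out a touch under 1.49 million:

```python
# Back-of-envelope check of "about 1.4 million descriptors per second":
# each minimum-size frame occupies 64 bytes plus 8 bytes of preamble/SFD
# plus a 12-byte inter-frame gap on the wire.
LINE_RATE_BPS = 1_000_000_000      # 1 Gigabit per second
WIRE_BYTES = 64 + 8 + 12           # frame + preamble/SFD + inter-frame gap

frame_time_ns = WIRE_BYTES * 8 / LINE_RATE_BPS * 1e9   # 672 ns per frame
frames_per_sec = LINE_RATE_BPS / (WIRE_BYTES * 8)      # ~1.49 million

print(f"{frames_per_sec:,.0f} frames (and descriptors) per second")
```

At one descriptor per packet, that is the rate the bus and driver together have to sustain to avoid backing data up into the FIFO.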


     As already noted, RNBC doesn't always lead to MPC, but it is a warning flag that it will happen.  Here is how RNBC can climb while MPC stays low.  Imagine we have a slow CPU, but a wicked fast bus.  The software is very slow to process the descriptors and return them, but once the descriptors are given to the hardware, it empties the backlog (read: the FIFO) faster than the incoming frames are filling it.  Returning to our kitchen sink analogy, the water is coming in at a fairly constant rate.  But imagine the stopper is down, making the sink fill up.  Just before it overflows, the drain is opened and down it goes.  The moment the water stops going down the drain is the moment our RNBC would be incremented.  The kitchen sink itself becomes our FIFO, and if the FIFO is big enough, it can save frames for quite some time.  But this is 1 Gigabit (or faster) that we're talking about, so even a good-sized FIFO (24K RX, for example) only holds 375 frames at 64 bytes, or about 267 microseconds of data.  That's not very much time.  But in a world full of 2 and 3 Gigahertz CPUs, it's long enough.  If you have 2048 descriptors for it to dump into, that is almost 8 times the amount of packet time before the FIFO starts filling up.
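
A rough check of those FIFO numbers, with the caveat that the exact microsecond figure depends on how much per-frame wire overhead you count; this sketch lands in the same few-hundred-microsecond ballpark as the text:

```python
# How many minimum-size frames fit in a 24K receive FIFO, and how long
# that buys you at 1 Gigabit line rate. "24K" is read here as 24,000
# bytes; adjust for your part's actual FIFO size.
FIFO_BYTES = 24 * 1000
FRAME_BYTES = 64
WIRE_BYTES = 64 + 8 + 12   # frame + preamble/SFD + inter-frame gap

frames_in_fifo = FIFO_BYTES // FRAME_BYTES             # 375 frames
frame_time_us = WIRE_BYTES * 8 / 1_000_000_000 * 1e6   # 0.672 us per frame
fifo_time_us = frames_in_fifo * frame_time_us          # a few hundred us

print(f"{frames_in_fifo} frames, ~{fifo_time_us:.0f} microseconds of data")
```

A few hundred microseconds is the entire grace period the hardware has before frames start going over the edge of the sink.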


     And you're probably sitting there saying, "I was told there wasn't going to be math on this blog!"  Moving on (and that's enough about the bad news), let's talk about the good news.  Both RNBC and MPC are either treatable in software or can be minimized by careful design.  RNBC is really a software problem at its core, but a fast bus never hurt.  If you're seeing RNBC climb, add more descriptors and buffers.  Make sure your ISR or polling loop is running often enough to get back to the business of adding more resources to the card before the descriptor stash runs out.  Using our example from above, if you're expecting 64-byte frames, you'll need to poll every millisecond or so if you have 2048 descriptors.  Looking at it from the other direction, if you're trying to do 1 Gigabit of traffic with only 8 descriptors, RNBC is going to jump around like a cat in a rocking chair store.  Consult your documentation (link goes to the 8257x Open Source SDM) for how to add more descriptors; all our major O/S products support it.  There may be times when you've added all the buffers the driver will let you and you're still seeing RNBC errors.  When this happens, it's a sign that the stack might be the limit.  In modern operating systems, the buffers are O/S buffers, and while we might have 2048 of them, if the O/S has ownership of 2047 of them, RNBC will just be part of life.  Most stacks have their own buffers whose count you can tinker with, so that can help.  Check the stats of your stack to see if it is having trouble keeping up.  There will be times when RNBC goes up but it looks like the stack and driver have a ton of buffers and yet work is not being done.  If you have a task that is eating up the CPU, the ISR or polling routines won't refill the buffers fast enough and RNBC will happen.
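
The "poll every millisecond or so" rule above falls straight out of the arithmetic, and generalizes to any descriptor count.  A small sketch of that sizing rule:

```python
# With N descriptors and worst-case minimum-size frames at line rate,
# the descriptor stash empties in N / frame_rate seconds, so the ISR or
# polling loop must run at least that often to keep buffers posted.
def max_poll_interval_ms(descriptors, line_rate_bps=1_000_000_000, wire_bytes=84):
    """wire_bytes = 64-byte frame + 8 preamble/SFD + 12 inter-frame gap."""
    frames_per_sec = line_rate_bps / (wire_bytes * 8)
    return descriptors / frames_per_sec * 1000.0

# 2048 descriptors -> about 1.4 ms, matching the "every millisecond or
# so" advice; 8 descriptors drain in a few microseconds, hence the cat
# in the rocking chair store.
print(max_poll_interval_ms(2048))
print(max_poll_interval_ms(8))
```

Real traffic is rarely all minimum-size frames, so this is the worst case; larger frames stretch the budget considerably.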


     MPC is treatable depending on what RNBC is doing.  If RNBC isn't moving around much, there is room for the data; it's just not getting out of the FIFO fast enough.  Much like the movie where bad things happen if the bus goes below a certain speed, the same applies to MPC.  Maybe without all the flash of a Hollywood movie, but the principle is the same.  The bus is the limit.  Move to a more traffic-friendly slot.  While slot topology and its impact on performance is a whole 'nother post, it's only common sense that an x4 card may drop frames if put into a x1-connected slot.  Give that card room to DMA, and most MPC errors will go away when the slot speed matches the maximum speed the card supports.  If you have MPC and RNBC climbing at the same rate, most likely the bus isn't the limit; the buffer reload speed is.  Treat the RNBC issue first and then see if MPC is still going out of control.


     Even with a super fast bus and a ton of buffers, there will be times when RNBC happens and times when MPC happens.  Sometimes a big burst of traffic comes just as the descriptor count gets low; sometimes the ISR doesn't run exactly when you hoped it would.  The trick is not letting either one become a big percentage of the total number of packets.  When it does get out of control, follow this post and you'll see improvement in the percentage.


Time for the big finish.

1.  RNBC is a warning sign of a slow drain from the MAC and can be treated by adding more buffers.

2.  MPC is a failure condition leading to dropped packets and can be treated with more buffers and faster interconnect buses.

3.  Thanks for using Intel networking products.
