
The One Million IOPS game

Posted by bdgowda on Sep 26, 2009 10:47:12 PM

A few months back I saw a press release on Reuters from Fusion-io and HP claiming to hit 1 million IOPS with a combination of five 320GB ioDrive Duos and six 160GB ioDrives in an HP ProLiant DL785 G5, a 4-socket server with 4 cores per socket, for a total of 16 cores. My first reaction was "wow, that is amazing"; a million IOPS is something any DBA running a high-performance database would like to get their hands on. But when I did a quick search on the Internet to see how affordable the solution would be, I was horrified to see the cost, which was close to enough to buy me a couple of Mercedes E-Class sedans. Although the performance was stellar, the cost and the 2KB chunk size made me ask which application does 2KB reads and writes anyway; the default Windows allocation unit is 4KB.

As time went by I got busy with other work, until our NAND Storage Group told us they were coming up with a PCIe-based product concept to show a real 1 million IOPS with 4KB block sizes, which real-world applications actually use. This triggered the thought of what it takes to achieve 1 million IOPS using generally available off-the-shelf components. I hit my lab desk to figure it out.


Basically, getting a million IOPS depends on three things:

1. Blazing fast storage drives.
2. Server hardware with enough PCIe slots and good processors.
3. Host bus adapters capable of handling a significant number of IOPS.


Setup:

Intel solid-state drives were my choice; a lot has been discussed and written about the performance of Intel SSDs, so that was an easy choice to make. I selected Intel X25-M 160GB MLC drives made on the 34nm process. These drives are rated for 35K random 4KB read IOPS and seemed like a perfect fit for my testing.

Then I started searching for the right dual-socket server. The Intel® Server System SR2625URLX, with five PCIe 2.0 x8 slots, provided enough slots to connect the HBAs. The server was configured with two Intel Xeon W5580 processors running at 3.2GHz and 12GB of memory.

The search for the HBA ended when LSI showed their 9210-8i series (code-named Falcon), which is rated for 300K IOPS. These are entry-level HBAs that can connect up to eight drives to their eight internal ports.

Finally, I had to house the SSDs somewhere in a nice-looking container, and a container was also necessary to provide power to the drives. I zeroed in on the Supermicro 2U SuperChassis 216 SAS/SATA HD bay. It came with dual power supplies and no board inside, but it let me simply plug the drives into the backplane and not worry about getting them powered. The other interesting thing about this chassis is that it comes with six individual connectors on the backplane, so each connector handles only four drives. This is very different from active backplanes, which route the signal across all the drives connected to them, and it allowed me to connect just 4 drives per port on the HBA. I also had to get a 4-slot disk enclosure (just some unnamed brand from a local shop), so in total I had the capacity to connect 28 drives.

With all the hardware in place, I installed Windows Server 2008 Enterprise edition and Iometer (an open-source tool for testing I/O performance). Two HBAs were fully populated, using all 8 ports on each, while the other three HBAs had only 4 ports populated. The drives were left without a partition on them. Iometer was configured with two manager processes and 19 worker threads, 11 on one manager and 8 on the other. 4KB random reads were selected, with sector alignment set to 4KB. Iometer was set to show the last update on the results screen.
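
As a rough sanity check on this configuration, the rated figures quoted above can be added up. The sketch below is only back-of-the-envelope arithmetic: it assumes the rated numbers (35K per drive, 300K per HBA) scale linearly, and the variable names are made up for illustration.

```python
# Back-of-the-envelope IOPS budget for the setup described in the post.
# Rated figures come from the post; linear scaling is an assumption.

DRIVE_RATED_IOPS = 35_000    # Intel X25-M 160GB, rated 4KB random read
HBA_RATED_IOPS = 300_000     # LSI 9210-8i, rated

# Drives attached to each of the five HBAs: two fully populated, three half.
drives_per_hba = [8, 8, 4, 4, 4]

total_drives = sum(drives_per_hba)
drive_budget = total_drives * DRIVE_RATED_IOPS
hba_budget = len(drives_per_hba) * HBA_RATED_IOPS

# The most heavily loaded HBA has to stay under its own rating as well.
busiest_hba_load = max(drives_per_hba) * DRIVE_RATED_IOPS

print(f"drives attached:                  {total_drives}")
print(f"aggregate rated drive IOPS:       {drive_budget:,}")
print(f"aggregate rated HBA IOPS:         {hba_budget:,}")
print(f"busiest HBA at rated drive speed: {busiest_hba_load:,}")
```

On rated numbers alone, 28 drives come to slightly less than a million, so a sustained million implies each drive averaged a little above its rating. The HBAs, by their rating, have headroom both in aggregate and individually.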

 

 

 

[Image: chart.gif]

[Image: clip_image002.gif]

 


Result:

The test started with 24 drives, and I found I was a few thousand short of reaching 1M IOPS, so I had to find the 4-bay enclosure to connect 4 more SSDs, taking the total number of SSDs to 28. With that, the server sustained a million IOPS with an average latency of 0.88 ms and 80-85% CPU utilization. Please see the pictures for a visual representation of the setup.
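
The throughput and latency figures can be related through Little's Law (average requests in flight = throughput × average latency). The sketch below assumes the 0.88 ms figure is the mean per-I/O response time; the post does not state the Iometer queue depth, so the result is an inference, not a reported setting.

```python
# Little's Law: average items in flight = throughput * average latency.
# Inputs are the figures reported in the post; treating the latency as the
# mean per-I/O response time is an assumption.

iops = 1_000_000        # sustained 4KB random reads per second
latency_s = 0.88e-3     # 0.88 ms average latency
drives = 28

outstanding_total = iops * latency_s
outstanding_per_drive = outstanding_total / drives

print(f"I/Os in flight across the system: {outstanding_total:.0f}")
print(f"I/Os in flight per drive:         {outstanding_per_drive:.1f}")
```

That works out to roughly 31 requests in flight per drive, which would be consistent with each SATA drive being kept near the NCQ limit of 32 outstanding commands.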

Conclusion:

Recently we demonstrated this setup at the Intel Developer Forum 2009 in San Francisco. It grabbed the attention of many visitors because it is something an IT organization can realistically achieve without a large initial investment. The good thing about this setup is that the parts and equipment are available on the open market. At Intel we wanted to get this thought started: high-performance storage is possible without robbing a ton of money from your IT department's budget. Once storage admins get an idea of what is possible, the industry will take more innovative approaches and try out new setups using off-the-shelf components.

Next Steps:

I will be spending some time getting this setup running with a RAID configuration, and possibly using a real-world application to drive the storage. This needs a lot of CPU resources, and I have in mind one upcoming platform from Intel which will let me do this. I will come up with follow-up experiments.

 

-Bhaskar Gowda.



Sep 28, 2009 5:13 AM Guest Minime  says:

Nice, but this needs heavy improvements like right yesterday. FusionIO gets already 180'000 IO/s with a much more elegant setting (28 SSD's, seriously?) and the CPU utilization is also not very arousing...

 

What takes Intel so long?

Sep 28, 2009 3:52 PM bdgowda says in response to Minime:

Hi Minime,

 

Thanks for your comments. The question here is the cost. As I mentioned, a Google search shows me that the 3TB solution from Fusion-io costs ~$100K, while my setup cost me ~$18K in total. The CPU utilization for any such setup is expected to be high due to the high volume of interrupts the CPUs have to handle; in fact, with Fusion-io being a host-based management solution, its CPU utilization will be way higher than in my setup. When you can get a high-performance I/O subsystem for less than $20K, why spend an additional $80K just for the sake of elegance? In a datacenter, packing a million IOPS into a 2U solution which costs less and delivers more is exactly what a storage admin and application architect would love to have.

 

-Bhaskar

Sep 30, 2009 4:21 PM Guest John Cagle  says:

Hi Bhaskar - nice job with the benchmark and IDF demo. It takes a lot of hard work to set those things up.

 

A few observations of your testing:

 

1) It appears that you used 100% 4K random reads. Is that correct? The Fusion-io/HP test used a mixture (70%/30%) of reads and writes which is more representative of real-world usage.

 

2) How many IOPS did you get for either 100% writes or 30% writes? In my experience, the SATA SSD's have much less write performance than read performance. In contrast, Fusion-io's storage has nearly symmetric read / write performance.

 

3) Intel sells (http://bit.ly/SZ8E6) the 160GB X25-M SSD for $899.99, so 28 of them would cost $25,199.72+tax.  The 5 HBAs are about $600/each, adding another $3,000.  Finally, add about $10,000 for the server/procs/ram/storage box/cables/etc.  That adds up to $38,199, so how exactly did your setup only cost you $18K?

 

Here's a link to the official Fusion-io/HP press release you mention:

http://www.fusionio.com/PressDetails.php?id=81

 

I look forward to your reply.

 

Thanks,

John
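
For reference, the arithmetic behind the estimate in item 3 of the comment above can be reproduced as follows. The prices are the commenter's own figures, not verified ones, and the variable names are illustrative.

```python
# Reproduces the cost estimate from the comment above, using the commenter's
# prices. Integer cents avoid floating-point rounding in the totals.

drive_price_cents = 89_999      # $899.99 per 160GB X25-M (commenter's figure)
drive_count = 28
hba_price_cents = 60_000        # about $600 per HBA (commenter's figure)
hba_count = 5
other_cents = 1_000_000         # about $10,000 for server, RAM, enclosure, cables

drives_total = drive_price_cents * drive_count
hbas_total = hba_price_cents * hba_count
grand_total = drives_total + hbas_total + other_cents


def dollars(cents: int) -> str:
    """Format an integer number of cents as a dollar amount."""
    return f"${cents // 100:,}.{cents % 100:02d}"


print("drives:", dollars(drives_total))
print("HBAs:  ", dollars(hbas_total))
print("total: ", dollars(grand_total))
```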

Oct 1, 2009 9:13 AM Guest David Flynn  says in response to bdgowda:

Bhaskar,

 

It's very rewarding to see the enterprise PCIe attached SSD space validated by Intel.  I know you guys don't generally pay attention unless it is a $B market.  Things have come a long way in the past year...

 

To clarify a misconception, if I may...

 

The reason Fusion-io's products appear to use more CPU is not due to host-side flash management.

 

What we do host side is cache the mapping of logical storage block to the physical location of the data on the Flash array.

 

This allows the application/core requesting an I/O to do its own translation, and ask for the data directly from the flash array without having to wait to be serviced by a subordinate offload processor.

 

The net result of this approach is that I/O's are lower latency (less than 1 microsecond more than the ~25 microsecond flash page read time of SLC and ~40 microsecond on MLC)

 

This is why our MLC products get better latency than what others achieve even with SLC. (BTW: our datasheets are de-rated by ~2x on latency from what one really gets.  We also derate our bandwidth by almost 100MB/s - quite the opposite of most marketing departments)

 

While this ultra-low latency leads to much better application performance, from a CPU overhead per I/O perspective it can look heavier - for the simple reason that each I/O is completed individually which requires more interrupts.

 

What matters at the end of the day is application throughput.  Pure read I/O's and at higher latencies isn't really representative of how real applications will ultimately perform.

 

Take, for example, the TPC results both HP and Dell have published using Fusion-io devices.  They tell a more complete story - cutting the cost TPC-H query by more than half, and showing that they consider our devices to have the necessary enterprise reliability and data integrity.

 

Of course, need I say that you are comparing a product from Fusion-io that has been in full production for over a year now to a prototype.

 

-David

Oct 2, 2009 4:55 PM Guest Ron  says in response to John Cagle:

John,

 

FWIW, you can get the Kingston branded Intel X25-M drives (160GB) from Provantage for about $470/ea.  Check this out  http://www.provantage.com/kingston-technology-snm125-s2-160gb~7KIN90XW.htm.  They are the exact same drives just re-branded by Kingston.

 

The server chassis Bhaskar appears to be using is the SuperMicro 24x2.5" HDD unit (pn CSE-216E1-R900LPB) which can also be had at Provantage for about $1050.  So, from my calculations, you can get the drives (28 * 460 = $13,720), chassis ($1050), and motherboard/LSI controllers (maybe $4K for all) for around $18K - give or take a few $$$.

Oct 3, 2009 4:31 PM Guest Bhaskar  says in response to John Cagle:

Hi John,

 

Thanks for your comment,

 

1. I am still working on a combination of reads and writes; however, one of our partner teams, who did a sample concept with a card sitting in a PCIe slot, showed over a million IOPS with, I think, a 66/34 read-to-write mix.

 

2. It is not difficult to find prices way lower than the ones you have posted if you look in the right places. My comment included only the storage cost.

 

-Bhaskar

Oct 3, 2009 4:36 PM Guest Bhaskar  says in response to David Flynn:

Hi David,

 

Thanks for your response and answers. This was just a concept demo I did, which helps me understand I/O performance in the virtualization area (which is my focus area). I am planning to run some actual applications on this setup to put it in a real-world scenario, and I will update the blog once I get some results. BTW, can you point me to the TPC paper you are talking about? I didn't see anything on the Fusion-io site.

 

-Bhaskar

Oct 3, 2009 4:39 PM Guest Bhaskar  says in response to Ron:

I am using a Supermicro 216 series chassis (something I found inexpensive). As I said before, it is not hard to find SSDs at a much better price on the Internet.

Oct 8, 2009 1:35 PM Guest Dave Truslow  says:

Thank you for the article. Am I correct in believing that the configuration has a max theoretical I/O bandwidth of 8GB/sec? PCIe 2.0 1x16 at 500 MB/sec per lane = 8GB/sec. Looks like your test (1 million IOPS at 4K each) =4GB/sec.

 

Any insight into how HP/Fusion-io claims 8GB/sec (per press releases)? 2K * 1 million IOPS gives 2GB/sec. Seems like I'm missing something.

Thanks,

Dave
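
The bandwidth figures in the comment above can be checked with a short sketch. It assumes decimal units (1 GB = 10^9 bytes), 4KB = 4096 bytes, and the usual PCIe 2.0 figure of 500 MB/s per lane per direction, and it ignores protocol overhead; the function names are made up for illustration.

```python
# Payload bandwidth implied by an IOPS figure, versus PCIe 2.0 link bandwidth.
# 500 MB/s per lane per direction is the usual PCIe 2.0 figure after 8b/10b
# encoding; packet and protocol overhead on top of that is ignored here.

PCIE2_MB_PER_LANE = 500


def payload_gb_per_s(iops: int, block_bytes: int) -> float:
    """Data moved per second, in decimal GB/s."""
    return iops * block_bytes / 1e9


def pcie2_gb_per_s(lanes: int) -> float:
    """Theoretical one-direction PCIe 2.0 bandwidth, in decimal GB/s."""
    return lanes * PCIE2_MB_PER_LANE / 1000


print("1M IOPS at 4KB:", payload_gb_per_s(1_000_000, 4096), "GB/s")
print("1M IOPS at 2KB:", payload_gb_per_s(1_000_000, 2048), "GB/s")
print("one x16 link:  ", pcie2_gb_per_s(16), "GB/s")
print("five x8 slots: ", pcie2_gb_per_s(5 * 8), "GB/s")
```

Note that the server in the post has five x8 slots, not a single x16 link, so under these assumptions its aggregate slot bandwidth is higher than the 8GB/s figure used in the question.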

Oct 9, 2009 7:49 PM Guest Sumeet Bansal  says in response to Bhaskar:

Hi Bhaskar,

 

You can find the TPC-H result at http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=109090801

 

Please note that the test got 1.09 USD per QphH@100GB with 2 spare Fusion-io drives. In some of the competitive testing, spares were not used.  Here is one such competitive test:   http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=109082801

 

Without the 2 spare drives, the result would have been $0.96 (less than a dollar). Fusion-io really is making a mark in the price/performance space.

 

If I may make an observation.  At some point having a large number of components in an architecture may provide the performance needed but it also increases the points of failure.  So compare 11 io-drives with 28 intel drives.  Going by mathematical probability, I would confidently say that there would be a greater number of service calls (Extra cost to a business) required to support a 28 disk architecture vs an 11 disk architecture.  Just to be clear, I am not addressing High availability here because any reasonable enterprise would address that by having multiple servers and ensuring that data exists on multiple servers at any given time.

 

Thanks

 

Sumeet Bansal
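
The component-count argument in the comment above can be illustrated with a simple probability sketch. It assumes independent failures and the same per-device failure probability for both device types, neither of which is established in this thread; the value of `p` is purely hypothetical.

```python
# Probability that at least one of n devices fails in a given period,
# assuming independent failures with the same per-device probability p.
# The value of p below is hypothetical, not a measured failure rate.


def p_any_failure(n: int, p: float) -> float:
    """Chance that one or more of n devices fail."""
    return 1.0 - (1.0 - p) ** n


def expected_failures(n: int, p: float) -> float:
    """Mean number of failed devices out of n."""
    return n * p


p = 0.02  # hypothetical per-device failure probability per year

for n in (11, 28):
    print(
        f"{n} devices: P(at least one failure) = {p_any_failure(n, p):.3f}, "
        f"expected failures = {expected_failures(n, p):.2f}"
    )
```

Under these assumptions the expected number of failures grows linearly with the device count, but the comparison changes if the two device types have different failure rates.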

Oct 13, 2009 4:53 PM Guest Jiahua  says:

Hi Bhaskar,

 

In your post, it says the 9210-8i HBA can perform up to 300K IOPS. Where did you get that number? In fact, we are building a large machine with Intel SSDs and RAID controllers. We have a similar RAID controller, Intel RS2BL080, but we can only achieve about 50K IOPS even with 8 faster X25-E 64GB SSDs. Can you contact me at <my name> at gmail dot com? I would like to know more details about your experiment settings.

 

Thanks,

Jiahua

Oct 17, 2009 12:50 PM bdgowda says in response to Jiahua:

Hi Jiahua,

 

The 9210-8i is rated for 300K IOPS. Keep in mind that when you put a RAID stack on top of the controller, it considerably slows down the controller's performance. In my setup the drives were directly connected; there are a lot of applications that do raw disk access on individual drives and use other techniques to provide RAID capabilities. I have noticed that with 4 SSDs you can easily hit 150K random read IOPS with this controller.

 

-Bhaskar

Oct 17, 2009 1:17 PM bdgowda says in response to Sumeet Bansal:

Sumeet,

 

Thanks for the link to the TPC paper.

 

-Bhaskar

Oct 17, 2009 1:22 PM bdgowda says in response to Dave Truslow:

Hi Dave,

 

Yes, I noticed it too. Since the article was just a press release and not a whitepaper or a blog post, it lacks details. I agree that with a 2KB block size a million IOPS should yield ~2GB/s; I am not sure why Fusion-io said 8GB/s. I am confused too.

 

-Bhaskar