
The Data Stack

5 Posts authored by: bdgowda

In today’s highly mobile world, users want to access corporate data from smartphones, tablets, notebook computers, and other devices. They want to get there via various access points, including those in unsecured public networks. And they want to create and access mountains of digital media content that needs to be stored securely.


How do you respond to these new storage needs? Look to the cloud. That’s a suggestion embodied in a new Intel video now available on YouTube. This animated Secure Cloud Storage video demonstrates how you can create secure, reliable private cloud storage with solutions from Intel, Oxygen Cloud, and EMC.  This forward-looking solution allows authorized users to access their data from a secure private cloud, regardless of their locations or the devices they are using.


To create the video animation, our production team reproduced a reference architecture environment in a lab setting. Our setup consisted of a single Intel® Xeon® processor-equipped server running VMware vSphere® 4.0, EMC Atmos™ Evaluation Edition, and Oxygen Cloud® Storage Connector.


The animation highlights the capabilities of the solution components and walks through a step-by-step configuration process that culminates in the deployment of a robust private cloud storage environment. In the final step in the demo, the Oxygen client is installed on a Windows 7 Professional virtual machine and configured for a user’s credentials. When the user logs in, the Oxygen Cloud space appears as another drive in Windows Explorer. The user can then perform file operations—such as creating and copying files—in the newly created cloud storage space.


Like that, you’ve got a highly scalable solution for dealing with ever-growing amounts of digital data generated by tech-savvy users. For a closer look at all of this, grab a seat and watch the demo on YouTube. In just 18 minutes, you will get a firsthand look at how easy it can be to weave cloud storage into your IT mix.


And for an even deeper dive, you can review the details of the cloud storage reference architecture on the Intel Cloud Builders site. Check out the Intel® Cloud Builders Guide: EMC, Intel, and Oxygen Cloud reference architecture.




The One Million IOPS game

Posted by bdgowda Sep 26, 2009

A few months back I saw a press release on Reuters from Fusion-io and HP claiming to hit 1 million IOPS with a combination of five 320GB ioDrive Duos and six 160GB ioDrives in an HP ProLiant DL785 G5, a 4-socket server with 4 cores per socket, for a total of 16 cores. My reaction was: wow, that is amazing. A million IOPS is something any DBA running a high-performance database would like to get their hands on. But when I did a quick search on the Internet to see how affordable the solution would be, I was horrified by the cost, which was close to the price of a couple of Mercedes E-Class sedans. Although the performance was stellar, the cost and the 2KB chunk size made me ask which application does 2KB reads and writes anyway; the default Windows allocation unit is 4KB.

As time went by I got busy with other work, until our NAND Storage Group told us that they were coming up with a product concept based on PCIe to show a real 1 million IOPS with 4KB block sizes, which real-world applications actually use. This triggered the thought of what it takes to achieve 1 million IOPS using generally available off-the-shelf components. I hit my lab desk to figure it out.

Basically, getting a million IOPS depends on three things:

1. Blazing fast storage drives.
2. Server hardware with enough PCIe slots and good processors.
3. Host Bus Adapters capable of handling the significant number of IOPS.


Intel Solid State Drives were my choice. A lot has been discussed and written about the performance of Intel SSDs, so that was an easy choice to make. I selected Intel X25-M 160GB MLC drives made on the 34nm process. These drives are rated for 35K random 4KB read IOPS and seemed like a perfect fit for my testing.

Then I started searching for the right dual-socket server. The Intel® Server System SR2625URLX, with five PCIe 2.0 x8 slots, provided enough slots to connect the HBAs. The server was configured with two Intel Xeon W5580 processors running at 3.2GHz and 12GB of memory.

The search for the HBA ended when LSI showed their 9210-8i series (code-named Falcon), which is rated to perform 300K IOPS. These are entry-level HBAs that can connect up to eight drives to eight internal ports.

Finally, I had to house the SSDs somewhere in a nice-looking container, and a container was necessary to provide power to the drives. I zeroed in on the Super Micro 2U SuperChassis 216 SAS/SATA HD bay. It came with dual power supplies and without any board inside it, but it let me simply plug the drives into the backplane and not worry about getting them powered. The other interesting thing about this chassis is that it comes with six individual connectors on the backplane, so each connector handles only four drives. This is very different from active backplanes, which route the signal across all the drives connected to them, and it allowed me to connect just 4 drives per port on the HBA. I also had to get a 4-slot disk enclosure (just an unnamed brand from a local shop), so in total I had the capability to connect 28 drives.

With all the hardware in place, I went ahead and installed Windows Server 2008 Enterprise edition and Iometer (an open source tool for testing I/O performance). Two HBAs were fully populated, using all 8 ports on them, while the other three HBAs were populated on only 4 ports each. The drives were left without a partition on them. Iometer was configured with two manager processes and 19 worker threads, 11 on one manager and 8 on the other. 4KB random reads were selected, with sector alignment set to 4KB. Iometer was set to show the last update on the result screen.









I started the test with 24 drives and found I was a few thousand IOPS short of 1M, so I had to find the 4-bay enclosure to connect 4 more SSDs, taking the total number of SSDs to 28. With that, the server delivered a sustained million IOPS with an average latency of 0.88 ms and 80-85% CPU utilization. Please see the pictures below for the setup.
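The numbers in this build can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check, not part of the original test: it uses the rated figures quoted in the post, and the PCIe 2.0 figure of roughly 500 MB/s per lane is an assumption added here, as are all the function and constant names.

```python
# Figures quoted in the post.
DRIVE_RATED_IOPS = 35_000          # Intel X25-M, 4KB random reads
HBA_RATED_IOPS = 300_000           # LSI 9210-8i
BLOCK_BYTES = 4 * 1024             # 4KB transfers
DRIVES_PER_HBA = [8, 8, 4, 4, 4]   # two full HBAs, three half-populated

# Assumption (not from the post): PCIe 2.0 moves about 500 MB/s per lane
# in each direction, so an x8 slot tops out near 4 GB/s.
PCIE2_X8_BYTES_PER_S = 8 * 500_000_000


def rated_drive_iops(drives):
    """Aggregate rated IOPS for a number of identical drives."""
    return drives * DRIVE_RATED_IOPS


def throughput_bytes_per_s(iops):
    """Data rate implied by an IOPS figure at the 4KB block size."""
    return iops * BLOCK_BYTES


def outstanding_ios(iops, latency_s):
    """Little's law: requests in flight = completion rate x average latency."""
    return iops * latency_s


if __name__ == "__main__":
    for n in (24, sum(DRIVES_PER_HBA)):
        print(f"{n} drives: {rated_drive_iops(n):,} rated IOPS")
    busiest = rated_drive_iops(max(DRIVES_PER_HBA))
    print(f"busiest HBA: {busiest:,} of {HBA_RATED_IOPS:,} rated IOPS")
    print(f"busiest slot: {throughput_bytes_per_s(busiest) / 1e9:.2f} GB/s "
          f"of about {PCIE2_X8_BYTES_PER_S / 1e9:.0f} GB/s")
    print(f"in flight at 1M IOPS and 0.88 ms: "
          f"{outstanding_ios(1_000_000, 0.88e-3):.0f}")
```

By the rated numbers, 24 drives come to 840,000 IOPS and 28 drives to 980,000, which suggests the drives ran slightly above their rating to reach the million reported here. A fully populated HBA carries 280,000 rated IOPS, under its 300K rating, and a million IOPS at 0.88 ms implies roughly 880 requests in flight at any moment.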


Recently we demonstrated this setup at Intel Developer Forum 2009 in San Francisco. It grabbed the attention of many visitors because this is something an IT organization can realistically achieve without a large initial investment; the good thing about this setup is that the parts and equipment are available on the open market. At Intel, we wanted to get people thinking about high-performance storage that does not rob a ton of money from the IT department's budget. Once storage admins get an idea of what is possible, the industry will take more innovative approaches to expanding and trying out new setups using off-the-shelf components.

Next Steps:

I will be spending some time getting this setup running with a RAID configuration and possibly using a real-world application to drive the storage. This needs a lot of CPU resources, and I have in mind an upcoming platform from Intel that will let me do this. I will come up with follow-up experiments.


-Bhaskar Gowda.

It has been nearly a month since I posted my blog on Extended Page Tables and its niceties. I had promised a follow-up blog with some hands-on test runs I had planned to run in my lab. With a burgeoning to-do list from work and endless meetings every day, I finally made some time and set up a testbed in the lab.

My goal was to run a workload on the hardware setup with Extended Page Tables enabled to help the virtual machine translate memory addresses, then rerun the workload on the same hardware setup without EPT and compare the two result sets. I wanted to keep the test simple enough to achieve my goal while making sure the results were repeatable across multiple runs.


I decided to use an open source workload called DVD Store. This workload was developed by Dell and handed over to the open source community, and it comes in variants for Microsoft SQL Server, Oracle Database, and MySQL. The database schema is made up of eight tables plus a few stored procedures and transactions. The workload comes in three database sizes: 10MB, 1GB, and 100GB. Being open source, however, it lets us tweak the size of the database to suit a specific requirement. I tweaked the database to be 2GB in size, which allowed me to fit the database and log files on the storage devices I had in the lab without going for expensive SAN-based storage. As the name says, this is an order-processing OLTP database workload that simulates customers browsing the store, adding selected DVDs, and completing the order. The primary metric is the number of orders processed during the workload execution period; the secondary metric is the average number of milliseconds taken to process each order.





Intel S5520UR Dual socket server.

CPU: Intel Xeon X5550 2.67 GHz 8 cores


Hard drive: 500GB SATA II 7.2K RPM holding the OS partition, plus three Intel® X25-E Extreme SATA Solid-State Drives.

NIC: Embedded 1GbE, full duplex.

Keyboard, mouse and Monitor



Gateway E-4610S SB

CPU: Intel Core2 Duo 4300 1.80GHz


Hard drive: 80GB SATA II

NIC: Embedded 1GbE, full duplex.

OS: Windows XP Professional with SP3.






Microsoft Windows Server 2008 Enterprise, 64-bit edition

Microsoft SQL Server 2005, 64-bit


I wanted to go with solid-state drives to ensure I was never disk bound while running the workload. The alternative would have been a boatload of conventional hard drives, increasing the setup complexity and footprint of my test hardware. Using just three Intel SSDs makes life easier and provides terrific I/O performance. ESX was naturally the hypervisor of choice: ESX 3.5 Update 3 for the test run without EPT, and ESX 4.0 for the run with EPT.


Test Methodology


I am not going to delve into how to set up the environment, install the OS, set up the application, or customize the workload; these topics are out of scope for this blog. But since you need to know how I ran my tests, I will describe the methodology just enough to explain the workload execution method and test duration, which helps in understanding the result charts below. The test was run from the client machine using the workload driver. Each run lasted 10 minutes at a stretch and was repeated three times to ensure the results were repeatable. The number of orders executed was close from run to run, within ±100-200 orders per minute (OPM).
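The repeatability check and the percentage comparisons used in the results can be expressed in a few lines. This is only a sketch: the helper names are made up for illustration, and the OPM values in the example are placeholders, not measurements from these tests.

```python
from statistics import mean


def run_spread(opm_runs):
    """Return (mean, largest deviation from the mean) for a list of
    orders-per-minute results from repeated runs."""
    avg = mean(opm_runs)
    return avg, max(abs(run - avg) for run in opm_runs)


def percent_gain(baseline, improved):
    """Percentage increase of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Placeholder values for illustration only, NOT results from the post.
    without_ept = [10_000, 10_150, 9_900]
    with_ept = [11_500, 11_600, 11_450]

    base_avg, base_dev = run_spread(without_ept)
    ept_avg, ept_dev = run_spread(with_ept)
    print(f"without EPT: {base_avg:.0f} OPM (largest deviation {base_dev:.0f})")
    print(f"with EPT:    {ept_avg:.0f} OPM (largest deviation {ept_dev:.0f})")
    print(f"gain: {percent_gain(base_avg, ept_avg):.1f}%")
```

A gain is only worth reporting when it is clearly larger than the run-to-run deviation, which is why each configuration was run three times.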







The chart above shows the number of orders the server was able to execute per minute. The X-axis represents the number of vCPUs allocated to the virtual machine and the Y-axis shows the orders per minute (OPM). With each additional vCPU added to the virtual machine, the number of orders executed by the server increases. As you can see in the chart, there is a 15%, 18%, and 31% increase in OPM, clearly scaling up with the additional vCPUs allocated to the virtual machine.


Response time.



The chart above shows the average response time to complete one order. Response time decreases by 15% to 40% between the server running without EPT and the server with EPT enabled. In addition to the improved response times, the server completes 15%-30% more transactions.




When I completed my workload runs and started looking at the data, it was apparent to me that EPT plays a major role in improving the performance of a virtualized workload. With virtualization achieving widespread adoption, IT organizations are exploring how to virtualize applications that were left untouched until now for fear of performance degradation and of missing the SLA promised to the business. Technologies like EPT give IT managers enough reason to start thinking about virtualizing critical workloads such as SQL Server and Exchange. This is the last part of this two-part blog series on EPT. Feel free to comment if you have any questions.


Bhaskar D Gowda.

Back in 2001-02, when virtualization started to garner interest in the IT world, I wondered about running different operating systems simultaneously on a server. I remember setting up a small environment in the lab using VMware GSX Server and running multiple operating systems side by side. After that I had to take my focus off virtualization due to a change in my job role. I spent little time looking into virtualization technology until I accepted another new role six years later.


A lot of development happened over these years. Virtualization became widely accepted among IT techs and management as an instrument to save money on IT expenditure, decrease TCO, increase ROI, and spend wisely in an ever-shrinking IT budget. Processor technology moved to multicore, with more than two cores available on a server, yet very little software could take advantage of all the additional threads that multicore processors provide. Consolidating physical servers as virtual machines on a multicore server helped IT leverage those additional threads and run their datacenters much cooler by reducing the number of physical servers. Virtualization also brings other goodies in terms of redundancy and disaster recovery. Since there is a ton of material available on virtualization technology, I will stop here and won't dwell further on it.


As we all know, the hypervisor, also referred to as the Virtual Machine Monitor (VMM), handles all the hardware resource slicing for the virtual machines running on top of it, providing each with an identical execution environment. While the VMM takes care of time-sharing hardware resources and allocating processor, memory, and I/O slices to the virtual machines, it introduces significant latency and overhead, since it has to translate every request concerning CPU, memory, and I/O and pass it on to the actual physical device to complete the request. This has been an Achilles' heel for virtualization technology: IT organizations keep critical applications running on physical servers because they fear the latency and overhead of the VMM would bring down application performance if virtualized. In 2005 Intel introduced hardware-assisted support for virtualization within the processor, called Intel VT. While Intel VT alleviated many performance issues associated with processor virtualization, solving part of the problem, the memory overhead still remained.


Intel released the new Intel Xeon 55XX series processors in March 2009, and the new Xeons bring many new technologies. Among them is a feature called Extended Page Tables, or EPT. As I discussed earlier, hardware-assisted virtualization support with Intel VT alleviated processor overheads; EPT takes care of memory overheads and lets virtual machines perform much faster than with software (VMM) translated memory access. I will spend some time discussing three modes of memory addressing: A) how a normal process accesses memory on a physical machine, B) how software-assisted memory management works in virtual machines without EPT support, and C) how memory management is done when EPT is enabled.


Memory Management in Native Machines


In a native system, the task of mapping logical memory page numbers to physical memory page numbers is handled by the operating system. The operating system accomplishes this by storing the entries in page table structures. When a process accesses the logical address where it thinks its data is stored, the hardware walks the page table structure to find the physical address where the data actually resides. Frequently used logical page number (LPN) to physical page number (PPN) mappings are cached by the hardware in the Translation Lookaside Buffer, or TLB. The TLB is a small cache on the processor that accelerates memory management by providing fast LPN-to-PPN mappings for frequently accessed memory locations. Picture A shows memory management on a native machine.
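The native lookup path can be modeled in a few lines. This is a simplified sketch, not how any real processor is built: it assumes 4KB pages, a single-level table held in a dictionary, and a tiny TLB with first-in-first-out eviction. The class and attribute names are invented for illustration.

```python
PAGE_SIZE = 4096  # assumed 4KB pages


class NativeMMU:
    """Toy model of logical-to-physical translation with a TLB in front."""

    def __init__(self, page_table, tlb_entries=4):
        self.page_table = page_table  # LPN -> PPN, maintained by the OS
        self.tlb = {}                 # small cache of recent LPN -> PPN pairs
        self.tlb_entries = tlb_entries
        self.hits = 0
        self.misses = 0

    def translate(self, logical_addr):
        lpn, offset = divmod(logical_addr, PAGE_SIZE)
        if lpn in self.tlb:
            self.hits += 1
            ppn = self.tlb[lpn]
        else:
            self.misses += 1
            # Page table walk. A missing entry raises KeyError, which
            # stands in for a page fault in this model.
            ppn = self.page_table[lpn]
            if len(self.tlb) >= self.tlb_entries:
                # Evict the oldest inserted entry (dicts keep insertion order).
                self.tlb.pop(next(iter(self.tlb)))
            self.tlb[lpn] = ppn
        return ppn * PAGE_SIZE + offset


if __name__ == "__main__":
    mmu = NativeMMU({0: 7, 1: 3})
    for addr in (0x0010, 0x0020, 0x1004):
        print(hex(addr), "->", hex(mmu.translate(addr)))
    print("hits:", mmu.hits, "misses:", mmu.misses)
```

The second access to page 0 is served from the TLB without touching the page table, which is the saving the TLB exists to provide.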




Memory Management using VMM


When virtual machines run on a hypervisor, the guest operating systems don't have access to the hardware page tables the way natively run operating systems do. The Virtual Machine Monitor emulates the page tables for the guest operating systems and gives them the illusion that they are mapping the logical page numbers of their running processes to actual physical page numbers. The VMM actually maintains page tables of its own, called shadow page tables, which are what the system hardware sees. Whenever the guest OS requests a translation from a virtual address to a physical memory address, the request is trapped by the VMM, which in turn runs through its shadow page tables and provides the address of the physical memory location. Picture B shows memory management using the VMM.


While the VMM handles the LPN-to-PPN mapping quite efficiently, there are times when a page fault occurs, considerably slowing down the application and the operating system. The major penalty comes when the guest OS adjusts its logical mappings: this triggers the VMM to adjust its shadow pages to keep them in sync with the logical mappings of the guest OS. For any memory-intensive application running inside the guest OS, this syncing of pages causes a hefty drop in performance due to the overhead of virtualization.
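The bookkeeping behind shadow page tables can be sketched as follows. This is a toy model under stated assumptions: page tables are plain dictionaries, every guest page table update is trapped immediately, and the class and method names are invented for illustration rather than taken from any hypervisor.

```python
class ShadowPagingVMM:
    """Toy model of a VMM keeping a shadow page table in sync with a guest."""

    def __init__(self, guest_to_host):
        self.guest_to_host = guest_to_host  # guest physical page -> host physical page
        self.guest_table = {}   # what the guest OS believes: LPN -> guest physical page
        self.shadow_table = {}  # what the hardware uses: LPN -> host physical page
        self.sync_count = 0     # number of trapped updates the VMM had to process

    def guest_map(self, lpn, guest_ppn):
        """The guest edits its page table; the VMM traps and resyncs the shadow entry."""
        self.guest_table[lpn] = guest_ppn
        self.shadow_table[lpn] = self.guest_to_host[guest_ppn]
        self.sync_count += 1

    def guest_unmap(self, lpn):
        """Removing a guest mapping is trapped and mirrored as well."""
        del self.guest_table[lpn]
        del self.shadow_table[lpn]
        self.sync_count += 1

    def translate(self, lpn):
        """The hardware only ever consults the shadow table."""
        return self.shadow_table[lpn]


if __name__ == "__main__":
    vmm = ShadowPagingVMM({0: 100, 1: 101})
    vmm.guest_map(5, 1)
    print("LPN 5 ->", vmm.translate(5))
    vmm.guest_map(5, 0)  # the guest remaps the page, forcing another sync
    print("LPN 5 ->", vmm.translate(5), "after", vmm.sync_count, "syncs")
```

Every change the guest makes costs a trap into the VMM, so a guest that remaps memory often pays this cost again and again.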


Memory Management using EPT


Hardware-assisted memory management using EPT makes life easier for the VMM. With EPT, the TLB assumes an additional role: keeping track of virtual memory and physical memory as seen by the guest OS. The TLB tracks the individual virtual machines by assigning each an address space identifier. Using the address space identifier, the TLB can tell the virtual machines' address spaces apart and does not have to be flushed when a VM switches its address space.


Having EPT manage memory for virtual machines removes the need for the VMM to keep syncing the shadow pages, eliminating that overhead. Since the number of times the shadow pages need to be synced depends on the number of virtual machines running on the server, eliminating the sync produces a tremendous increase in performance for servers with a large number of virtual machines. In addition, the benefits of EPT scale with the number of virtual processors assigned to a particular VM, since a rise in processor count also increases the shadow page syncs. Using EPT to eliminate shadow page syncs lets the CPUs just update the TLB as changes occur in the virtual pages, which comes close to memory management on a natively run operating system. The only possible downside of managing memory with EPT is the additional overhead incurred on a TLB miss, typically when many TLB-stressing applications run on the same physical server. However, hypervisors reduce TLB misses by using large pages.
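The trade-off described here, no syncing but a costlier TLB miss, can be put into numbers. The sketch below is an illustration under assumptions that are not in the post: multi-level page tables on both the guest side and the EPT side, one memory reference per table level, and no caching of intermediate walk results. The function names are invented.

```python
def ept_translate(guest_table, ept_table, lpn):
    """Two-stage lookup: the guest table gives a guest physical page,
    and the EPT maps that to a host physical page. Nothing has to be
    kept in sync between the two tables."""
    guest_ppn = guest_table[lpn]
    return ept_table[guest_ppn]


def native_walk_refs(levels):
    """Memory references for one page table walk on a native machine."""
    return levels


def nested_walk_refs(guest_levels, ept_levels):
    """Worst-case memory references for one walk with EPT.

    Each guest table pointer, plus the final guest physical address,
    must itself be translated through the EPT (guest_levels + 1 walks of
    ept_levels references each), on top of reading the guest entries.
    """
    return (guest_levels + 1) * ept_levels + guest_levels


if __name__ == "__main__":
    print("host page:", ept_translate({5: 1}, {1: 101}, 5))
    print("native, 4 levels:", native_walk_refs(4))
    print("nested, 4 x 4 levels:", nested_walk_refs(4, 4))
    print("nested, 3 x 3 levels:", nested_walk_refs(3, 3))
```

Under these assumptions a TLB miss costs 4 references natively and up to 24 with four-level tables on both sides. Large pages help twice: each TLB entry covers more memory, so misses are rarer, and the walk is one level shorter on each side, which brings the worst case down to 15.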


Picture C shows Memory Management using EPT.


As a follow-up to this blog, I am setting up a quick lab environment to verify the EPT advantages. I think I will be able to post the results from my quick hands-on experiment in a couple of weeks' time.


-Bhaskar Gowda.





The Internet is abuzz about the newly launched Intel Xeon processors. There are reviews showing a manifold increase in server performance; for some types of applications the number is 150%. We have seen multiple records shattered. The Xeon 55XX series is doing in the server world exactly what Core 2 Duo did in the desktop space back in 2006. The beauty of the new Xeon is that it brings something for everybody: database applications, web servers, business logic servers, IT infrastructure applications, virtualization, HPC, and so on. While IT administrators are busy reading reviews and calculating how much money they can save by replacing their aging infrastructure, I would like to share a little information about a less talked-about feature in the new Xeon called the PCU.


While the new Xeon got a brand new architecture, the much-discussed features are the Integrated Memory Controller, QuickPath Interconnect, and Turbo Mode. (Anybody remember the Turbo switch on computer cases back in the old days? Turbo Mode gets you the turbo speed without the need for the switch.) But there is one thing our architects added to the Xeon architecture that is quite interesting yet not much talked about: the Power Control Unit, or PCU. I am going to provide a simple explanation of this feature without delving into complicated terminology such as gates and phase-locked loops.

While desktop users don't tend to bother much about power usage, things work differently in the server world. Data center architects and managers spend hundreds of hours crunching numbers on how to make their data centers run cool without paying hefty electricity bills. Having a power-efficient processor under the hood of the server, one that can efficiently manage its own power consumption, means saving money on power bills, not only through the actual power saved on the server but also through the related cooling cost of the data center. Now that you know why it is a big deal to have an intelligent microprocessor, let's see what this PCU is.


The PCU is an on-die microcontroller dedicated to managing the power consumption of the processor. This unit comes with its own firmware, gathers data from temperature sensors, monitors current and voltage, and takes inputs from the operating system. It takes almost a million transistors to put this microcontroller on-die. While a million sounds like a drop in the ocean in a billion-transistor processor, consider that the older Intel 486 processor had a similar transistor count and ran Windows 3.x quite well.


In simple words, the PCU controls the voltage applied to the individual cores using sophisticated algorithms, sending idle cores to an almost shut-off level and reducing power consumption. Let me explain this in more detail. In older-generation CPUs it wasn't possible to run each core at a different voltage, since they shared the same source, and the idle cores still leaked power. In the new-generation Xeon, the four cores still get their voltage from a common core voltage source, but thanks to a manufacturing material Intel uses, we can run each core at a different voltage level and clock each independently at a different speed. The PCU can make this decision: it nearly shuts off an idle core by cutting voltage to it, and it can intelligently increase the voltage to one or more active cores, bumping up their clock speed and making them run faster. This is what we call Turbo. To make this simpler to understand, here is a water tap example. Suppose we have a long water pipe with four taps connected to it. When only one tap is busy filling a bucket, we can turn off the other three taps and divert the water pressure to the running tap, letting it fill the bucket faster.
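The water tap picture translates into a very small model. To be clear, this is a toy illustration of the idea and not the PCU's actual algorithm: the function name, the equal sharing rule, and every wattage figure are assumptions made up for the example.

```python
def distribute_power(active, budget_watts, max_core_watts):
    """Toy model of sharing a fixed power budget among cores.

    Idle cores are gated to (almost) nothing, like closed taps, and the
    budget is split equally among the active cores, up to a per-core cap.
    """
    n_active = sum(1 for is_active in active if is_active)
    if n_active == 0:
        return [0.0] * len(active)
    per_core = min(max_core_watts, budget_watts / n_active)
    return [per_core if is_active else 0.0 for is_active in active]


if __name__ == "__main__":
    # Hypothetical figures: an 80 W budget and a 30 W per-core cap.
    print(distribute_power([True, True, True, True], 80.0, 30.0))
    print(distribute_power([True, False, False, False], 80.0, 30.0))
```

With all four cores busy, each gets an equal share of the budget. With three cores idle, the single busy core can draw up to its cap, which in this picture is the headroom that lets it clock higher.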


One can always ask why there is a need for on-die power management when the same can be achieved by any operating system using ACPI power states. The PCU accepts power state requests from the operating system but uses its own built-in logic to double-check that the OS request holds merit. There are instances where the operating system instructs the CPU to go to a lower power state only to wake it up the next moment. Adding the PCU gives this process fine-grained efficiency and helps our customers' data centers run much cooler.
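One way to picture "checking that the OS request holds merit" is a break-even test on recent idle periods. The sketch below is purely hypothetical and makes no claim about the PCU's real firmware logic: the function name, the use of a median, and the microsecond figures are all assumptions for illustration.

```python
def grant_deep_sleep(recent_idle_us, break_even_us):
    """Hypothetical check on an OS request to enter a deep power state.

    The request is granted only if the typical recent idle period was
    long enough to pay back the cost of entering and leaving the state.
    """
    if not recent_idle_us:
        return False  # no history yet, so stay conservative
    ordered = sorted(recent_idle_us)
    typical = ordered[len(ordered) // 2]  # middle value of the history
    return typical >= break_even_us


if __name__ == "__main__":
    # Mostly short naps: the core would be woken almost immediately.
    print(grant_deep_sleep([10, 12, 500], break_even_us=100))
    # Mostly long idle periods: the deeper state is worth entering.
    print(grant_deep_sleep([300, 400, 90], break_even_us=100))
```

The first case is the situation described above, where the OS asks for a lower power state only to wake the core a moment later, and a check like this would keep the core in a shallower state instead.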
