The Data Stack

10 Posts authored by: JimBlakley

Cloud computing models are based on gaining maximum yields for all resources that go into the data center. This is one of the keys to delivering services at a lower cost. And power is one of the biggest bills in a cloud environment. Cloud data centers now consume an estimated 1–2 percent of the world’s energy.[1] Numbers like that tell you the cloud’s success hinges on aggressive power management.

 

So let’s talk about some of the steps you can take to operate a more efficient cloud:

 

  • Better instrumentation. The basis for intelligent power management in your data center is better instrumentation at the server level. This includes instrumentation for things like CPU temperature, idle and average power, and power and memory states. Your management capabilities begin with access to this sort of data.

 

  • Better power management at the server and rack level. Technologies like dynamic power capping and dynamic workload power distribution can help you reduce power consumption and place more servers into your racks. One Intel customer, Baidu.com, increased rack-level capacity by up to 20 percent within the same power envelope when it applied aggregated power management policies. For details, see this white paper.

 

  • Better power policies across your data center. Put in place server- and rack-level power policies that work with the rest of the policies in your data center. For example, you might allocate more power capacity to a certain set of servers that runs mission-critical workloads, and cap the power allocated to less important workloads. This can help you reduce power consumption while still meeting your service-level agreements.

 

  • Better power management at the facilities level. There are lots of things you can do to drive better efficiency across your data center. One of those is better thermal management through the use of hot and cold server aisles. Another is thermal mapping, so you can identify hot and cold spots in your data center and make changes to increase cooling efficiency.
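
To make the power-policy idea above a little more concrete, here is a minimal sketch of how a rack-level budget might be split between mission-critical and less important servers. The function name, the server names, and the wattages are all hypothetical, and the proportional scaling rule is just one simple choice, not how any particular power management product works.

```python
def allocate_power(budget_w, servers):
    """Split a rack power budget (watts) across servers.

    servers is a list of (name, requested_w, is_critical) tuples.
    Critical servers get their full request; the rest share what is left,
    scaled down proportionally if the remainder is too small.
    All names and numbers here are illustrative assumptions.
    """
    caps = {}
    remaining = budget_w
    for name, requested, critical in servers:
        if critical:
            caps[name] = requested
            remaining -= requested
    if remaining < 0:
        raise ValueError("budget cannot cover the critical servers")

    others = [(n, r) for n, r, c in servers if not c]
    other_total = sum(r for _, r in others)
    scale = min(1.0, remaining / other_total) if other_total else 0.0
    for name, requested in others:
        caps[name] = requested * scale
    return caps


# Hypothetical rack: 1,000 W budget, one critical database, two web servers.
rack = [("db", 400.0, True), ("web1", 400.0, False), ("web2", 400.0, False)]
print(allocate_power(1000.0, rack))
# {'db': 400.0, 'web1': 300.0, 'web2': 300.0}
```

The critical server keeps its full 400 W, and the two less important servers are each capped at 300 W so the rack stays inside its envelope.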

 

Ultimately, the key is to look at power the way you look at all other resources that go into your data center: seek maximum output for all input.

 


 

[1] Source: Jonathan Koomey, Lawrence Berkeley National Laboratory scientist, quoted in the New York Times Magazine. “Data Center Overload,” June 8, 2009.

Historically, IT data centers operated like warehouses that focused on housing the equipment brought by application developers. Today, these passive warehouses are being converted into dynamic factories that focus on achieving the maximum “application yield” from all of the resources that go into the factory. This yield is much like that of an auto factory that must produce a variety of models with the same shared set of resources.

 

There are two fundamental ways to increase the yield from your data center: efficiency by design and efficiency by operations.

 

Efficiency by design is all about designing for optimal output. One example: As you update your infrastructure over time, your new servers, storage systems, and networking equipment should deliver measurable increases in throughput and power efficiency. With each generation of technology, you should get more yield out of your equipment investments.

 

Efficiency by operations is all about managing the “resource inventory” of the data center through automation. The key here is to use automated solutions to carry out time-consuming tasks that were previously handled manually. Automation not only helps your administrators increase their productivity, it helps your data center managers ensure that the inventory of compute, storage, and network resources is used to its maximum capacity.

 

For example, you can use automated tools to:

  • Move demanding workloads to systems with excess capacity
  • Allocate additional storage to applications that are running out of disk space and reduce storage allocated to those applications that are not using it
  • Cap the power that flows to certain workloads without impacting performance
  • Update security tools and firewall settings on user systems

 

This is just a small sample of the actions you can take to increase the yield against your data center assets, including people, equipment, software, and power. There are many other things you can do. Just keep your eyes on the factory manager’s prize: maximum output for all resources that go into your facility.

How do you rate the maturity level of your power infrastructure?

 

As data centers grow in size and density, they take an ever-larger bite out of the energy pie. Today, data centers eat up 1.2 percent of the electricity produced in the United States. This suggests that IT organizations need to take a hard look at the things they are doing to operate more efficiently.

 

How do you get started down this path? Consider the following four steps toward a more energy-efficient data center. The degree to which you are doing these things is an indication of your power management maturity.

 

1. Power usage effectiveness (PUE) measurements: Are you using PUE measurements to determine the energy efficiency of your data center? PUE is a measure of how much power is coming into your data center versus the power that is used by your IT equipment. You can watch your PUE ratio to evaluate your progress toward a more energy-efficient data center. To learn more about PUE, see The Green Grid.

 

2. Equipment efficiency: Are you buying the most efficient equipment? Deploying more efficient equipment is one of the most direct paths to power savings. One example: You can realize significant power savings by using solid-state drives instead of power-hungry, spinning hard-disk drives. For general guidance in the U.S., look for Energy Star ratings for servers.

 

3. Instrumentation: Are your systems instrumented to give you the information you need? The foundation of more intelligent power management is advanced instrumentation. This is a pretty simple concept. To understand your power issues and opportunities, you have to have the right information at your fingertips. For a good example, see Intel Data Center Manager.

 

4. Policy-based power management: Have you implemented policy-based power management? This approach uses automated tools and policies that you set to drive power efficiencies across your data center. A few examples: You can shift loads to underutilized servers, throttle servers and racks that are idle, and cap the power that is allocated to certain workloads.
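
The PUE measurement in step 1 is simple enough to show as a short calculation: total facility power divided by the power that reaches the IT equipment. The sketch below uses made-up meter readings, and the function name is only for illustration.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power.

    A value of 1.0 would mean every watt entering the facility reaches the
    IT equipment; higher values mean more overhead (cooling, power
    distribution losses, lighting, and so on).
    """
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


# Hypothetical readings: 1,500 kW at the utility meter, 1,000 kW at the IT load.
print(pue(1500.0, 1000.0))  # 1.5
```

Tracking this ratio over time, rather than looking at a single reading, is what tells you whether your efficiency work is paying off.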

 

If you can answer yes to all of these questions, you’re ahead of the power-management maturity curve. But even then, don’t rest on your laurels. Ask yourself this one additional question: Could we save more by doing all of these things to a greater degree?

 

For a closer look at your power management maturity level, check out our Data Center Power Management Maturity Model. You can find it on the Innovation Value Institute site at http://ivi.nuim.ie/.

Years ago, data center managers didn’t think a whole lot about power expenditures. They were just a cost of doing business. But today, power expenditures have grown to the point that they are overwhelming IT budgets. Just how bad has it gotten? An IDC study conducted in Europe found that the cost of powering data centers now exceeds the costs of acquiring new networking hardware or new external disk storage.[1]

 

So let’s talk about five steps you can take to corral runaway power costs.

 

1. Dynamic power capping. With some workloads you can cap power without sacrificing performance. This might save you up to 20 watts per server. Power capping tends to work best with I/O intensive workloads, where CPUs spend a lot of time waiting for data. We’ve seen outstanding results with IT organizations that take a workload-centric approach to power capping.

 

2. Dynamic workload power distribution. When you have servers that are not fully loaded, you have the opportunity to shift virtualized workloads off of some servers, which can then be put in a low-power state until they are called back into service. VMware’s Distributed Power Management tool is the tip of the iceberg on this model.


3. Power capping to increase data center density. When server racks are under-populated, you’re probably paying for power capacity that you aren’t using. Intelligent power node management allows you to throttle system and rack power based on expected workloads and put more servers per rack.

 

4. Optimized server platforms. Optimized server platforms can give you more bang for your energy buck. Here’s one example: When cores within a CPU are idling, they are still drinking up power. Integrated power gates on processors allow idling cores to drop to near-zero power consumption.


5. Solid state drives. Today, lots of people are talking about performance gains with solid state drives. But that’s only part of the story. In addition to performance benefits, solid state drives can save you a bundle on power when compared to standard hard-disk drives.
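
Step 3 above comes down to simple arithmetic: if a rack has a fixed power budget, the power you provision per server determines how many servers fit. Here is a minimal sketch with hypothetical numbers (a 6,000 W rack, 300 W provisioned per uncapped server, 250 W per capped server). These figures are assumptions for illustration, not measured results.

```python
def servers_per_rack(rack_budget_w: int, per_server_w: int) -> int:
    """How many servers fit in a rack power budget at a given per-server allocation."""
    if per_server_w <= 0:
        raise ValueError("per-server power must be positive")
    return rack_budget_w // per_server_w


RACK_BUDGET_W = 6000  # hypothetical rack power envelope
UNCAPPED_W = 300      # hypothetical worst-case provisioning per server
CAPPED_W = 250        # hypothetical provisioning with a power cap in place

print(servers_per_rack(RACK_BUDGET_W, UNCAPPED_W))  # 20
print(servers_per_rack(RACK_BUDGET_W, CAPPED_W))    # 24
```

In this made-up case, capping each server 50 W below its worst-case allocation makes room for four more servers in the same power envelope.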

 

And those runaway power costs we were talking about? Let’s go rope them in.

 

The first output of the Intel Cloud Builder Program:

 

For Cloud Service Providers, Hosters and Enterprise IT who are looking to build their own
cloud infrastructure, the decision to use a cloud for the delivery of IT services is best done
by starting with the knowledge and experience gained from previous work. This white
paper gathers into one place a complete example of running a Canonical Ubuntu Enterprise
Cloud on Intel®-based servers and is complete with detailed scripts and screen shots. Using
the contents in this paper should significantly reduce the learning curve for building and
operating your first cloud computing instance.


Since the creation and operation of a cloud requires integration and customization to
existing IT infrastructure and business requirements, it is not expected that this paper 
can be used as-is. For example, adapting to existing network and identity management
requirements are out of scope for this paper. Therefore, it is expected that the user of
this paper will make significant adjustments to the design to meet specific customer
requirements. This paper is assumed to be a starting point for that journey.

 

http://software.intel.com/en-us/articles/intel-cloud-builder/

 

http://blog.canonical.com/?p=348

Learn about Intel IT’s proof-of-concept testing and total cost of ownership (TCO) analysis to assess the virtualization capabilities of Intel® Xeon® processor 5500 series. Our results show that, compared with the previous server generation, two-socket servers based on Intel Xeon processor 5500 series can support approximately 2x as many VMs for the same TCO.

 

http://communities.intel.com/servlet/JiveServlet/downloadBody/3425-102-1-5699/VirtualizationXeon5500.pdf

So, it's not clear from this posting whether VMware's "Code Central" was announced or escaped, but this looks to be a very valuable repository for sharing vSphere scripts.

 

I'm a recent convert to the wonders of creating new capabilities through the vSphere SDK. Our team has been using it to prototype some interesting new usages for power aware virtualization that we hope eventually will find their way into the VMWare Distributed Power Management (DPM) tool.

 

The most interesting usage is what we call "platooning," where different server resource pools are kept in different power states, from fully powered on through power capped to standby and fully off. As load increases, servers are moved from one platoon to the next (and workloads are migrated onto them) based on a set of policies for required application capacity headroom and power-on latency. Our belief is that, by carefully designing these policies, we'll be able to save significant power across the data center without impacting peak throughput or response time of any of the applications.
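
As a rough illustration of the platooning idea, here is a minimal sketch of a headroom policy. It is not our prototype and it makes no vSphere SDK calls. The platoon names follow the power states described above, while the function names, the per-server capacity, and the 25 percent headroom figure are hypothetical. It also ignores power-on latency, which a real policy would have to account for.

```python
import math

# Platoons ordered from most ready to least ready.
PLATOONS = ["on", "capped", "standby", "off"]


def servers_needed(load, per_server_capacity, headroom):
    """Servers required in the 'on' platoon to carry the load plus headroom."""
    return math.ceil(load * (1 + headroom) / per_server_capacity)


def rebalance(state, load, per_server_capacity=100.0, headroom=0.25):
    """Return a new server -> platoon mapping that satisfies the headroom policy.

    When more capacity is needed, servers are promoted from the readiest
    platoons first. When there is excess, surplus servers are demoted one
    step, to 'capped'. (A fuller policy would keep stepping idle servers
    down toward 'off'.)
    """
    state = dict(state)
    need = servers_needed(load, per_server_capacity, headroom)
    on = sorted(name for name, p in state.items() if p == "on")
    if len(on) < need:
        candidates = sorted(
            (name for name, p in state.items() if p != "on"),
            key=lambda name: (PLATOONS.index(state[name]), name),
        )
        for name in candidates[: need - len(on)]:
            state[name] = "on"
    elif len(on) > need:
        for name in on[need:]:
            state[name] = "capped"
    return state


pool = {"a": "on", "b": "on", "c": "capped", "d": "standby", "e": "off"}
busy = rebalance(pool, load=250.0)   # needs 4 servers on: c and d are promoted
quiet = rebalance(busy, load=100.0)  # needs 2 servers on: c and d drop to capped
print(busy)
print(quiet)
```

Even a toy like this shows where the interesting policy questions are: how much headroom to hold, and how far down the platoons you can push a server before its power-on latency hurts response time.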

 

Unfortunately, we don't have the data to demonstrate these savings yet. That's where the SDK comes in. We're able to prototype the usage, collect the data, validate the feasibility and, if it never shows up in DPM, still be able to implement it in production.

 

We're just coming up to speed on the SDK, having completed our first "Hello World" integration with it, but we think it's going to be a very valuable tool for experimenting and going to production with many new usages. I'm hoping Code Central provides a good source of examples to help bootstrap our development.

What if every server in your virtualized data center was driving 10Gbps of traffic?

My team just completed a test with an end user where we drove nearly 10Gbps of traffic over Ethernet through a single Xeon 5500 based server running ESX4.0. The workload was secure FTP. Our results will be published in the next 30 days. We’ve seen 10Gbps through a server in several other cases (notably, video streaming and network security workloads) but this is the first time we’ve really tried to do a 10Gbps “enterprise” workload in a virtualized environment. It took a fair amount of work to get the network and the solution stack to work (for example, we had to get a highly threaded open source SSH driver from the Pittsburgh Supercomputing Center to make it scale). We also found good value in some of our specialized network virtualization technologies (i.e., the VT-c feature known as VMDQ). But, regardless, by working at it moderately diligently, we got it to work at 10Gbps and don’t see any real barriers to doing that in real production environments.

We also found that the solution throughput is not particularly CPU-bound; it’s “solution stack bound”. That means that workloads that are more “interesting” than virtualized secure FTP and video streaming are likely to be able to source and sink more than 10Gbps/server, too. And, when we get to converged fabrics like iSCSI and FCoE that put the storage traffic on the same network path (or at least the same medium) as the LAN traffic, we’d expect that the application needs for higher Ethernet throughput will increase.

So what? Well, if you buy the fact that virtualized servers can do interesting things and still drive 10Gbps of Ethernet traffic, you have to wonder what’s going to happen to the data center backbone network. If you have racks with 20 servers each, putting out a nominal 6Gbps of Ethernet traffic, each rack will have a flow of 120Gbps and a row of 10 racks will need to handle 1.2 Tbps. I’m not sure what backbone data center network architecture will be able to handle that kind of throughput. Fat tree architectures help, especially if there are lots of flows between servers in close proximity to each other in the same data center. But fat tree networks are very new and not widely deployed. Thoughts?
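
The back-of-the-envelope numbers above are easy to rerun with your own assumptions. The sketch below just reproduces the arithmetic from this post (20 servers per rack, a nominal 6Gbps per server, 10 racks per row). The function name is made up for illustration.

```python
def aggregate_gbps(servers_per_rack: int, gbps_per_server: int, racks_per_row: int = 1) -> int:
    """Total nominal Ethernet traffic, in Gbps, for a rack or a row of racks."""
    return servers_per_rack * gbps_per_server * racks_per_row


rack_gbps = aggregate_gbps(20, 6)                   # one rack
row_gbps = aggregate_gbps(20, 6, racks_per_row=10)  # a row of 10 racks

print(rack_gbps)        # 120
print(row_gbps / 1000)  # 1.2 (Tbps)
```

Swap in your own server counts and per-server rates to see how quickly the row-level number climbs.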

So, building off Bob's post from September (http://communities.intel.com/openport/thread/1905), I contend that, at least from a performance perspective, with the new capabilities in the next generation of virtualized infrastructure coming this year, the answer is yes!

As we look at the availability of ESX 4.0 from VMWare and servers based on Intel's Nehalem-based Xeon processors, with new VT features for CPU, chipset and IO, later this year, we're not seeing any mainstream applications that can't be virtualized. In the past, the mainstream apps that we've consistently heard (allegedly) couldn't be virtualized are SAP and other complex business processing apps, middle sized databases and large enterprise email systems like Microsoft Exchange. While it's a little early to declare victory, we're thinking the next generation of technology will be more than good enough to run these workloads in most environments. We're currently running tests on the latest generation infrastructure software and not seeing any reason why most of these apps won't be capable of being virtualized over the next couple of years.

Anyone think differently? Why?

Note, other issues remain:

  • Even if I don't run the applications on the same physical server as other applications, is the virtual infrastructure secure and reliable enough to support these important applications?
  • And, if I try to consolidate the app with other apps, can I be guaranteed that the app won't interfere with, or be interfered with by, other apps? Interference could be either unintentional resource contention or intentional security attacks.
  • Do I have the tools and support infrastructure to run such a critical application in a virtual infrastructure?


I'm making no claims on whether these particular challenges have been solved but I would be interested in whether they are real issues for you.

What do you think?

 

Jim Blakley

 

<Note: This is a duplicate to the blog I posted at VMWorld Europe last week. I'll pull over the responses as replies to this>

So, after four days of VMWorld, there were two announcements that really resonated with me as an end user proxy within Intel. For those who don't know me, my team's role is to look at the new technologies that are coming (or might come) from Intel from the eyes of the end user. We try to understand and quantify whether end users really find any value in these technology innovations and, through hands on work in our own labs and directly in end user IT environments, identify any technical and ecosystem barriers to adoption. When we find barriers, we work across the industry to address them. My team is specifically focused on the data center and we have a big focus on data center virtualization. So, yes, the vision that Paul Maritz outlined in his keynote makes absolute sense to me. Plenty has been written about the keynotes (and maybe I'll add my own thoughts in a bit). I wanted to talk about a couple of specific things that Paul mentioned and that, to me, were very encouraging and significant.

 

Technology innovations that directly and specifically address an expressed customer need don't always come to market quickly, especially if they require coordinated effort across different companies. I also don't believe the new conventional wisdom that, with virtualization, "the hardware doesn't matter". Two announcements at VMWorld are great examples of the former and give the lie to the latter.

 

 

The first announcement was Cisco's unveiling of the Nexus 1000V virtual switch. One of the big issues for IT shops deploying virtualization has been that it's next to impossible to integrate virtual networking into existing network management processes, roles and responsibilities. It's been the CCNEs that have enabled physical networks to be managed for reliability, security and compliance and, until now, virtual switches have not allowed the separation of duties and transfer of skills that the CCNEs embody. The Nexus 1000V, a virtual softswitch that will launch next year (according to the demonstrator in their booth), will run side-by-side with the VMWare vSwitch inside ESX server and give CCNEs full Nexus OS access for configuring and monitoring the vSwitch using the same interfaces they're used to on the "hard switches". It can also enforce a separation of duties between the network administrator and the server administrator. This issue is something that we've heard repeatedly from end users as a barrier to adoption for virtualization 2.0 in the enterprise, and Cisco and VMWare deserve a lot of credit for collaborating closely to make this a reality. (BTW, it also looks to me like the first tangible evidence that higher level networking functionality is beginning to migrate back to where it started: to software on general purpose computers. Perhaps more on that later.)

 

 

The second was the announcement by VMWare of Enhanced VMotion and by Intel of VT FlexMigration. (Sorry if this part seems a little self-serving from an Intel guy.) These two capabilities, working together, address another key need of end users. Until now, each new generation of CPU needed to be maintained in a separate resource pool in the data center. If you didn't and you VMotioned backward from a new generation to an old one, it was possible that the guest application would make use of an instruction that didn't exist in the older generation. So, that kind of migration was not permitted. This restriction meant that end users had to either grow resource pools by purchasing older generation hardware (and forgoing the energy efficiency and performance gains of the new hardware) or live with increasing fragmentation into resource "puddles". With Enhanced VMotion and FlexMigration, the hypervisor can now ensure that the backward-migrated VM doesn't use any of those new instructions. Voila, the backward migration can be allowed! Pools can be grown by adding new generation servers to a pool of older servers, a much smoother and more efficient approach to evolution in the data center.
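
One way to picture the backward-migration problem is as set arithmetic over CPU features. The sketch below is a conceptual model only, not how the hypervisor or FlexMigration is actually implemented. The host names and function names are hypothetical, and the feature lists are abbreviated examples.

```python
def pool_baseline(hosts):
    """Features supported by every host in the pool (the common baseline)."""
    feature_sets = list(hosts.values())
    return set.intersection(*feature_sets) if feature_sets else set()


def can_migrate(vm_visible_features, target_host_features):
    """A VM can move safely only if the target offers everything the VM can see."""
    return vm_visible_features <= target_host_features


# Hypothetical mixed pool: one older-generation host, one newer-generation host.
hosts = {
    "old-host": {"sse2", "sse3", "ssse3"},
    "new-host": {"sse2", "sse3", "ssse3", "sse4.1", "sse4.2"},
}

baseline = pool_baseline(hosts)
print(sorted(baseline))  # ['sse2', 'sse3', 'ssse3']

# A VM that sees every feature of the new host cannot go backward...
print(can_migrate(hosts["new-host"], hosts["old-host"]))  # False
# ...but a VM restricted to the pool baseline can run on either host.
print(all(can_migrate(baseline, f) for f in hosts.values()))  # True
```

The trade-off is visible here too: a VM held to the baseline can move anywhere in the pool, but it gives up the newer instructions while it does.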

 

 

Now, in retrospect, both of these innovations seem "obvious", but actually getting them to market is challenging, and significant challenges still remain to implement them in real world environments. Perhaps more significant is that they both required the two companies to recognize the need, align their business interests to address it, design a joint solution and coordinate the launch of their respective product offerings. Hard enough to do this across teams in the same company, let alone across two companies.

 

 

So, do you see other technology challenges like this with your virtualization projects? Simple problems that seem obvious but no one seems to be addressing?
