
The Data Stack


Luiz Barroso, in his classic 2007 paper, posited that since servers in data centers were loaded at between 10 and 50 percent of peak, it would be beneficial from an energy perspective to build servers with a large power dynamic ratio, the ratio of power consumed at full load to power consumed at idle.  The figure below represents the state of the art today, with a dynamic ratio of about 2:1 and efficiency that can drop below 20 percent.  The operating band depicted is more conservative than what Barroso indicated, with a CPU utilization that rarely surpasses 40 percent.


TraditionalDataCenter.png

The next figure illustrates what happens if we improve the dynamic ratio to 5:1.  This is not possible today for single servers, but it is attainable for cloud data centers and as a matter of fact, for any environment where servers can be managed as pools of fungible resources and where server parking is in effect.

 

CloudDataCenter.png

 

The improved dynamic ratio also dramatically improves the operating efficiency within the data center's operating band, but it gets even better: the servers in the active pool are kept in the utilization sweet spot of 60 to 80 percent.  If CPU utilization in the active pool drops below 60 percent, the management application starts moving servers from the active pool to the parked pool until utilization starts inching back up.  If CPU utilization gets close to the upper end of the range, the management application brings servers back from the parked pool into the active pool to provide relief and bring the utilization numbers down.
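The policy just described is essentially a hysteresis controller on pool utilization.  The sketch below illustrates the idea with hypothetical helper functions; it is not the code of any particular management product.

# Sketch of the active/parked pool policy described above.  The helpers
# (get_pool_utilization, park, unpark) are hypothetical placeholders.

LOW_UTIL = 0.60    # below this, start parking servers
HIGH_UTIL = 0.80   # above this, start waking parked servers

def rebalance_pools(active, parked, get_pool_utilization, park, unpark):
    """Keep the active pool's average CPU utilization in the 60-80 percent band."""
    utilization = get_pool_utilization(active)
    if utilization < LOW_UTIL and len(active) > 1:
        server = active.pop()
        park(server)            # vacate its workload, then park it (e.g. ACPI S3/S5)
        parked.append(server)
    elif utilization > HIGH_UTIL and parked:
        server = parked.pop()
        unpark(server)          # power the server back on and rejoin the pool
        active.append(server)

Run periodically against the two pools, this simple rule keeps the active servers near their efficiency sweet spot.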

 

In our prior blog entry we saw that a common technique to reduce lighting power costs in residential and commercial buildings is to turn lights off in unused rooms.  This concept is so widely accepted that we rarely give it a second thought, let alone challenge it.  If that's the case, why has this concept not been applied to servers in a data center, which blaze away, drawing electricity 24 hours a day, 365 days a year, even when there is no work to be done?  There are even more extreme cases of “dead servers walking”: servers that are no longer associated with useful applications but have not been unplugged.

 

Two approaches are commonly applied to reduce lighting power consumption in residential or commercial buildings: turning lights off and using dimming mechanisms.

 

Turning lights off yields the greatest power savings, assuming the room is not to be used.  There is still a small amount of residual power being drawn to power pilot lights or motion sensors to turn on the illumination if someone enters the room.

 

Dimming the lights reduces power consumption when a room is in use: it is possible to reduce the illumination level while still allowing people to occupy the room for its intended purpose.  For instance, full illumination may not be needed in certain areas because daylight is mixed in, because zonal lighting on work areas is sufficient, or because the application calls for reduced lighting, such as in a restaurant or dining room.  The power saved through dimming will be less than that saved by turning lights off.

 

Similar mechanisms are available in servers deployed in data centers.  Servers can be shut down and restarted under application control when not needed.  We call the action of shutting down servers for power management purposes server parking.  This is the equivalent of turning lights off in a room.  The capability for “dimming the lights” in a server is embodied by Enhanced Intel® SpeedStep® Technology (EIST) and Intel® Intelligent Power Node Manager technology (Node Manager).  EIST reduces power consumption during periods of low workload, and Node Manager can cap power, that is, reduce power consumption at high workload levels under application control.

 

In tests performed at our lab, a 2-socket white box Urbanna server provisioned with Intel® Xeon® 5500 series processors, 6 DIMMs and one hard drive idles at about 50 percent of its full-load power consumption, roughly 150 watts out of 300, with EIST in effect.  If the server is working under full load, the 300 watts consumed at full power can be reduced by about 30 percent through power capping, down to 210 watts or so.

 

There is a “dimming” effect from power capping due to the voltage and frequency scaling mechanism used to implement power capping.  However, the tradeoff between performance and power consumption is more complex than the relationships in the lighting example. If the server is not working at full load, there may be enough cycles left in the server to continue running the workload without an apparent impact on performance.   In this case, the penalty is in the amount of performance headroom available should the workload pick up.  The solution to this problem is simple.  If the extra headroom is called for, the management application setting the cap can remove it and the full performance of the server becomes available in a fraction of a second.

 

There is also a richer set of options for turning off servers than there are for turning lights off.  The ACPI standard defines at least three states suitable for server parking: S3 (suspend to RAM), S4 (hibernation, where the server state is saved to a file) and S5 (soft off, where the server is powered down except for the circuitry needed to turn it back on remotely under application control).  The specific choice depends on hardware support; not all states are supported by a given implementation.  It also depends on application requirements.  A restart from S3, if supported by the hardware, can take place much faster than a restart from S5.  The tradeoff is that S3 consumes somewhat more power than S5 because the DIMMs must be kept refreshed.
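As a rough illustration of how a management application might pick a parking state, consider the sketch below.  The resume-latency figures are placeholders, not measured values, and the policy is only an example.

# Illustrative parking-state chooser.  Latency numbers are placeholders;
# real choices depend on hardware support and measured resume times.

PARKING_STATES = [
    # (ACPI state, assumed resume latency in seconds, relative standby power)
    ("S3", 10, "low: DIMMs stay powered"),
    ("S4", 60, "very low: state saved to disk"),
    ("S5", 180, "lowest: only wake circuitry powered"),
]

def choose_parking_state(max_resume_seconds, supported):
    """Pick the deepest supported state that still meets the resume deadline."""
    for state, resume, _power in reversed(PARKING_STATES):
        if state in supported and resume <= max_resume_seconds:
            return state
    return "S3" if "S3" in supported else None   # fall back to the lightest sleep

print(choose_parking_state(60, {"S3", "S5"}))    # -> "S3" under these assumptions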

 

A widespread use of server parking is not feasible in traditional environments, where a hard binding exists between the application components and the hardware hosts, because taking any of the hosts offline could cripple the application.  This binding gets relaxed in virtualized cloud environments that support dynamic consolidation of virtual machines onto a subset of active hosts.  The sub-pool of active hosts is grown or shrunk for optimal utilization levels.  Vacated hosts are parked, the equivalent of turning lights off in a room, and as in the lighting example, once a server is in the parked state it can't run applications.

 

Unlike branch circuits used for lighting where the workload is sized to never exceed the circuit’s capacity, branch circuits feeding servers may be provisioned close to capacity.  One possible application for Node Manager is to establish a guard rail for power capping to kick in if the power consumption gets close to the limit.

 

Virtualization and cloud computing bring costs down by enabling the reuse and sharing of physical and application resources, which leads to more efficient use and a higher degree of utilization of those resources.

 

Most IT organizations today are under enormous pressure to keep their budgets in check. Their costs are going up, but their budgets are flat to decreasing as illustrated in the figure below.  This is more true than ever in this period of economic and financial crisis. The situation is not sustainable and eventually leads to unpleasant conditions such as slower technology refresh cycles, reduced expectations for IT value delivered and layoffs. The service re-use inherent in cloud computing promises long lasting relief from the cost treadmill.

KTBR-Legacy.png

Conceptually, a portion of IT budgets is used to maintain existing operations: the portion dedicated to maintaining office productivity applications, the help desk, or the organization that provides telephone services.  This portion is important because it is the part that “keeps the business running” (KTBR).  In most IT organizations, the KTBR portion takes the lion’s share of the budget.  The downside is that the KTBR portion is backward looking, and it’s only the leftover portion that can be applied to grow the business.  There is another problem: left unchecked, the KTBR portion tends to grow faster than IT budgets overall, and the situation can't stay unchecked forever.

 

 

A number of strategies have been used in IT organizations to keep KTBR growth in check.  Perhaps the most often used in the past few years is the outsourcing of certain applications such as payroll and HR functions, including expense reports and the posting of open positions in the corporation.  When outsourcing (and perhaps off-shoring) is brought in, costs actually go up a notch as reorganizations take place and contracts are negotiated.  Once the outsourcing plans are implemented costs may go down, but the sustainability problem remains.  Part of the initial cost reduction comes from salary arbitrage, especially when the service providers are in lower-cost countries.  Unfortunately, the cost benefit from salary arbitrage tends to diminish over time as these countries advance technically and economically.

 

KTBR-outsourcing.png

 

A third alternative comes from technology refreshes as shown below.

 

KTBR-techrefresh.png

 

The introduction of a new technology lowers the cost of doing business, seen as a cost dip in the figure.  Costs can be managed through an aggressive “treadmill” of technology adoption, but this does not fix the general uptrend, and not many organizations are willing or even able to sustain this technology innovation schedule.

 

Finally, the adoption of cloud computing will likely lead to a structural and sustainable cost reduction for the foreseeable future due to the synergies of reuse. As in the outsourcing case, there is an initial bump in cost due to the upfront investment needed and while the organization readjusts and goes through the learning curve.

KTBR-costreduction.png

Cloud computing reduces both capital and operational expenses through multiple factors:

 

  • Economies of scale: The service provider becomes an expert in the field and can deliver the service more efficiently at lower administrative costs than any other provider, possibly at a lower price than the cost of implementing the same service in house.  (OpEx)
  • The infrastructure is shared across multiple tenants. (CapEx)
  • Application software licensing costs are shared across tenants. (OpEx)
  • The environment is virtualized allowing dynamic consolidation.  Servers are run at the most efficient utilization sweet spot, and hence fewer servers overall are required to deliver a given capability. (CapEx)
  • The traditional IT infrastructure is highly siloed.  Once these silos are broken, there is no need to overprovision to meet peak workloads. (CapEx)
  • Expensive and slow capital procurement processes are no longer necessary. (CapEx)
  • The IT organization can defer server purchases and decommission data centers as in-house capabilities are phased out in favor of cloud services. (CapEx)

It is undeniable that cloud computing activities have come to the forefront in the IT industry, to the point that Gartner declares, “The levels of hype around cloud computing in the IT industry are deafening, with every vendor expounding its cloud strategy and variations, such as private cloud computing and hybrid approaches, compounding the hype.”  As such, Gartner has added cloud computing to this year’s Hype Cycle report and placed the technology right at the Peak of Inflated Expectations.

 

Michael Sheehan in his GoGrid blog analyzed search trends in Google* Trends as indicators of technologies’ mindshare in the industry.  Interest in cloud computing seems to appear out of nowhere in 2007, and interest in the subject was still increasing as of the end of 2009.

 

Also worth noting is the trend for virtualization, one of the foundational technologies for cloud computing.  Interest in virtualization increased through 2007 and reached a plateau in 2008.  Likewise, its news reference volume has remained constant over the past two years.

 

 

Blue line: Cloud computing

Red line: Grid computing

Orange line: Virtualization

 

 

GoogleTrends.png


Figure 1. Google Trends graph of search volume index and news reference volume for cloud and grid computing and virtualization.

 

 

Given this information, is cloud computing at its peak of hype, about to fall short of expectations and bound to fall into the trough of disillusionment?  According to Gartner, the goal of this exercise is to separate hype from reality and enable CIOs, CEOs and technology strategists to make accurate business decisions regarding the adoption of a particular technology.


Cloud computing does not stand for a “single malt” technology in the sense that mesh networks, speech recognition or wikis are.  Rather, cloud computing represents the confluence of multiple technologies, not least grid computing, virtualization and service orientation.  Hence the Gartner Hype Cycle may not be an accurate model for predicting how the technology will evolve and how it will be adopted in the industry.

 

If the Gartner hype cycle theory is to apply to cloud computing, it cannot be in isolation.  In addition to the three enabler technologies mentioned above, we need to add the Internet for making possible the notion of federated computing.  From this perspective, what we may be witnessing is actually the Hype Cycle’s Slope of Enlightenment.  The search volume index for the Internet is shown in Figure 2.


InternetTrend.png


Figure 2.  Google Trends Search Volume Index for the Internet.

 

The graph by itself does not look very interesting until we note that it is actually a picture of the Trough of Disillusionment; the time frame shown is simply too short to reveal the full cycle.  We can claim that the Peak of Inflated Expectations actually occurred in the years 1994 through 2001, that is, the period of the infamous Internet boom.


Beginning at the end of 2007 we see the convergence of grid, virtualization and services into the cloud, and the Internet infrastructure build-out beginning to pay off.  Grid computing moves from niche applications, starting with scientific computing, then technical and engineering computing, then computational finance, into mainstream enterprise computing.  Cloud computing would not be possible without the dark fiber laid out in the 1990s.


The technology trigger period is actually much longer than what the Gartner graph suggests.  For a number of watershed technologies there is usually a two or three-decade incubation period before the technology explodes into the public consciousness.


This pattern took place in the radio industry, from the 1901 Marconi experiments transmitting Morse code over radio to the first broadcasts in 1923.  In the automotive industry the incubation period spans from the invention of the first self-propelled vehicles in the late 19th century to 1914, with Henry Ford’s assembly line manufacturing and the formation of large-scale supply chains.  For the Internet, the incubation period was the era of the government-run internet that began with the creation of ARPANET in 1969, and the trigger came with the commercialization of the internet, marked by the official dissolution of ARPANET in 1990.


The trigger point for a technology is reached when a use case is discovered that makes the technology self-sustaining.  For the automobile it was the economies of scale that made the product affordable, together with Ford’s decision to reinvest profits to increase manufacturing efficiencies and lower prices to spur demand.  For the radio industry it was the adoption of the broadcast model supported by commercial advertising.  Before that there was no industry to speak of; radio was used on a small scale as an expensive medium for point-to-point communication.


Consistent with the breadth of the technologies involved, the commercial development of the internet developed along multiple directions during the speculative period in the 1990s.  The Peak of Inflated Expectations saw experimentation of business models, with the vast majority proving to be unsustainable. The speculative wave eventually went bust shortly after 2000.


Hence we’d like to claim that the recent interest in cloud computing, taken in the context of prior developments in grid computing, the service paradigm and virtualization, and running over the infrastructure provided by the Internet, is actually the slow climb up the Slope of Enlightenment.  Experimentation will continue, and some attempts will still fail.  However, the general trend will be toward mainstreaming.  In fact, one of the success metrics predicted for the grid was the technology becoming so common that no one would think about the grid anymore.  This pattern is already taking place with federated computing and federated storage.

One of the first questions in my mind when I was first exposed to Intel(r) Intelligent Power Node Manager (Node Manager) was "what is the performance impact of applying Node Manager technology?"  I will share some thoughts.  The underlying dynamics are complex and not always observable, and hence it's difficult to provide a definitive answer.  Robert A. Heinlein popularized the term TANSTAAFL ("There ain't no such thing as a free lunch") in his 1966 novel “The Moon Is a Harsh Mistress”.  So, does TANSTAAFL apply here?  Node Manager brings benefits with the ability for an application to designate a target power consumption, a capability otherwise known as power capping.  On the cost side, Node Manager takes some work to deploy and has a performance impact that varies from very little to moderate.  On the other hand, Node Manager can be turned off, in which case there is no overhead.

 

Node Manager is useful even when it is not actively power capping but is used as a guardrail, ensuring that power consumption will not exceed a threshold.  The predictable power consumption has value because it provides data center operators a ceiling in power consumption.  Having this predictable ceiling helps optimize the data center infrastructure and reduce stranded power.  Stranded power refers to a power allocation that needs to be there even if it's only for occasional use.

 

The performance impact can vary from zero when Node Manager is used as a guardrail to a percentage equal to the number of CPU cycles lost due to power capping when Node Manager is applied at 100% utilization.  When applied during normal operating conditions, the loss of performance is smaller than the number of cycles lost to power capping implies because the OS usually compensates for the slowdown.  If the end user is willing to re-prioritize application processes, under some circumstances it is possible to bring performance back to the uncapped level or even beyond.

 

Power capping is attained through voltage and frequency scaling.  The power consumed by a CPU is proportional to its frequency and to the square of the voltage applied to it.  Scaling is done in discrete steps ("P-states", as defined by the ACPI standard).

 

The highest performing P-states are also the most energetic.  Starting from a fully loaded CPU at the highest P-state, demand-based switching (DBS) assigns lower-energy P-states as the workload is reduced, using Enhanced Intel® SpeedStep® Technology.  An additional dip takes place as idle is reached, when unused logical units in the CPU are switched off automatically.

 

Node Manager allows manipulating the P-states under program control instead of autonomously as under SpeedStep.  Since the CPU runs slower, this potentially removes some of the cycles that could otherwise be used by applications, but the reality is more nuanced.

 

At high workloads, most CPU cycles are dedicated to running the application.  Hence, if power capping is applied, a reduction in CPU speed will yield an almost one-to-one reduction in application performance.

 

At the other end of the curve, when the CPU is idling, power consumption is already at its floor, and applying Node Manager will not yield any additional reduction in power consumption.

 

The more interesting cases take place in the mid-range band of utilization, when the utilization rate is between 10 and 60 percent, depending on the application (40 to 80 percent in the BMW case study below).  Taking utilization beyond the upper limit is not desirable because the system would have difficulty absorbing load spikes, and hence response times may deteriorate to unacceptable levels.

 

We have run a number of applications in the lab and observed their performance behavior under Node Manager.  Surprisingly, the performance loss is less than frequency scaling would indicate.  One possible explanation is that when utilization is in the mid-range, there are idle cycles available.  The OS compensates to some extent for the slower cycles by increasing the time slices to the applications, using up otherwise idle cycles, to the point that the apparent performance of the application is little changed.  The application may need to be throttled up to re-gain the pre-capping throughput.

 

One way to verify this behavior is to observe that CPU utilization has indeed gone up in a power capped regime.  BMW conducted a proof of concept with Intel precisely to explore the boundaries of the extent to which that application could be re-prioritized under power capping to restore the original, uncapped throughput.  TANSTAAFL still applies here.  The application is still yielding the same performance under power capping.  However, since there are fewer cycles available due to frequency scaling, there will be less headroom should the workload pick up suddenly.  In this case the remedy is simply to remove the cap.  The management software needs to be aware of these circumstances and initiate the appropriate action.

 

The experiments in this proof of concept involved an application mix used at a BMW site.  In the first series of experiments we plotted power consumption against CPU utilization by throttling the workload up and down, shown in red.

 

 

BMW-savings.png

 

In the second series, shown in green, we apply an initial power cap to each point on the original curve.  This yields a performance reduction.  The workload is then throttled up until the uncapped performance is restored.  This process is repeated with increasingly aggressive power caps until the original performance can no longer be reached.  The resulting system power consumption, achieved without impacting system performance, is plotted in green.  The difference between the red and green curves represents the range of capping that can be applied while maintaining the original throughput level.  Running at the green level yields the same performance as the uncapped system.  However, since idle cycles have been removed, there is no margin left to pick up extra workload; should that happen, performance indicators will deteriorate very quickly.

 

Under the circumstances described above, the system was able to deliver the same throughput at a lower power level.  There was no compromise in performance.  The tradeoff is in the form of diminished headroom in case the workload picks up.  The system operator or management software has the option to remove the cap immediately should this headroom be needed.

In spite of significant gains in server energy efficiency, power consumption in data centers is still trending up.  At the very least, we can make sure that the energy expended yields maximum benefit to the business.  A first step in managing power in the servers in a data center is having a fairly accurate monitoring capability for power consumption.  The second step is to have a number of levers that allow using the monitoring data to carry out an effective power management policy.

 

While we may not be able to stem the overall growth of power consumption in the data center, there are a number of measures we can take immediately:

  • Implement a peak shaving capability.  The data center power infrastructure needs to be sized to meet the demands of peak power.  Reducing peaks effectively increases the utilization of the existing power infrastructure.

 

  • Be smart about shifting power consumption peaks.  Not all watts are created equal.  The incremental cost of generating an extra watt during peak consumption hours is much higher than that of the same watt generated in the wee hours of the morning.  For most consumer and smaller commercial accounts, flat-rate pricing still prevails.  Real-time pricing (RTP) and negotiated SLAs will become more common as a way to put the appropriate economic incentives in place.  The incentive of real-time pricing is a lower energy bill overall, although the outcome is not guaranteed; in pilot programs, residential consumers have complained that RTP results in higher electricity costs.  With negotiated SLAs the customer can designate a workload as subject to lower reliability; for instance, instead of three nines, or outages amounting to about 10 hours per year, a low-reliability workload can be designated as only 90 percent reliable and can be out for an average of about two hours per day.

 

  • Match the electric power infrastructure in the data center to server workloads to minimize over-provisioning.  This approach assumes the existence of an accurate power consumption monitoring capability.

 

  • Upgrading the electrical power infrastructure to accommodate additional servers is not an option in most data centers today.  Landing additional servers at a facility that is working at the limit of its thermal capacity leads to the formation of hot spots, and that is assuming electrical capacity limits are not reached first, with no room left in certain branch circuits.  Hence measures that work within the existing power infrastructure are to be preferred over alternatives that require additional infrastructure.

 

 

For the purposes of data center strategic planning, it may make economic sense to grow large data centers in a modular fashion.  If the organization manages a number of data centers, consider making effective use of the existing ones, and when new construction is justified, redistribute workloads to the new data center to maximize the use of its new electrical supply infrastructure.

 

Intel has built into its server processor lineup a number of technology ingredients that allow data center operators to optimize the utilization of the available power system infrastructure in the data center.

 

 

Newer servers of the Nehalem generation are much more energy efficient, if only because of the side effect of increased performance per watt.  These servers also have a more aggressive implementation of power proportional computing.  Typical idle consumption figures are in the order of 50 percent of peak power consumption.

 

 

Beyond passive mechanisms that do not require explicit operator intervention, the Intel® Intelligent Power Node Manager (Node Manager) technology allows adjusting the power draw of a server and trade off power consumption against performance.  This capability is also known as power capping.  The control range is a function of server loading.  For the Intel SR5520UR baseboard on the 2U chassis, the server will draw about 300 watts at full load and its power consumption can be rolled down to about 200 watts.  The control range tapers down gradually until it reaches zero at idle.

 

 

For power monitoring, selected models of the current Nehalem generation come with PMBus specification compliant power supplies allowing real-time power consumption readouts.

 

 

The Node Manager power monitoring and capping capability applies to a single server.  To make this capability really useful it is necessary to exercise it collectively across groups of servers, to add the notion of events, and to build a historical record of power consumption for the servers in a group.  These additional capabilities have been implemented in software through the Data Center Manager Software Development Kit developed by the Intel Solutions and Software Group.  An additional Software Development Kit, Cache River, allows programmatic access to components in servers and server building blocks produced by the Intel Enterprise Products Server Division (EPSD), including the baseboard management controller (BMC) and the management engine (ME), the subsystems that host or interact with the Node Manager firmware.  EPSD products are incorporated in many OEM and system integrator offerings.

 

Data Center Manager implements abstractions that apply to collections of servers:

  •   A hierarchical notion of logical server groups
  •   Power management policies bound to specific server groups
  •   Event management and a publish/subscribe facility for acting upon and managing power and thermal events.
  •   A database for logging a historical record for power consumption on the collection of managed nodes.

 

 

The abstractions implemented by DCM on top of Node Manager allow the implementation of power management use cases that involve up to thousands of servers.

 

If this topic is of interest to you, please join us at the Intel Developer Forum in San Francisco at the Moscone Center on September 22-24.  I will be facilitating course PDCS003, "Cloud Power Management with the Intel(r) Xeon(r) 5500 Series Platform."  You will have the opportunity to talk with some of our fellow travelers in the process of developing power management solutions using Intel technology ingredients and get a feel for their early experience.  Also please make a note to visit booths #515, #710 and #712 to see demonstrations of the early end-to-end solutions these folks have put together.

I would like to elaborate on the topic of energy vs. power management from my previous entry.

 

   

 

Upgrading the electrical power infrastructure to accommodate additional servers is not an option in most data centers today.  Landing additional servers at a facility that is working at the limit of its thermal capacity leads to the formation of hot spots, and that is assuming electrical capacity limits are not reached first, with no room left in certain branch circuits.

 

   

 

There are two types of potentially useful figures of merit, one for power management and one for energy management.  A metric for power management allows us to track operational "goodness", making sure that power draw never exceeds limits imposed by the infrastructure.  The second metric tracks power saved over time, which is energy saved.  Energy not consumed goes directly to the bottom line of the data center operator.

 

     

To understand the dynamic between power and energy management, let's look at the graph below and imagine a server without any power management mechanisms whatsoever.  The power consumed by that server would be P(unmanaged) regardless of operating conditions.  Most servers today have a number of mechanisms operating concurrently, and hence the actual power consumed at any given time t is P(actual)(t).  The difference P(unmanaged) - P(actual) is the power saved.  The power saved, accumulated over the interval t(1) through t(2), yields the energy saved.

 

 

 

EnergySavings.png
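Numerically, the energy saved is just the power saved summed over the interval from t(1) to t(2).  A minimal sketch, assuming evenly spaced power samples:

# Energy saved = (P_unmanaged - P_actual) accumulated over the interval.
# Minimal sketch assuming evenly spaced power samples.

def energy_saved_wh(p_unmanaged_watts, p_actual_samples_watts, interval_seconds):
    """Return the energy saved in watt-hours over the sampled interval."""
    saved_joules = sum(
        (p_unmanaged_watts - p_actual) * interval_seconds
        for p_actual in p_actual_samples_watts
    )
    return saved_joules / 3600.0   # joules (watt-seconds) to watt-hours

# Example: a server that would draw 300 W unmanaged, sampled every 60 s
# while actually drawing 250 W for an hour, saves 50 Wh.
print(energy_saved_wh(300, [250] * 60, 60))   # -> 50.0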

Please note that a mechanism that yields significant power savings may not necessarily yield high energy savings.  For instance, the application of Intel(r) Dynamic Power Node Manager (DPNM) can potentially bring power consumption down by over 100 watts, from 300 watts at full load to 200 watts, in a dual-socket 2U Nehalem server that we tested in our lab.  However, if DPNM is used as a guard rail mechanism, to limit power consumption if a certain threshold is violated, DPNM may never kick in, and hence energy savings will be zero for practical purposes.  The reason for using it this way is that DPNM works best only under certain operating conditions, namely high loading factors, and because it works through frequency and voltage scaling, it carries a performance tradeoff.

 

   

 

Another useful figure of merit for power management is the dynamic range for power proportional computing.  Power consumption in servers today is a function of workload as depicted below:

 

PowerGraph.png

The relationship is not always linear, but the figure illustrates the concept.  On the x-axis we have the workload, which can range from 0 to 1, that is, 0 to 100 percent.  P(baseline) is the power consumption at idle, and P(spread) is the power proportional computing dynamic range between P(baseline) and the power consumption at 100 percent workload.  A low P(baseline) is better because it means low power consumption at idle.  For a Nehalem-based server, P(baseline) is roughly 50 percent of the power consumption at full utilization, which is remarkable, considering that it represents a 20 percentage point improvement over the number we observed for the prior generation, Bensley-based servers.  The 50 percent figure is a number we have observed in our lab for a whole server, not just the CPU alone.
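In symbols, the linear approximation in the figure is P(u) = P(baseline) + u * P(spread) for a utilization u between 0 and 1.  A quick sketch using the Nehalem-class numbers quoted above:

# Linear power-proportional model from the figure: P(u) = P_baseline + u * P_spread.
# The numbers mirror the roughly 2:1 dynamic range quoted for a Nehalem-class server.

P_FULL = 300.0                    # watts at 100 percent utilization (lab figure)
P_BASELINE = 0.5 * P_FULL         # ~50 percent of full power at idle
P_SPREAD = P_FULL - P_BASELINE    # dynamic range of the server

def server_power(utilization):
    """Estimate server power (watts) for a utilization between 0.0 and 1.0."""
    return P_BASELINE + utilization * P_SPREAD

print(server_power(0.0), server_power(0.5), server_power(1.0))   # 150.0 225.0 300.0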

 

   

 

If a 50 percent P(baseline) looks outstanding, we can do even better for certain application environments, such as load-balanced front-end Web server pools and the implementation of cloud services through clustered, virtualized servers.  We can achieve this effect through the application of platooning.  For instance, consider a pool of 16 servers.  If the pool is idle, all the servers except one can be put to sleep.  The single idle server consumes only half the power of a fully loaded server, that is, one half of one sixteenth of the full cluster power.  The dormant servers still draw a few percent of full power.  After doing the math, the total power consumption for the cluster at idle comes to roughly 8 percent of the full cluster power consumption.  Hence for a clustered deployment, the power dynamic range has been increased from 2:1 for a single server to about 12:1 for the cluster as a whole.
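The arithmetic behind the 16-server example can be written out explicitly.  The sleep-state draw is the key assumption; the sketch below shows both a 2 percent figure and the less-than-5-percent figure quoted elsewhere on this page.

# Cluster-level dynamic range for the 16-server platooning example above.

N_SERVERS = 16

def cluster_idle_fraction(sleep_fraction, idle_fraction=0.5):
    """Fraction of full cluster power drawn with one idle server and the rest asleep."""
    idle_power = idle_fraction + (N_SERVERS - 1) * sleep_fraction
    return idle_power / N_SERVERS

print(cluster_idle_fraction(0.02))   # ~0.05, roughly a 20:1 dynamic range
print(cluster_idle_fraction(0.05))   # ~0.08, roughly the 12:1 figure quoted above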

 

   

 

In the figure below note that each platoon is defined by the application of a specific technology or state within each  technology.  This way it is possible to optimize the system behavior around the particular operational limitations of the technology.  The graph below is a generalization of the platooning graph in the prior article.  For instance, a power capped server will impose certain performance limitations to workloads, and hence we assign non time critical workloads to that platoon.  By definition, an idling server cannot have any workloads; the moment a workload lands on it it's no longer idle, and its power consumption will rise.

 

   

 

The CPU is not running in any S-state other than S0.  The selection of a specific state depends on how fast that particular server is needed back online.  It takes longer to bring a server back online from the lower-energy states.  Servers in G3 may actually be unracked and put in storage for seasonal equipment allocation.

 

   

 

A virtualized environment makes it easier to rebalance workloads across the active (unconstrained and power capped) servers.  If servers are being used as CPU cycle engines, it may be sufficient to idle or put to sleep the subset of servers not needed.

 

PowerTransitions.png

 

The extra dynamic power range comes at the expense of instituting additional processes and operational complexity.  However, please note that there are immediate benefits in power and energy management accrued through a simple equipment refresh.  IBM reports an 11X performance gain for Nehalem-based HS22 blade servers versus the HS20 model only three years old.  Network World reports a similar figure, a ten-fold increase in performance, not just ten percent.

 

   

 

I will be elaborating on some of these ideas at the PDCS003 Cloud Power Management with the Intel(r) Nehalem Platform class at the upcoming Intel Developer Forum in San Francisco on the week of September 20th.  Please consider yourself invited to join me if you are planning to attend this conference.

There are two technologies available to regulate power consumption in the recently introduced Nehalem servers using the Intel® Xeon® processor 5500 series.  The first is power proportional computing where power consumption varies in proportion to the processor utilization.  The second is Intel® Dynamic Power Node Manager (DPNM) technology which allows the setting of a target power consumption when a CPU is under load.  The power capping range increases with processor workload.

 

An immediate benefit of the Intel® Dynamic Power Node Manager (DPNM) technology is the capability to balance and trade off power consumption against performance in deployed Intel Nehalem generation servers.  Nehalem servers have a more aggressive implementation of power proportional computing, where idle power consumption can be as low as 50 percent of the power under full load, down from about 70 percent in the prior (Bensley) generation.  Furthermore, the observed power capping range under full load when DPNM is applied can be as large as 100 watts for a two-socket Nehalem server with the Urbanna baseboard, observed in the lab to draw about 300 watts under full load.  The actual numbers you will obtain depend on the server configuration: memory, number of installed hard drives and the number and type of processors.

  

Does this mean that it will be possible to cut the electricity bills by one third to one half using DPNM?  This is a bit optimistic.  A typical use case for DPNM is as a "guard rail".  It is possible to set a target not to exceed for the power consumption of a server as shown in the figure below.  The red line in the figure represents the guard rail.  The white line represents the actual power demand as function of time; the dotted line represents the power consumption that would have existed without power management.

 

PowerCap.png

 

Enforcing this power cap brings operational flexibility: it is possible to deploy more servers to fit a limited power budget to prevent breakers from tripping or to use less electricity during peak demand periods.
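A back-of-the-envelope sketch makes the flexibility concrete.  The 300-watt and 200-watt figures are the lab numbers quoted above; the 6 kW branch circuit budget is an arbitrary example, not a recommendation.

# Back-of-the-envelope rack density under a power budget.
# 300 W uncapped and 200 W capped are the lab figures quoted above;
# the 6 kW budget is an arbitrary example.

RACK_BUDGET_W = 6000
UNCAPPED_W = 300    # worst-case draw with no cap in place
CAPPED_W = 200      # guaranteed ceiling with the guard-rail cap applied

servers_uncapped = RACK_BUDGET_W // UNCAPPED_W   # 20 servers
servers_capped = RACK_BUDGET_W // CAPPED_W       # 30 servers

print(servers_uncapped, servers_capped)   # capping lets ~50% more servers fit the budget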

 

 

There is a semantic distinction between energy management and power management.  Power management in the context of servers deployed at a data center refers to a capability to regulate the power consumption at a given instant.  Energy management refers to the accumulated power saved over a period of time.

 

The energy saved through the application of DPNM is represented by the area between the dotted line and the solid white graph line below; the power consumed by the server is represented by the area under the solid white graph line.  Since power capping is in effect during relatively short periods, and when in effect the area between the dotted line and the guard rail is relatively small, it follows that the energy saved through the application of DPNM is small.

 

One mechanism for achieving significant energy savings calls for dividing a group of servers running an application into pools or "platoons".  If servers are placed in a sleeping state (ACPI S5 sleep) during periods of low utilization it is possible to bring their power consumption to less than 5 percent of their peak power consumption, basically just the power needed to keep the network interface controller (NIC) listening for a wakeup signal.

Platooning.png

As the workload diminishes, additional servers are moved into a sleeping state.  The process is reversible whereby servers are taken from the sleeping pool to an active state as workloads increase.  The number of pools can be adjusted depending on the application being run.  For instance, it is possible to define a third, intermediate pool of power capped servers to run lower priority workloads.  Capped servers will run slightly slower, depending on the type of workload.

 

Implementing this scheme can be logistically complex.  Running the application in a virtualized environment can make it considerably easier because workloads in low use machines can be migrated and consolidated in the remaining machines.

We are conducting experiments to assess the potential for energy savings.  Initial results indicate that these savings can be significant.  If you, dear reader, have been working in this space, I'd be more than interested in learning about your experience.

 

If this topic is of interest to you, please join us at the Intel Developer Forum in San Francisco at the Moscone Center on September 22-24.  I will be facilitating course PDCS003, "Cloud Power Management with the Intel(r) Xeon(r) 5500 Series Platform."  You will have the opportunity to talk with some of our fellow travelers in the process of developing power management solutions using Intel technology ingredients and get a feel for their early experience.  Also please make a note to visit booths #515, #710 and #712 to see demonstrations of the early end-to-end solutions these folks have put together.

The Intel(r) Dynamic Power Node Manager technology allows setting a power consumption target for a server under load as described in a previous article.  This is useful for optimizing the number of servers in a rack when the rack is subject to a power budget.


Higher level software can use this capability to implement sophisticated power management schemes, especially schemes that involve server groups.  The range of control authority for servers in the Nehalem generation is significant: the power consumption of a fully loaded server consuming 300 watts can be rolled back by roughly 100 watts.  In virtualized utility computing environments additional control authority is possible by migrating virtual machines out of a host and consolidating them into fewer hosts.  The power consumption of the power-capped host, now at 200 watts, can be brought down by another 50 watts, to 150 watts.


The reader might ask about the possibility of constantly running servers in capped mode to save energy.  Unfortunately capping entails a performance tradeoff.  The dynamic is not unlike driving an automobile.  The best mileage is obtained by running the vehicle at a constant 35 MPH.  This is not practical on a freeway where the prevailing speed is 60 MPH: the vehicle could be rear-ended, or, for a more mundane motivation, the driver goes 60 MPH because she wants to get there sooner.  Like a server, the lowest fuel consumption in a running vehicle, at least in gallons per hour, is attained when the vehicle is idling.  No real work is done with an idling engine, but at least the vehicle can start moving in no time.  Continuing with the analogy, turning a server off is equivalent to storing the car in the garage with the engine stopped.


This document provides an example of the performance tradeoff with power capping.  Please see Figure 2 on page 5.


The following example illustrates how group power capping works.  The plot is a screen capture of the Intel(r) Data Center Manager software managing the power consumption of a cluster of four servers.  The four servers are divided into two sub-groups of two servers each, labeled low priority and high priority.

 

DCM-GUI.png

 

The light blue band represents the focus of the plot. The focus can be changed with a simple mouse click.  The current focus in the figure is the whole rack.  Hence the power plot is the aggregated power for all four servers in a rack.  If the high priority sub-group were selected, then the power shown would be the power consumed by the two servers in that sub-group.  Finally, if a single server is selected, then the power indicated would be the power for that server only.


There are four lines represented in the graph.  The top line is the plate power.  It represents an upper bound for the servers’ power consumption.  For this particular group of servers the plate power is 2600 watts.  The servers are identical, and hence each is rated at 2600 / 4 = 650 watts.


The next line down is the derated power.  Most servers will not have every memory slot or every hard drive tray populated.  The derated power is the data center operator’s estimate of the upper bound for power consumption based on the actual configuration of the server.  The derated power is still a conservative guess, considerably higher than the actual power consumption of the server; as a rule of thumb, it is about 70 percent of the nameplate.  The derated power has been set at 1820 watts for the rack, or 455 watts per server.


Finally, the gold line represents the actual power consumed by the server.  The dots represent successive samples taken from readings from the instrumented power supplies. 


The servers are running at full power using the SPECpower benchmark.  The rack is collectively consuming a little less than 1300 watts.  At approximately 16:12 a policy is introduced to constrain power consumption to 1200 watts.  DCM instructs individual nodes to reduce power consumption by lowering the set points for Node Manager in each node until the collective power consumption reaches the desired target.
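The group-level loop can be sketched roughly as follows.  The greedy walking-down of per-node set points is an illustration of the idea, not Intel DCM's actual control algorithm, and the helper functions are hypothetical.

# Rough sketch of group power capping: lower per-node set points until the
# group's measured power meets the target.  Illustration only; the helpers
# (read_power, set_node_cap) are hypothetical placeholders.

def enforce_group_cap(nodes, group_target_watts, read_power, set_node_cap,
                      step_watts=10, max_iterations=100):
    """Walk per-node caps down until the summed readings meet the group target."""
    caps = {node: read_power(node) for node in nodes}   # start from current draw
    for _ in range(max_iterations):
        if sum(read_power(node) for node in nodes) <= group_target_watts:
            break
        hottest = max(nodes, key=read_power)             # trim the hungriest node first
        caps[hottest] = max(caps[hottest] - step_watts, 0)
        set_node_cap(hottest, caps[hottest])
    return caps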

 

When we instructed Data Center Manager to hold a power cap for the group rack (2), it made an effort to maintain power at that level, in spite of unavoidable disturbances in the system.

 

The source of the disturbances can be internal or external.  An internal disturbance can be the server fans switching to a different speed causing a power spike or dip.  Workloads in servers go up and down, with a corresponding uptick or dip in the power consumption for that server.  An external disturbance could be a change in the feed voltage or an operator action.  In fact at T = 16:14 we introduced a severe disturbance: we brought the workload of the bottom server, epieg3urb07 down to idle. 


Note that it takes a few seconds for Data Center Manager to react and settle at the target power level.  Likewise, when the bottom server is brought to idle, the power consumption for the group is also pulled back.  However, the group power went back to the target power consumption after a couple of minutes.  If we look at the plots of the individual servers, we can see Data Center Manager at work maintaining the target power.

 

Combined Power.png

 

The figure above captures the behavior of the individual servers.  Note how DCM allocates power to individual nodes while maintaining the global power cap.  When the server at the bottom is suddenly idled, there is a temporary dip in server power consumption for the group, but it soon recovers to the target capped level.  Also note that the power not used by the bottom server is reallocated to the remaining three nodes until they get close to their previously unconstrained levels.

 

 

In this installment on uses of server power management, we continue the discussion of applications of this capability beyond increasing server rack density.

 

Intel(r) Data Center Manager (Intel DCM) is a software development kit that can provide real time information to optimize data center operations.  It provides a comprehensive list of publish/subscribe event mechanisms that can form the basis of a sophisticated data center management infrastructure integrating multiple applications where applications get notified of relevant thermal and power events and can apply appropriate policies.

 

These policies can span a wide range of potential actions:  dialing back power consumption to bring it down below a reference threshold or to reduce thermal stress on the cooling system.  Some actions can be complex, such as migrating workloads across hosts in a virtualized environment, powering down equipment or even performing coordinated actions with building management systems.

 

Intel DCM also provides inlet temperature (front panel) readings along with a historical record that can be used to identify trouble spots in the data center.  This information provides insights to optimize the thermal design of the data center.  The actions needed to fix trouble spots need not be expensive at all; they may involve no more than relocating a few perforated tiles or installing blanking panels and grommets to minimize air leaks in the raised metal floor.  Traditionally, the hardest part has been identifying the trouble spots, which involves time-consuming temperature and air flow measurements.  Intel Data Center Manager provides much of this data ready-made from normal operations.  Typically this type of analysis is done by a consulting team, and the cost of the exercise is high, anywhere between $50,000 and $150,000 for a 25,000 square foot data center.  The analysis yields a single snapshot in time which becomes gradually more inaccurate as the equipment in the data center is refreshed and reconfigured.

 

Deployment scaling can range from a small business managing a few co-located servers in a shared rack in a multi-tenant environment to organizations managing thousands of servers.

 

The event handling capability is a software abstraction implemented by the Intel DCM SDK running in a management console.  From an architectural perspective, and given that the number of managed nodes can range into the hundreds, it makes more sense to implement this capability as software rather than firmware; Node Manager is implemented as firmware and typically controls one server.  The choice of an SDK over a self-standing management application was also deliberate.  Although Intel DCM comes with a reference GUI to manage a small number of nodes as a self-standing application, it shines when used as a building block for higher level management applications.  The integration is done through a Web services interface.  Documentation for Intel DCM can be found at http://software.intel.com/sites/datacentermanager/.
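Stripped to its essentials, the integration pattern looks something like the sketch below.  The function and event names are hypothetical placeholders, not the actual Intel DCM web services API; the SDK documentation linked above has the real interface.

# Hypothetical sketch of wiring a management application to power/thermal events.
# Names below are placeholders, NOT the actual Intel DCM web services API.

def lower_group_cap(group, step_watts):      # placeholder policy action
    print(f"lowering cap on {group} by {step_watts} W")

def migrate_workloads_off(server):           # placeholder policy action
    print(f"migrating workloads off {server}")

def on_event(event):
    """Example policy hook: map incoming events to management actions."""
    if event["type"] == "POWER_THRESHOLD_EXCEEDED":
        lower_group_cap(event["group"], step_watts=50)
    elif event["type"] == "INLET_TEMP_HIGH":
        migrate_workloads_off(event["server"])

# A subscriber registered with the management layer would invoke on_event for
# each published power or thermal event, for example:
on_event({"type": "POWER_THRESHOLD_EXCEEDED", "group": "rack-42"})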

In a previous article we explored the implementation mechanisms for monitoring and controlling the power consumed by data center servers.  In this article we'll see that the ability to trim the power consumed by servers at convenient times is a valuable tool to reduce stranded power and take maximum advantage of the power available under the existing infrastructure.  Let's start with a small example and figure out how to optimize the power utilization in a single rack.

 

 

Forecasting the power requirements for a server over the product’s lifetime is not an easy exercise.  Server power consumption is a function of server hardware specifications and the associated software and workloads running on them. Also the server’s configuration may change over time: the machine may be retrofitted with additional memory, new processors and hard drives. This challenge is compounded by more aggressive implementations of power proportional computing: servers of a few years ago exhibited little variability between power consumption at idle and power consumption at full power.

 

 

 

While power proportional computing has brought down the average power consumption, it also has increased its variance significantly, that is, data center administrators can expect wide swings in power consumption during normal operation.

 

Under-sizing the power infrastructure can lead to operational problems during the equipment’s lifetime: it may become impossible to fully load racks due to supply power limitations or because hot spots start developing.  Extra data center power capacity needs to be allocated for the rare occasions when it might be needed, but in practice it cannot be used because it is held in reserve, leading to the term "stranded power."

 

 

 

One possible strategy is to forecast power consumption using an upper bound.  The most obvious upper bound is the plate power, that is, the power in the electrical specifications of the server, a number guaranteed never to be exceeded.  Throwing power at the problem is not unlike throwing bandwidth at the problem in network design to compensate for the lack of bandwidth allocation capability and QoS mechanisms.  This approach is overly conservative because the power infrastructure is sized by adding up the assumed peak power of every server, a worst case exceedingly unlikely to occur over the equipment’s lifetime.

 

 

 

The picture is even worse when we realize that IT equipment represents only 30 to 40 percent of the power consumption in the data center as depicted in the figure below.  This means that the power forecasting in the data center must not only include the power consumed by the servers proper, but also the power consumed by the ancillary equipment, including cooling, heating and lighting, which can be over twice the power allocated to servers.

 

Establishing a power forecast and sizing a data center based on nameplate power will lead to gross overestimation of the actual power needed and unnecessary capital expenses[1].  The over-sizing of the power infrastructure serves as insurance for the future, compensating for the large uncertainty in the power consumption forecast; it does not reflect actual need.

 

pyramid.png

 

Power allocation in the data center.

 

A more realistic approach is to de-rate the plate power to a percentage determined by the practices at a particular site.  Typical numbers range between 40 percent and 70 percent.  Unfortunately, these numbers represent a guess intended to cover a server’s lifetime and are still overly conservative.

 

Intel(r) Data Center Manager provides a one year history of power consumption that allows a much tighter bound for power consumption forecasting.  At the same time, it is possible to limit power consumption to ensure that group power consumption does not exceed thresholds imposed by the utility power and the power supply infrastructure.
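As a sketch of why a measured history tightens the bound, compare how many servers fit in a rack budget under nameplate, derated and history-based estimates.  All of the numbers below are illustrative assumptions, not figures from any particular deployment.

# Sketch: provisioning bound from a measured power history versus nameplate.
# The sample history, margin and budget are illustrative assumptions only.

NAMEPLATE_W = 650     # illustrative plate power per server
DERATED_W = 455       # ~70 percent of nameplate
RACK_BUDGET_W = 6000

def measured_bound(power_history_watts, margin=1.1):
    """Peak of the observed history plus a safety margin."""
    return max(power_history_watts) * margin

history = [210, 230, 250, 280, 310, 295, 260]   # illustrative samples (watts)
bound = measured_bound(history)                  # 341 W

print(RACK_BUDGET_W // NAMEPLATE_W,              # 9 servers by nameplate
      RACK_BUDGET_W // DERATED_W,                # 13 servers derated
      RACK_BUDGET_W // int(bound))               # 17 servers by measured history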

 

 

 

Initial testing performed with Baidu and China Telecom indicates that it is possible to increase rack density by 40 to 60 percent using a pre-existing data center infrastructure.

 

 

 

We will explore other uses in subsequent articles such as managing servers that are overheating and dynamically allocating power to server sub-groups depending on the priority of the applications they run.


[1] Determining Total Cost of Ownership for Data Center and Network Room Infrastructure, APC Paper #6, and Avoiding Costs from Oversizing Data Center and Network Room Infrastructure, APC Paper #37, http://www.apc.com

The recently introduced Intel® Xeon® processor 5500 series, formerly code named Nehalem, brings a number of power management features that improve energy efficiency over previous generations, such as a more aggressive implementation of power proportional computing.  Depending on the server design, users of Nehalem-based servers can expect idle power consumption that is about half of the power consumed at full load, down from about two thirds in the previous generation.

 

A less heralded capability of this new generation of servers is that users can actually adjust the server's power consumption and therefore trade off power consumption against performance.  This capability is known as power capping.  The power capping range is not insignificant: for a dual-socket server consuming about 300 watts at full load, the capping range is on the order of 100 watts, that is, power consumption can be ratcheted down to about 200 watts.  The actual numbers depend on the server implementation.

The application of this mechanism for servers deployed in a data center leads to some energy savings.  However, perhaps the most valuable aspect of this technology is the operational flexibility it confers to data center operators.

This value comes from two capabilities:  First, power capping brings predictable power consumption within the specified power capping range, and second, servers implementing power capping offer actual power readouts as a bonus: their power supplies are PMBus(tm) enabled and their historical power consumption can be retrieved through standard APIs.

With actual historical power data, it is possible to optimize the loading of power limited racks, whereas before the most accurate estimation of power consumption came from derated nameplate data.  The nameplate estimation for power consumption is a static measure that requires a considerable safety margin.  This conservative approach to power sizing leads to overprovisioning of power.  This was OK in those times when energy costs were a second order consideration.  That is not the case anymore.

This technology allows dialing in the power consumed by groups of over a thousand servers, providing a power control authority of tens of thousands of watts in a data center.  How does power capping work?  The technology implements power control by taking advantage of the CPU voltage and frequency scaling implemented by the Nehalem architecture.  The CPUs are among the most power-consuming components in a server.  If we can regulate the power consumed by the CPUs, we can affect the power consumed by the whole server.  Furthermore, if we can control the power consumed by the thousands of servers in a data center, we'll be able to alter the power consumed by that data center.

Power control for groups of servers is attained by composing the power control capabilities of the individual servers.  Likewise, power control for a server is attained by composing CPU power control, as illustrated in the figure below.  We will explain each of these constructs in the rest of this article.

hierarchy.png

Conceptually, power control for thousands of servers in a data center is implemented through a coordinated set of nested mechanisms.

 

The lowest level is implemented through frequency and voltage scaling: laws of physics dictate that, for a given architecture, power consumption is proportional to the CPU's frequency and to the square of the voltage used to power the CPU.  There are mechanisms built into the CPU architecture that allow a certain number of discrete combinations of voltage and frequency.  Using the ACPI standard nomenclature, these discrete combinations are called P-states; the highest performing state is nominally identified as P0, and the lower power consumption states are identified as P1, P2 and so on.  A Nehalem CPU supports about ten states, the actual number depending on the processor model.  For the sake of an example, a CPU in P0 may have been assigned a voltage of 1.4 volts and a frequency of 3.6 GHz, at which point it draws about 100 watts.  As the CPU transitions to lower power states, it may reach a state P4 running at 2.8 GHz on 1.2 volts and consuming about 70 watts.
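
As a rough illustration of that proportionality, the following Python sketch scales the example P0 operating point to a lower voltage and frequency.  It models dynamic power only, so it will not reproduce the roughly 70 watts quoted above exactly; real processors also dissipate leakage and uncore power.

# Minimal sketch of the scaling relation P ~ f * V^2 (dynamic power only).

def scaled_power(base_watts, base_ghz, base_volts, new_ghz, new_volts):
    """Scale a known operating point to another voltage/frequency pair."""
    return base_watts * (new_ghz / base_ghz) * (new_volts / base_volts) ** 2

# The example P0 point from the text: 3.6 GHz at 1.4 volts, about 100 watts.
p4_estimate = scaled_power(100.0, 3.6, 1.4, 2.8, 1.2)
print(f"Estimated dynamic power at 2.8 GHz / 1.2 V: {p4_estimate:.0f} W")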

 

The P-states by themselves can't control the power consumed by a server, and the CPU itself has no mechanism to measure the power it consumes.  That function is implemented by firmware running in the Nehalem chipset.  This firmware implements the Intel® Dynamic Node Power Management technology, or Node Manager for short.  If what we want is to measure the power consumed by a server, looking only at CPU consumption does not provide the whole picture.  For this purpose, the power supplies in Node Manager-enabled servers provide actual power readings for the whole server.  It is now possible to establish a classic feedback control loop in which a target power is compared against the actual power indicated by the power supplies.  The Node Manager code moves the P-states up or down until the desired target power is reached.  If the desired power lies between two P-states, the Node Manager code rapidly switches between the two states until the average power consumption meets the set power.  This is an implementation of another classic control scheme, affectionately called bang-bang control for obvious reasons.
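
The sketch below illustrates the shape of that feedback loop in Python.  It is not the Node Manager firmware; the P-state power table and the simulated power reading are hypothetical stand-ins for the PMBus readouts and P-state controls described above.

import random

PSTATE_WATTS = [300, 280, 262, 245, 230, 216, 204, 194, 186, 180]  # assumed P0..P9
current_pstate = 0

def read_server_power():
    # Stand-in for a PMBus power reading: P-state power plus workload noise.
    return PSTATE_WATTS[current_pstate] + random.uniform(-5.0, 5.0)

def control_step(target_watts):
    """Nudge the P-state one step toward the power target."""
    global current_pstate
    measured = read_server_power()
    if measured > target_watts and current_pstate < len(PSTATE_WATTS) - 1:
        current_pstate += 1          # move to a lower-power state
    elif measured < target_watts and current_pstate > 0:
        current_pstate -= 1          # move to a higher-performance state

# When the target lies between two P-states, repeated iterations dither between
# them, so the average power settles on the cap: the bang-bang behavior.
for _ in range(200):
    control_step(target_watts=250.0)
print("Settled near P-state", current_pstate)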

NM.png

From a data center perspective, regulating the power consumption of a single server is not an interesting capability by itself.  We need the means to control servers as a group, and just as we were able to obtain power supply readouts for one server, we need to monitor the power for the group of servers in order to meet a global power target for that group.  This function is provided by a software development kit (SDK), the Intel® Data Center Manager, or Intel DCM for short.  Notice that DCM implements a feedback control mechanism very similar to the one that regulates power consumption for a single server, but at a much larger scale.  Instead of watching one or two power supplies, DCM oversees the power consumption of multiple servers, or "nodes", whose number can range into the thousands.
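
The group-level loop can be pictured the same way.  The Python sketch below reads hypothetical per-node power figures, checks the total against a group budget, and derives per-node caps when the group is over budget.  The proportional split is just one simple policy for illustration; it is not the algorithm Intel DCM itself implements.

def group_power_caps(node_watts, group_budget_watts):
    """node_watts: {node_id: measured watts}.  Return per-node caps in watts."""
    total = sum(node_watts.values())
    if total <= group_budget_watts:
        return {}                                  # within budget, nothing to do
    scale = group_budget_watts / total
    return {node: watts * scale for node, watts in node_watts.items()}

# Hypothetical fleet of 40 nodes drawing 280-319 watts each, 10 kW group budget.
readings = {f"rack1-node{i:02d}": 280.0 + i for i in range(40)}
caps = group_power_caps(readings, group_budget_watts=10_000.0)
for node in sorted(caps)[:3]:
    print(node, round(caps[node]), "W")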

 

dcm.png

 

Intel DCM was purposely architected as an SDK to serve as a building block for industry players to build more sophisticated and valuable capabilities for the benefit of data center operators.  One possible application is shown below, where Intel DCM has been integrated into a Building Management System (BMS) application.  Some Node Manager-enabled servers come with inlet temperature sensors.  This allows the BMS application to monitor the inlet temperature of a group of servers; if the temperature rises above a certain threshold, it can take a number of measures, from throttling back the power consumed, to reduce the thermal stress on that particular area of the data center, to alerting system operators.  The BMS can also coordinate the power consumed by the server equipment with, for instance, the CRAC fan speeds.
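
A minimal sketch of that thermal response logic might look as follows; the inlet threshold, the cap reduction step, and the function itself are assumptions for illustration, not part of any BMS or Intel DCM interface.

INLET_LIMIT_C = 27.0       # assumed inlet temperature threshold
CAP_STEP_WATTS = 500.0     # assumed reduction applied to the zone's power cap

def thermal_policy(zone_inlet_c, zone_cap_watts):
    """Tighten the zone power cap and alert when the inlet temperature is high."""
    if zone_inlet_c > INLET_LIMIT_C:
        print(f"ALERT: inlet at {zone_inlet_c:.1f} C exceeds {INLET_LIMIT_C:.1f} C")
        return max(zone_cap_watts - CAP_STEP_WATTS, 0.0)
    return zone_cap_watts

print(thermal_policy(zone_inlet_c=29.3, zone_cap_watts=12_000.0))   # prints 11500.0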

 

DataCenter.png

With this discussion we have barely begun to scratch the surface of the capabilities of the family of technologies implementing power management.  In subsequent notes we'll dig deeper into each of the components and explore how they are implemented, how these technologies can be extended, and the extensive range of uses to which they can be applied.

In our previous post we noted that the state of the art in power monitoring for virtualized environments is much less advanced than power monitoring applied to physical systems.  There is a larger historical context, as well as economic implications for the planning and operation of data centers, that makes this problem worth exploring.

Let's look at a similar dynamic in a different context.  In the region of the globe where I grew up, water used to be so inexpensive that residential use was not metered.  The water company would charge a fixed amount every month and that was it.  Hence, tenants in an apartment building would never see a water bill.  The water bill was a predictable cost component in the total cost of the building and was included in the rent.  Water was essentially an infinite resource and, reflecting this fact, there were absolutely no incentives in the system for residents to rein in water use.

As the population increased, water became an increasingly precious and expensive resource.  The water company started installing residential water meters but, bowing to tradition, landlords continued to pay the bills, which were still a very small portion of the overall operating costs.  Tenants still had no incentive to save water because they did not see the water bill.

Today there are very few regions in the world where water can be treated as an infinite resource.  The cost of water has increased so much faster than other cost components that landlords decided to expose this cost to tenants.  Hence the practice of tenants paying for the specific consumption of the unit they occupy is common today.  Also, because consumption is exposed at the individual unit level, the historical data can be used as the basis for implementing water conservation policies, for instance charging penalty rates for use beyond a certain threshold.

The use of power in the data center has been following a similar trajectory.  For many years the cost of power was a noise-level item in the cost of operating a data center.  It was practical to fold the cost of electricity into the facilities bill, and hence IT managers would never see the energy costs.  This situation is changing as we speak.  See, for instance, this recent article in Computerworld.

Recent Intel-based server platforms, such as the existing Bensley platform and, more recently, the Nehalem-EP platform to be introduced in March, come with instrumented power supplies that allow monitoring and control of power use at the individual server level.  This information allows compiling a historical record of actual power use that is much more accurate than the traditional method of using derated nameplate power.

The historical information is useful for data center planning purposes because it delivers a much tighter forecast, beneficial in two ways: it reduces the need to over-specify the power designed into the facility, or it maximizes the amount of equipment that can be deployed for a fixed amount of available power.

 

From an operational perspective, we can expect ever more aggressive implementations of power proportional computing in servers, where we see large variations between power consumed at idle and power consumed at full load.  Ten years ago this variation used to be less than 10 percent; today 50 percent is not unusual.  Data center operators can therefore expect wider swings in data center power demand.  Server power management technology provides the means to manage these swings and stay within a data center's power envelope while maintaining existing service level agreements with customers.

There is still one more complication: with the steep adoption of virtualization in the data center over the past two years, starting with consolidation exercises, an increasing portion of business is being transacted on virtualized resources.  In this new environment, using a physical host as the locus for billing power may not be sufficient anymore, especially in multi-tenant environments where the cost centers for the virtual machines running in a host may reside in different departments or even in different companies.

It is reasonable to expect that this mode of fine-grained power management at the virtual machine level will take root in cloud computing and hosted environments, where resources are typically deployed in virtualized form.  Fine-grained power monitoring and management makes sense in an environment where energy and carbon footprint are major TCO components.  To the extent that energy costs are exposed to users along with the MIPS consumed, this information provides the checks and balances, and the data, needed to implement rational policies for managing energy consumption.

Based on the considerations above, we see a maturation process for power management practices in a given facility happening in three stages.

 

  1. Stage 1: Undifferentiated, one bill for the whole facility.  Power hogs and energy-efficient equipment are thrown into the same pile.  Metrics to weed out inefficient equipment are hard to come by.
  2. Stage 2: Power monitoring is implemented at the physical host level, exposing inefficient equipment.  Many installations are feeling the pain of rising energy costs, but organizational inertia prevents passing those costs to IT operations.  Power monitoring at this level may be too coarse-grained, too little, too late for environments that are rapidly transitioning to virtualization with inadequate support for multi-tenancy.
  3. Stage 3: Power monitoring encompasses virtualized environments.  This capability aligns power monitoring with the unit in which value is delivered to customers.

Given the recent intense industry focus on data center power management and the furious pace of virtualization adoption, it is remarkable that the subject of power management in virtualized environments has received relatively little attention.

 

It is fair to say that power management technology has not caught up with virtualization.

 

Here are a few thoughts on this particular subject, which I intend to elaborate on in subsequent transmittals.

 

For historical reasons, the power management technology available today had its inception in the physical world, where the watts consumed in a server can be traced to the watts that came through the utility power feeds.  Unfortunately, the semantics of power in virtual machines have yet to be comprehensively defined to industry consensus.

 

For instance, assume that the operating system running in a virtual image decides to transition the system to the ACPI S3 state (suspend to RAM).  What we have now is the state of the virtual image preserved in the image's memory with the virtual CPU turned off.

 

Assuming that the system is not paravirtualized, the operating system can't tell whether it's running on a physical or a virtual instance.  The effect of transitioning to S3 will be purely local to the virtual machine.  If the intent of the system operator was to transition the machine to S3 to save power, it does not work this way: the virtual machine still draws resources from the host machine and requires hypervisor attention.  Transitioning the host itself to S3 may not be practical, as there might be other virtual machines still running that are not ready to go to sleep.

 

Consolidation is another technique for reducing data center power consumption by driving up server utilization rates.  Consolidation for power management is a blunt tool: applications that used to run in dedicated physical servers are virtualized and squeezed into a single physical host.  The applications are sometimes strange bedfellows.  Profiling might have been done to make sure they could coexist, but as an a priori, static exercise with the virtual machine instances treated as black boxes.  There is no attempt to look at the workload profiles inside each virtualized instance in real time.  Power savings come as an almost wishful side effect of repackaging applications formerly running in dedicated servers into virtualized instances.

 

A capability to map power to virtual machines in both directions, from physical to virtual and from virtual to physical, would be useful from an operational perspective.  The challenge is twofold: first, from a monitoring perspective, there is no commonly agreed method yet to prorate host power consumption to the virtual instances running within; and second, from a control perspective, it would be useful to schedule or assign power consumption to virtual machines, allowing end users to make a tradeoff between power and performance.  Fine-grained power monitoring would allow prorating power costs to application instances, introducing useful pricing checks and balances that encourage sensible energy consumption, instead of the more common practice today of hiding energy costs in the facility costs.
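
To make the monitoring half of the challenge concrete, here is one possible prorating scheme sketched in Python.  It splits measured idle power evenly among resident virtual machines and divides the dynamic portion by each machine's share of CPU time; as noted above, no such method has been agreed on by the industry, and all the figures are hypothetical.

def prorate_host_power(host_watts, idle_watts, vm_cpu_seconds):
    """Return an estimated per-VM share of the measured host power."""
    total_cpu = sum(vm_cpu_seconds.values()) or 1.0
    dynamic = max(host_watts - idle_watts, 0.0)
    per_vm_idle = idle_watts / max(len(vm_cpu_seconds), 1)
    return {vm: per_vm_idle + dynamic * cpu / total_cpu
            for vm, cpu in vm_cpu_seconds.items()}

# Hypothetical host: 320 W measured, 160 W at idle, three tenant virtual machines.
shares = prorate_host_power(host_watts=320.0, idle_watts=160.0,
                            vm_cpu_seconds={"vm-a": 120.0, "vm-b": 60.0, "vm-c": 20.0})
for vm, watts in shares.items():
    print(vm, round(watts, 1), "W")

Whatever split is chosen, the per-machine figures add up to the host's measured consumption, which is the property that makes prorating attractive as a basis for billing in multi-tenant environments.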
