
The Data Stack

40 Posts authored by: Winston_Saunders

I know… you’re probably thinking, “what the *?” The phrases “cost effective” and “HPC” seem about as rarely seen together as a double bill of Ferris Bueller’s Day Off and Rocky Horror Picture Show. But, in fact, with the rapidly expanding efficiency and performance capability of supercomputing systems, electricity costs in large-scale machines may warrant deeper scrutiny of the need for newer hardware.

What got me thinking about this was a startling realization about systems near the lower left “Corner of Inefficiency” of the familiar Exascalar plot below.

 

The systems near that lower left corner are almost a factor of one hundred less efficient than the most efficient systems of comparable performance. In other words, they consume about one hundred times the energy for comparable work. This can be a big deal: it is the difference between, say, a 20 kW system and a 2.0 MW system doing the same work. If you think about the cost of electricity, there could be some real ROI there.

So the question is how to visualize that difference in cost. The point of what I discuss below is not to provide an accurate cost analysis for every application, but to show how this general framework can be put to use.

Costs of supercomputers, especially those at the forefront of innovation, are difficult to estimate. For the purposes here I chose to use a published cost of the Lawrence Livermore Labs Sequoia computer as the anchor point for this analysis. For comparison, read about the ORNL supercomputer here. Assuming a constant $/flops, one can easily scale capital cost according to performance. This scaling is shown as the horizontal lines in the Figure below.

Electricity costs also vary widely from location to location. Industrial electricity rates are actually falling in the US, but for the sake of simplicity I have assumed $0.07/kWh, about the average industrial rate in the US, with a facility PUE of 1.6. This translates, conveniently, to a total energy cost of about $1/(Watt*Year). You can see system-level annualized energy costs in the Figure.

From this point it is pretty straightforward to calculate a payback time for replacing inefficient servers. It’s interesting that the times for return on investment show up as vertical lines in the Figure. It’s astounding that they are so short. In several cases, less than a year!
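To make the arithmetic concrete, here is a minimal sketch of the payback calculation in Python. The $1/(Watt*Year) energy cost and the factor-of-one-hundred efficiency gap come from the discussion above; the capital cost per Gflops is a placeholder you would set from a published system cost, not a quoted figure.

```python
# Sketch of the payback arithmetic described above; all inputs are illustrative.
ENERGY_COST_PER_WATT_YEAR = 1.0   # ~$0.07/kWh * 8760 h/yr * PUE 1.6, as above
COST_PER_GFLOPS = 12.0            # placeholder capital cost in $/Gflops

def payback_years(perf_gflops, old_mflops_per_watt, new_mflops_per_watt):
    """Years of electricity savings needed to repay a new system of equal performance."""
    old_watts = perf_gflops * 1000.0 / old_mflops_per_watt
    new_watts = perf_gflops * 1000.0 / new_mflops_per_watt
    annual_savings = (old_watts - new_watts) * ENERGY_COST_PER_WATT_YEAR
    return (perf_gflops * COST_PER_GFLOPS) / annual_savings

# A 100 Tflops system that is 100x less efficient than a 2000 Mflops/Watt leader:
print(payback_years(100_000, 20, 2000))   # about 0.24 years with these inputs
```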

Again, this is not intended to be a definitive analysis of return on investment or total cost of supercomputer ownership. But I think this initial estimate is provocative enough to warrant further investigation. To me it looks like millions are on the table.

So, what are you waiting for?

I’ll admit it. I read with some dismay the stories about the collapsing market for European carbon dioxide emissions allowances, as covered in the New York Times.

 

Without going into great detail, I feel the most economically efficient way to reduce the deferred costs of carbon emissions is to simply set a price for them today. Does buying roses grown in Africa produce more or less carbon than roses grown in Holland? I’d say a carbon analysis of that supply chain borders on too complex. But if carbon impacts were fairly encumbered with a cost at each point of use, an efficient market would prefer the lower-impact source to the extent that carbon futures affect the price of the commodity.

 

Sadly it appears we have taken a step away not only from that efficiency, but also from addressing the carbon problem.

 

So are the carbon credits DOA? After some digging, I think not. Here's why - carbon futures are traded on a market like any other commodity and are affected by the same supply and demand economics that affect the price of everything else. So let’s look at both.

 

The first thing I looked at was the fluctuating price of carbon compared with economic data, in this case changes in the combined GDP of four major European economies (Germany, France, the UK, and Italy). This shows an important correlation on the demand side: the amount of money available to chase EUA credits is limited.

[Figure: EUA carbon price compared with the combined GDP of Germany, France, the UK, and Italy]

 

 

Another correlation is with electricity generation, which is a reasonable proxy for the demand to use credits. Here I just looked at France and Germany together. Although Europe uses multiple sources for electricity, the majority comes from fossil fuels.

 

[Figure: EUA carbon price compared with electricity generation in France and Germany]

 

Again, the trend (at least in the available data) shows a pretty good correlation. This is yet another way to look at the demand side of the equation. With lower economic output driving lower demand for electricity, in turn driving lower demand for carbon allocations, the drop in price seems natural.

 

On the supply side, of course, the decision by the Commission not to limit the number of allocations guaranteed an abundant supply of credits.

So do I think the carbon market idea is dead? No, I don’t. The data “behind the curtain” support the idea that the falling price of carbon allocations is just a simple matter of supply and demand.

 

Will demand, and with it EUA prices, rise in the future? Of course they will. Hence, while there may not be any short-term imperative to invest in low-carbon and efficient technologies, smart industries should, I believe, be investing now, in the downturn, to gain advantages from efficiency in the longer term.

There has been sporadic concern about the energy use of “cloud” data centers, even as recently as last week’s New York Times. From the outside the concern is understandable; cloud data centers consume an enormous amount of energy, they are large visible entities, and their number is growing.

 

Looking at the surface of a problem is not the same thing as understanding it deeply. Since about 2006, when the first studies of data center energy use raised alarm bells, the industry response has been unified, focused, and socially responsible.

 

The Green Grid, the premier Industry group focused on resource efficient IT, was launched in 2006 to address systematic improvements in efficiency. The wide adoption of their PUE metric has brought focus and results. While data center infrastructure once consumed half of the data center power, infrastructure now consumes less than 10% for state-of-the-art data centers.

 

In this same time period there have been huge breakthroughs in server efficiency. Through work at Intel on energy-proportional computing, the energy used to perform typical computations has been falling by about 60% per year since 2006.

 

[Figure: generations of compounded efficiency growth]

 

This rate of improvement is far outside our normal experience and may be hard to fathom. Improving the fuel efficiency of a car at 6% per year over the same period would have increased mileage from 20 mpg to 28 mpg – not too bad. A 60% annual improvement rate would increase that mileage to over 300 mpg. Imagine filling your tank once every six months!
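A two-line check of that compounding arithmetic, over roughly the six years since 2006:

```python
# Compounding 20 mpg over six years at 6% per year and at 60% per year.
print(20 * 1.06 ** 6)   # ~28 mpg
print(20 * 1.60 ** 6)   # ~335 mpg, i.e. north of 300 mpg
```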

 

Finally, innovation in the cloud has helped to consolidate workloads and bring them from less efficiently used isolated “server rooms” to highly efficient shared cloud services. This sharing enhances the usage of all compute resources and leads to even greater efficiency.

 

While direct comparison is difficult, these gains can roll up into one “number” for the overall data center efficiency, figuring in the infrastructure effectiveness, computing efficiency, and how effectively all those resources are used. Taking accepted industry values, the cloud is at least a factor of six times more efficient than the “conventional” case.

 

Does the data center industry need to do more to improve technology adoption and efficiency? Absolutely. For example, many of the technologies adopted in the efficient cloud have been slow to penetrate inefficient legacy data centers. Has the industry been focused and responsible in its technical innovation? The answer is an unequivocal, “yes.” That work continues.

 

And above all, I think the broader use of computing at ever-improving efficiency will continue to enhance our lives.

 

Feel free to comment here or find me on Twitter @WinstonOnEnergy.

A couple of weeks ago I published a blog about the Exascalar analysis for the June 2012 Green500 data. The Exascalar analysis is a way of looking at the performance leadership of supercomputers that emphasizes the fundamental role of efficiency. Exascalar is a convenient measure of the “logarithmic distance” of a supercomputer from the Exascale goal of 10^18 flops in an envelope of 20 megawatts.

 

In my last blog I showed how the BlueGene/Q computer family – which previously demonstrated efficiency leadership – this time around achieved the performance scale to push it to the top of the Top500, Green500, and Exascalar ranking.

 

We can visualize this evolution by looking at the Exascalar analysis for the data from June 2011 to June 2012 on one graph.

[Figure: Exascalar plot overlaying the June 2011 and June 2012 data]

 

The red triangles represent the data points of the June 2011 list and are plotted on top of the June 2012 data to better emphasize which systems are new or have changed.

 

The column of points on the right-hand side of the graph is the BlueGene/Q systems, all with comparable efficiency around 2000 Mflops/Watt. The highest-performing system consumes nearly 8 MW of power.

 

The impact of these systems on the Exascalar list is further emphasized by the movement of the “top10” boundary. Only three of the “Top10” from a year ago still make the list, showing the fast evolution of the list. And it was efficiency, not just raw performance, that drove this turn-over.

 

The next cluster of systems, at around 1000 Mflops/Watt, was also not present a year ago. It is dominated by the Intel Xeon E5 family coupled with GPUs.

 

An interesting point is the Intel Xeon E5 coupled with MIC, which has an efficiency of 1300 Mflops/Watt. This system has a power consumption of just under 80 kW for a performance of 10^8 Mflops. It’s interesting to compare this to the lowest-efficiency system on the June 2012 graph. Also with a system performance near 10^8 Mflops, it has a power consumption of over 3.5 MW! The efficiency advantage of Xeon is truly amazing.

 

Building on the above, it's instructive to visualize the trend of Exascalar plotted against time.

[Figure: Top and Median Exascalar trends through June 2012]

In this graph I have shown both the Top and Median trends. As expected the median shows a much smoother progression than does the Top Exascalar trend due to the larger sample size.

 

A fit of the trends shows the Top Exascalar improving by about 0.35 per year (in log units), while the median is on a much slower cadence of 0.26 per year. The extrapolation of the Top Exascalar curve shows progress is reasonably on track toward the Exascale goal.
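As a rough sanity check on that extrapolation, take the top Exascalar to be about -2 today (roughly two orders of magnitude from the goal, as discussed in these posts) and treat the fitted 0.35-per-year rate as constant:

```python
# Back-of-envelope extrapolation of the Top Exascalar trend.
top_exascalar_2012 = -2.0   # roughly two orders of magnitude from the goal
rate_per_year = 0.35        # fitted improvement of the Top trend, in log units

years_to_goal = -top_exascalar_2012 / rate_per_year
print(2012 + years_to_goal)  # ~2017.7, i.e. reasonably on track for Exascale around 2018
```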

 

The Green500 and Top500 data are extremely rich in information, but it’s hard to visualize trends in one without understanding what is going on in the other. Hence the Exascalar analysis was born. There are many trends in the data, some of which I discussed here. For instance, I think we should expect much more from Xeon and MIC systems in the future. And of course, it will be interesting to see the next generation of efficiency and performance leaders emerge in subsequent editions.

 

Here’s a question to all my loyal readers - answer if you dare.

 

What do you think the shape (outline) of the Exascalar plot will be in 2013 and 2015?

 

Submit a link to a picture in the comments! I'll share my thoughts in a couple weeks.

Please note: A version of this blog appeared as an Industry Perspective on Data Center Knowledge.

 

 

 

The June 2012 Top500 and now the Green500 have been published, so it’s about time to update the Exascalar analysis.

 

Exascalar is a way to synthesize the information in both the Top500 Performance ranking and the Green500 Efficiency ranking of supercomputers into one graph oriented toward Exascale computing goals.

 

Recall that it is simply the logarithmic distance of a particular computer, in both performance and efficiency, from the Exascale goal of 10^18 flops in a 20 megawatt envelope. An Exascalar of -2 is a factor of 100 away from that goal.
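The exact normalization isn’t written out here, but one reading consistent with this description (and with the roughly -4 to -2 progression discussed below) is the negative Euclidean distance, in log10 units, from the goal point in the performance-efficiency plane. A sketch in Python, under that assumption:

```python
import math

# A sketch of Exascalar under one plausible normalization (an assumption, not the
# published formula): the negative log10-scale Euclidean distance from the goal of
# 10^18 flops in a 20 MW envelope (i.e. 50 Gflops/Watt).
GOAL_FLOPS = 1e18
GOAL_FLOPS_PER_WATT = 1e18 / 20e6

def exascalar(rmax_flops, power_watts):
    flops_per_watt = rmax_flops / power_watts
    d_perf = math.log10(GOAL_FLOPS / rmax_flops)
    d_eff = math.log10(GOAL_FLOPS_PER_WATT / flops_per_watt)
    return -math.sqrt(d_perf ** 2 + d_eff ** 2)

# Roughly 2000 Mflops/Watt at 8 MW, about the top BlueGene/Q figures mentioned in these posts:
print(exascalar(2000e6 * 8e6, 8e6))   # about -2.3
```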

 

Here is the Exascalar analysis of the just-published Green500 list.

 

[Figure: Exascalar plot of the June 2012 Green500/Top500 data]

 

 

There are several comments that can be made about the graph. The first is the characteristic “triangular” shape of the collected data points. This is due to the limitations power places on performance and a cut-off line in performance. This is also the first time I’ve seen what may be a cut-off in efficiency, but I will need to confirm that.

 

You can also see the dominant BlueGene/Q systems, which occupied roughly the top 20 slots of the Green500 list and for the first time also led the Top500. These are the “column” of dots on the right side of the graph.

 

The green line in the graph is the “top Exascalar” system over the last five years. Over the period, Exascalar has increased by a factor of 100 (from -4 to -2). Most of this gain has been in efficiency. In the future I expect almost all gains will be in efficiency (hence the structure of the Exascalar graph).

 

[Table: top ten Exascalar systems with their Top500 and Green500 rankings]

 

The list of the top Exascalar systems shows how the ranking stacks up against both the performance and efficiency rankings of the Top500 and Green500, respectively. The efficiency rankings of the BlueGene systems are a bit deceptive, since all top 20 Green500 systems were BlueGenes with the same (very high) efficiency. Overall, BlueGene takes six of the top ten spots. The SPARC64 system, formerly the top system, declined to number 3. There are three Xeon-based systems in the top ten.

 

Last fall the Top10 supercomputers didn’t change at all, but on the Green500 list there was quite a bit of action. This time, of course, the BlueGene/Q systems dominated both the Top500 and the Green500.

 

In my next blog I’ll discuss the evolution of Exascalar since the last publication.

I recently published a blog on Data Center Knowledge about how energy proportionality has essentially doubled server efficiency gains beyond what “Moore’s Law” alone would predict. This doubling was achieved by improving the “energy proportionality” of workload scaling.

 

A problem I’ve been stewing about is how to express proportionality in a simple way. There are of course lots of ways to think about it (many of them equivalent). So what I am going to do here is propose a generic idea to emphasize an interesting insight into the proportionality of Xeon, and let others argue about the details.

 

Let’s divide this into three parts: 1) A framework, 2) Practical application to SPECPower results and 3) Insight from historical trends.

 

Part 1. A Framework

 

Establishing a framework for a “proportionality index,” of course, has some arbitrariness to it. Following some simple ideas about “what works for managers,” let me just propose the following. Looking at the two hypothetical (and idealized) server load lines in the graph below, you can see that what distinguishes the ideally proportional Server “A” from the non-ideal Server “B” is the area between the two curves. We can use this area difference as a metric of proportionality.

 

[Figure: idealized load lines of an ideally proportional Server A and a non-ideal Server B]

 

It is relatively easy to show that the area difference between the straight line (B)

 

Power = b + (1-b) * Workload

 

and an ideal line (A) is

 

Area = b/2

 

Where b = Idle Power/Max Power. (The area under line B is b + (1-b)/2 = (1+b)/2, while the area under the ideal line is 1/2; the difference is b/2.)

 

We can turn this into a Server Proportionality Index (SPI) with the following formula

 

 

SPI = 1 - 2*Area = 1 - b

 

The index is zero for a server that has no energy scaling and one for a server with “ideal” linear scaling.

 

Part 2. Application to SPECPower

 

This idea can easily be applied to data readily available in the SPECpower benchmark.

 

In the figure below, some data from a recent measurement on a Dell PowerEdge server based on the Xeon E5-2600 are shown.

 

[Figure: SPECpower load line for a Dell PowerEdge server based on the Xeon E5-2600]

 

A little intuition will persuade you, and a little algebra will prove, that

 

SPI = 2(1 - AveragePower/PeakPower)

 

Where AveragePower is just the arithmetic average of the “average active power” measurements of SPECpower and PeakPower is the power at the targeted 100% load point. In the data set shown

 

SPI = 2(1 - (134 Watts)/(246 Watts)) = 0.90

 

This value is very close to ideal. Note that the idle power to max power ratio is about 0.21; a purely linear load line with that idle ratio would give an SPI of only about 0.79. The higher efficiency of the system at mid-load improves the SPI. This emphasizes the importance of measuring the whole load line and not just the end points.
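Here is a minimal sketch of that calculation in Python. The per-load-level readings come straight from a SPECpower report; the function simply reproduces the SPI formula above, and the 134 W / 246 W figures quoted give a value of about 0.9.

```python
# Server Proportionality Index from SPECpower-style data (a sketch of the formula above).

def spi(average_power, peak_power):
    """SPI = 2 * (1 - AveragePower / PeakPower)."""
    return 2 * (1 - average_power / peak_power)

def spi_from_readings(avg_active_powers, peak_power):
    """Start from the per-load-level 'average active power' readings instead."""
    return spi(sum(avg_active_powers) / len(avg_active_powers), peak_power)

# The figures quoted above for the Xeon E5-2600 based system:
print(round(spi(134, 246), 2))   # ~0.9
```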

 

Part 3. Understanding Trends

 

Going back to the SPECpower database, I pulled the historical trends of volume two-socket servers based on Intel Xeon processors and calculated the SPI for them all. I plotted the results in two ways to emphasize particular aspects of the dependence on the load line.

 

[Figure: SPI versus the Idle/Max power ratio (left) and the historical SPI trend across Xeon generations (right)]

 

The graph on the left shows the trend of SPI versus the ratio of Idle/Max Power. The points generally follow closely the “ideal line,” shown as a dashed line on the plot. The notable exception is the departure at the lower end, as noted above. This departure reveals very clearly why an improved way of thinking about proportionality is needed; the load lines of real servers are no longer well approximated by linear functions.

 

Looking at the second graph, the historical trend of proportionality reveals very clearly the architectural transitions between families of Xeon processors. With each successive generation the proportionality index (as measured here) has improved in the range of 0.2 per generation.

 

So there you have it: a simple way to analyze the energy proportionality of non-linear servers and a simple formula for calculating a “Proportionality Index.” The historical trend shows not only the clear deviation from linearity but reveals the major architectural transitions in Xeon processor families.

 

The data pose an interesting question: is a straight line really the “ideal load line?” It is very conceivable (and certainly theoretically possible) that we will see SPI > 1 in the near future. So is proportionality really the right end goal? What would be the ideal load line? Is there such a thing? Any opinions out there?

Please note: This blog originally appeared on Data Center Knowledge as an Industry Perspective.

 

 

As data centers have grown over the years, server power consumption has taken center stage in the IT theater. Electricity to power servers is now the biggest operational cost in the data center, and one of the biggest headaches for budget managers.

 

So how do you contain server power consumption? I suggest you begin by looking first at inefficient servers—the elephant in your data center. Old and inefficient servers not only consume more power than newer servers, but they do less work. That means you’re paying more to get less.

 

At Intel, we’ve had a laser focus on this issue for many years now, and the new Intel® Xeon® processor E5 family continues this focus. It addresses the efficiency problem on two key fronts: processor performance and power management.

 

To increase server performance, the Intel architecture builds hyperthreading technology into the processor. In simple terms, hyperthreading overlays instruction paths so that each physical core presents two logical cores, delivering a lot more throughput for the same amount of energy.

 

For further gains in power efficiency, the processor includes a turbo feature that allows energy to be focused where it is most needed. If a job running on one core needs more power, it can make use of the extra power headroom available on other cores to accelerate processing.

 

Other automated power management features in the new processor family include Intel Power Tuning Technology and Intel Intelligent Power Technology. Power Tuning uses on-board sensors to give you greater control over power and thermal levels across the system. Intelligent Power Technology automatically regulates power consumption.

 

With capabilities like these, the newest Intel Xeon processor product families deliver up to 70 percent more performance per watt than previous generations.[i],[ii] These gains help you flip the inefficiency ratio that comes with older servers. Rather than paying more to get less, you pay less to get more.

 

 


 


 

[i] Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

 

 

[ii] Source: Performance comparison using SPECfp*_rate_base2006 benchmark result at the same TDP. Baseline score of 271 on prior generation 2S Intel Xeon processor X5690 based on best publication to www.spec.org using Intel Compiler 12.1 as of 17 January 2012. For details, please see:  http://www.spec.org/cpu2006/results/res2012q1/cpu2006-20111219-19195.html.  New score of 466 based on Intel internal measured estimates using an Intel Canoe Pass platform with two Intel Xeon processor E5-2680, Turbo Enabled, EIST Enabled, Hyper-Threading Enabled, 64 GB RAM, Intel Compiler 12.1, THP disabled, Red Hat* Enterprise Linux Server 6.1.

PUE has been a hugely successful efficiency metric in quantifying the discussion of data center infrastructure efficiency. Of course, infrastructure is not the only thing in a data center, and we have proposed “SUE” as “Part B” of the data center efficiency equation to address the important aspect of compute efficiency. SUE is a similarly derived IT performance metric which is gaining traction in application.

 

Though neither metric is “perfect,” both have a low barrier for adoption and are meaningful in a big-picture perspective (so long as you don't get too hung up on tricking the metric at the expense of other important parameters). Another powerful aspect driving acceptance of PUE and SUE is that they fit easily into grammatical sentences. If your PUE is 2.0, you’re using twice the energy you need to support your current IT infrastructure. If your SUE is 2.0, you’re operating twice the number of servers you need to support your current IT workload. Both convey obvious business impact.

 

So what about the “holy grail,” data center work-efficiency?

 

There’s broad industry recognition of the problem (as Ian Bitterlin says, “it is the 1.0 that is consuming 70% of the power”), and a lot of work is going on to understand it. For instance, The Green Grid published the DCeE, a data center efficiency metric, back in 2010, based on a view toward quantifying “useful” work output of the data center.

 

However, this sophisticated approach really has to do with application-level details and has not yet gained wide industry traction. This is partly, I believe, because of its complexity; the barrier to entry is an investment in highly granular data analysis which is more than many operators need or will support.

 

So I asked myself, “what are the alternatives?” Can we lower the barrier to entry in the way PUE and SUE have done for infrastructure and IT efficiency and define a Data Center Capital Usage Effectiveness (DCUE) taken as the ratio of two quantities with units of “Work/Energy?”

 

Well, the short answer is, we can. The starting point is the very simple idea that:

 

Work/Energy = Integrated (Server Performance * Utilization)/(Total Data Center Energy)

 

The big assumptions are: 1) it assumes statistical independence of server performance and utilization, 2) it tacitly assumes CPU performance and utilization drive work output (though this simplifying assumption can be removed with more complexity), and 3) it neglects things like network and storage efficiency (which are minority energy consumers in most data centers). Not perfect, but tractable.

 

The DCUE formula has the advantage of providing an easy entrée into the analysis of the work efficiency of the data center; it focuses on what many consider the big three: infrastructure efficiency, IT equipment efficiency, and how effectively the capital asset is being utilized (thanks to Jon Koomey for pointing that out to me).

 

Roughly, here is how the numbers work (these are made up data but are representative based on experience): Imagine a typical data center with a PUE of 2.0. If the data center is on a refresh cycle of six years its SUE will be about 2.4, and the server utilization might be about 20% in an enterprise with a low level of virtualization.

 

An efficient data center might have a PUE closer to 1.3, a more aggressive three year server refresh rate with an SUE of about 1.6, and might increase utilization to 50% with both higher rates of virtualization and perhaps utilize technology like “Cloud Bursting” to handle demand-peaks.

 

The math reveals a Data Center Capital Usage Effectiveness (DCUE) opportunity of about 6 times between the two scenarios.

 

Data Center     PUE    SUE    Utilization    DCUE
“Typical”       2.0    2.4    20%            24
“Efficient”     1.3    1.6    50%            4

 

In fact, a “Cloud” DCUE could be even better (lower) with more aggressive server refresh, lower PUE, and higher utilization levels, whereas typical enterprise utilizations might be even lower, widening the gap.

 

My friend Mike Patterson here at Intel is always challenging, “so… what does it mean?” Well, just as PUE and SUE represent “excess” quantities, a DCUE of 24 means you are using about 24 times the energy you'd need at optimum efficiency, and hence about 24 times the data center capital. A pretty powerful argument to improve.
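The post doesn’t spell out the DCUE formula explicitly, but a combination that reproduces the table above is simply PUE times SUE divided by utilization. A sketch, under that assumption:

```python
# DCUE sketch, assuming the three factors combine multiplicatively
# (this reproduces the "Typical" and "Efficient" rows in the table above).

def dcue(pue, sue, utilization):
    return pue * sue / utilization

typical = dcue(2.0, 2.4, 0.20)      # 24
efficient = dcue(1.3, 1.6, 0.50)    # ~4.2
print(typical, efficient, typical / efficient)   # roughly a 6x opportunity
```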

 

So there you have it, "The Big Three" for data center capital efficiency: 1. How efficient is your infrastructure? 2. How effective is your server compute capability? 3. What is the utilization of your capital assets?

 

In subsequent blogs, I’ll talk more about these ideas and some of the issues we still need to think about. But until then, I'm curious what you think. Right track? Wrong track? Why?

Please note: This blog post originally appeared as an Industry Perspective on Data Center Knowledge.

 

 

Among important data center industry milestones this year is the fifth anniversary of The Green Grid, the premier international consortium for resource-efficient IT. Formed by eleven founding member companies in 2007, the organization grew rapidly and today boasts approximately 150 General and Contributing member companies and ten Board member companies. Since its formation the organization has contributed a tremendous amount to the data center “science” of efficiency.

 

Here is just a partial list of key results and contributions made by the Green Grid so far:

 

Harmonization of the PUE metric: Prior to the Green Grid there was no agreed standard to understand or compare the impact of infrastructure on data center efficiency. PUE is a great example of Peter Drucker’s adage, “What gets measured gets done.” Average reported PUEs have dropped from 2.2 in a 2006 LBNL study to 1.6 in a survey of TGG members in April 2011, a 50% reduction in overhead energy. In fact, PUE was just adopted by ASHRAE (pending public review) into Std 90.1.

 

The Green Grid Energy Star Project Management Office acts as a good-faith Industry Interface to the EPA for Energy Star Rating on data centers, servers, UPS’s and storage. The work done in the Green Grid has ironed out differences of opinion between industry members and, in my opinion, improved Energy Star by making it a user-relevant measure of efficiency. Several member companies have such confidence in the Green Grid’s work they decline to make individual company responses to the EPA.

 

Data centers use a lot of water, and the Green Grid, again taking the forefront, has developed a water usage effectiveness standard, WUE, to standardize the measurement and reporting of water usage and to encourage resource efficiency.

 

The Green Grid produced highly influential “Free Cooling” tools and maps to aid in data center site selection. The Green Grid has been among the leading voices advocating the use of free “outside air” and economizers for efficient cooling of data centers. Both approaches can substantially reduce the energy consumption of data centers compared to conventional reliance on air conditioners and air handlers alone. The maps have been downloaded more than 11,000 times in the last 2 years.

 

A Roadmap to the Adoption of Server Power Features. Published in 2010, it is one of the most (if not the most) comprehensive analyses available of server power management capability, how it is deployed, industry perception, and barriers to adoption. Strategic in nature, the study not only recommends concrete action today, but suggests future work to enable this fundamental aspect of data center efficiency.

 

The comprehensive Data Center Maturity Model, which helps data center operators quickly assess opportunities for greater sustainability and efficiency in their data center operations. Released just a year ago, it’s a popular invited talk at international conferences, not only for the results it promises today, but for the five year roadmap it lays out for the industry.

 

The Green Grid has facilitated international agreements with Japanese and European efficiency organizations, and with folks like ASHRAE, ODCA, and the Green500. Interested in how “containers” affect data center efficiency? There’s a Green Grid Container Task Force on that.

 

But the story doesn’t stop there. Ongoing work will quantify software efficiency and develop Productivity Proxies to help measure data center work output in standardized and user-relevant ways. Further chapters of the Data Center Design Guide will help provide guidance for those building new data centers. There are plans afoot in the Green Grid to develop IT recycling metrics, and work is starting to focus on the role data centers can play in a “Smart Grid.”

 

In all, quite a list of accomplishments. So, if you want to learn more about what the Green Grid has accomplished in the last five years, how its work has contributed value to its member base, or have an interest in shaping the next five years of this exciting industry, please plan to attend the upcoming Green Grid Forum in San Jose, CA, March 6-7, 2012.

I published a blog late last year on an idea bringing insight from the Green500 and Top500 together in a way that helps to better visualize the changing landscape of supercomputing leadership in the context of efficient performance. Since then I have started to refer to that analysis by the shorthand term “Exascalar.”

 

Recall, Exascalar is a logarithmic scale for supercomputing which looks at performance and efficiency normalized to Intel’s Exascale goal of delivering one Exaflops in a power envelope of 20 MegaWatts.

 

Of the emails I received on the topic, one of the most interesting was from Barry Rountree at LLNL. Barry has done a similar analysis looking at the time evolution of the Green500 data. So I thought, “why the heck not for Exascalar?”

 

And then I had some fun.

 

Building from Barry’s idea, I plotted the data for the Green500 and Top500 from November 2007 to November 2011 in one-year increments (with the addition of the June 2011 data for resolution) as an animated .gif file shown below. The dark grey line is the trend of the “Exascalar median.” To highlight innovation, in each successive graph the new points are shown in red while older systems are in blue. The unconventional-looking grid lines are constant-power and constant-Exascalar lines.

 

 

[Animated figure: Exascalar evolution from November 2007 to November 2011]

 

 

There’s a lot going on here. One notices some obvious “power pushes,” where occasionally a system pushes right up against the 20 MW line to achieve very high performance. Invariably these achievements are eclipsed by systems with higher efficiency.

 

Another thing that’s striking is the huge range of efficiencies between systems; over a factor of one hundred for some contemporary systems with similar performance. That’s pretty astounding when you think about it - a factor of one hundred in energy cost for the same work output.

 

But the macroscopic picture revealed, of course, is that the overall (and inevitable) trend shows the scaling of performance with efficiency.

 

So how is the trend to Exascale going? Well one way to understand that is to plot the data as a time series. The graph below shows the Exascalar ranking of Top, the Top 10, and the Median systems over time. Superimposed is the extrapolation of a linear fit which shows why such a huge breakthrough in efficiency is needed to meet the Exascale goal by 2018.

 

 

[Figure: Exascalar of the Top, Top 10, and Median systems over time, with a linear extrapolation]

 

It’s remarkable that the Top10 and Top Exascalar trends have essentially the same slopes (differing by about 7%), whereas the slope of the Median trend is about 20% lower.

 

But these simplified trends belie more complexity “under the covers.” To look at this I plotted the Top 10 Exascalar points from 2007 and 2011 and then superimposed trendlines from the data of intervening years. Whereas the trend line of the “Top” system has really trended mostly up in power while zigging and zagging in efficiency, the trend of the “Top10” (computed as an average) is initially mostly dependent on power, but then bends to follow an efficiency trend. Note that the data points are plotted with a finite opacity to give a sense of “density.” (Can you tell I’m a fan of “ET”?)

 

[Figure: Top and Top 10 Exascalar trendlines, 2007 to 2011]

 

This is another manifestation of the “inflection point” I wrote about in my last blog, where more innovation in efficiency will drive higher performance as time goes forward, whether in emerging Sandy Bridge, MIC, or other systems which have focused on high efficiency to achieve high performance. This analysis highlights what I think is the biggest trend in supercomputing, efficiency, while capturing the important and desired outcome, which is high performance. As my colleague and friend here at Intel, John Hengeveld, writes: “Work on efficiency is really work on efficient performance.”

 

What are your thoughts? Weather-report or an analysis that provides some insight?

 

Feel free to comment or contact me on @WinstonOnEnergy

One of my more popular blogs earlier this year was about “The Elephant in your Data Center" -- inefficient servers. As I explained, older, inefficient, under-performing servers rob energy and contribute very little to the “information-work” done by a data center.

 

Almost everyone already knows that, of course. The contribution of the blog was to take a potentially complex idea (relative server performance) and build a simple way to assess it.

 

The blog proposes a metric called SUE (Server Utilization Effectiveness). We built the idea based on practical experience, with lots of input from our Intel IT and DCSG experts. The notion was very similar to Emerson’s CUPS metric, with the added twist of normalizing so that SUE = 1.0 was ideal and larger numbers were worse (consistent with the way PUE is defined, for better or worse!). Mike Patterson and I discussed some of the benefits of the SUE approach in a recent Chip Chat podcast on data center and server efficiency with Allyson Klein.

 

The overarching message is that SUE complements PUE in the sense that PUE looks at the building infrastructure efficiency, and SUE looks at the IT equipment efficiency in the data center.

 

The proposal for SUE was primarily oriented around usability. We wanted a way to go into a data center and make an assessment quickly and at low cost. So, we focused on a simple age-based metric for relative performance. The simplification got a lot of comments, and one was, “what if I want more precision?” The good news is there are answers out there for you. I summarized the results of the discussion below:

 

 

[Figure: SUE maturity model]


 

I chatted with Jon Haas here at Intel about this problem. Jon leads the Green Grid’s Technical Committee where he and industry partners are collaborating to run experiments on more accurate Productivity Proxies for server work output. Of course, running a proxy on your server configuration is something that might take longer than a few days, and would occupy some precious engineering resources. But given the high operating and capital costs, the accuracy benefit in many cases will make solid business sense.

 

There are other ways to measure server and data center performance. A common way to estimate server performance and efficiency is to look up published benchmark scores. Depending on the server model, configuration, and workload type of interest, these table look-ups can be accurate without consuming a lot of time and resources.

 

And finally, many advanced internet companies instrument their applications directly to monitor performance. This represents the highest investment level, but produces the highest accuracy.

 

In all cases, the normalization of the actual server performance to the performance of state-of-the-art servers will produce numbers that can be correlated to SUE in the manner discussed in my previous blog and podcast.

 

The good news is that you can find out more about progress on the proxy front, and much more, at the upcoming Green Grid Forum in San Jose this coming March.

 

As always, I welcome your comments. The idea, as originally proposed, was closer to conceptual than realizable. Yet, taking into account a maturity model, I think it starts to have legs as something which can be standardized. What do you think?

I was recently quoted in an article by Randall Stross of the New York Times, as part of my role in the Green Grid, regarding how the conceptual “Data Furnace” might improve the energy efficiency of my vacation home in Central Oregon. In winter, my electric bills are quite high; I need to leave some electric heat running all the time to keep the pipes from freezing. When I arrive for a weekend of skiing, I turn up the electric heat until the pellet stove warms up. It costs me a small fortune.

 

How would a “data furnace” improve the efficiency of my home? Well, it wouldn’t in the sense that physics thinks about efficiency. But from an economic perspective, it could. Computers fundamentally turn electrical energy into heat. The difference is that computers provide a computational resource while doing so, which might be solving protein structures or even be billed on a compute trading scheme as a cloud resource. That’s energy that doesn’t need to be spent elsewhere, all while providing exactly the same heat to my home.

 

Now, although with wide variance, it’s generally estimated that about 2% of the world’s energy is spent on computing. I spend essentially all of my professional life making that energy use more efficient.

 

This morning I asked myself an interesting “out of the box” question: “What if the other 98% computed?” Of course it’s impractical to think of all that energy computing, but the scale of 50:1 gives you some pause. What if?

 

What about water heating? According to the US Department of Energy I can expect to spend about $300 per year on electrical energy for water heating (about 5000 kWh). This is more than enough energy to run two highly efficient servers at full load continuously for an entire year.
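A quick check of that comparison, assuming a highly efficient server draws on the order of 250 W at full load (my assumption, not a figure from the post):

```python
# Can ~5000 kWh/year of water-heating energy run two servers year-round?
WATER_HEATING_KWH = 5000      # figure quoted above
SERVER_WATTS = 250            # assumed full-load draw of a highly efficient server
HOURS_PER_YEAR = 8760

two_servers_kwh = 2 * SERVER_WATTS * HOURS_PER_YEAR / 1000
print(two_servers_kwh)        # 4380 kWh, comfortably under 5000 kWh
```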

 

Clothes dryers consume up to 12% of household electricity. What about the heater in your dishwasher? Your waterbed? Your aquarium? Your coffee maker?

 

It’s not too out of the blue to imagine that all of these resources could, in some not-too-distant future, provide useful computational work. While a detailed business model would present some unique challenges, it is certainly an intriguing idea to think that not only should all energy that computes be as efficient as possible (i.e., heat as little as possible), but indeed that all energy that heats should also compute as much as possible.

 

How would this solve my particular problem? Well, imagine if I could offset the cost of the electricity I use with a higher value-add business service. This can be seen in the picture below. When I need to generate heat, an intermediate service could auction that resource to a bidder. In the right circumstances, it could be a win-win: someone gets a low-cost compute resource, and I get help with my electricity bill.

 

[Figure: compute-regulated energy delivery]

 

What an interesting challenge! Think of the benefit to society that opportunity could deliver! How much faster could we decode everyone's genome? How much faster could we advance our understanding of fundamental matter and black holes? How much faster and more efficiently could we render movies? What about digesting ever-larger data sets?

 

So, what are the biggest challenges with this, and how would you solve them? Software architecture? Security? Reliability? Market models? Your comments are welcome.

The November 2011 Top500 supercomputer list released last week marked a milestone. Despite some notable new entrants, it was the first time in the history of the list that the top-performing ten didn’t change. Does this mean innovation in the "nose bleed" seats of the HPC arena has stopped? Hardly; it just means the focus might be shifting.

 

The Green500, published concurrently with the Top500, was more dynamic. In June 2011 the two most efficient supercomputers were BlueGene/Q systems. Now the top five are BlueGene/Q. Surprisingly, the top efficiency decreased slightly, from 2097 Mf/Watt to 2026 Mf/Watt. And while BlueGene continues to top the list, systems with GPU accelerators (from a range of manufacturers from nVidia to Intel) combined with CPUs from Intel and AMD made a strong showing.

 

What is surprising is that in the top ten of the Green500 and Top500 there is almost no overlap (just one system). So, with the race to the top of supercomputing increasingly about efficiency and performance leadership, what does leadership mean? Well, that judgment, of course, depends on the goal.

 

I recently proposed an approach to combine the Top500 and Green500 performance and efficiency scales into a single metric (which I will call “Exascalar” henceforth). The thinking behind it is straightforward: since both efficiency and performance are required as the industry pushes toward the next big goal, a good metric will balance the two.

 

So what happened this time? Well, before getting started, please note this is an informal ranking done by me without formal peer review. Any errors are my responsibility. Comments, inputs, and corrections are appreciated.

 

As a refresher, the graphical representation of the performance and efficiency data shows how they are combined to form the Exascalar. Exascalar is the negative logarithm of “how far away” a system is from meeting the Exascale goal of 1.0 Exaflops in 20 MWatts. Note the iso-power lines and iso-Exascalar curves in the graph. (One reason I like this approach is that, for a given efficiency, I can directly read off the expected performance limit in a given power envelope.)

 

[Figure: Exascalar plot of the November 2011 data, with iso-power and iso-Exascalar lines]

 

 

The details of the top ten based on Exascalar are shown in the list below. The top three computers on the list are unchanged from the last look: the RIKEN K Computer, with its muscular performance, and then two systems based on the Xeon 5670 with nVidia GPUs. The third-place system, the GSIC Center, Tokyo system based on the Xeon 5670 and nVidia GPUs, is notable since it is the only system on the list with both top-ten performance and top-ten efficiency.

 

Next on the list is the DOE/NNSA/LLNL BlueGene/Q system, which ranks fourth in Exascalar on the strength of its very high efficiency (it ranks fourth in efficiency as well). It’s a great example showing that efficiency and performance, and not just scale, count. Judging from the position of the BlueGene/Q systems on the graph above, there certainly appears to be more headroom in the future, with its current power about one-twentieth that of the number one system.

 

Below number seven is where I think the race gets most interesting. The Sunway system at the Chinese National Supercomputing Center in Jinan makes a very strong showing in the combined ranking. It is the first system on the list that is in neither the top ten of performance nor of efficiency; its strength is its balance of performance and efficiency.

 

Rounding out the list at number ten is a very strong showing by a Xeon E5 (Sandy Bridge-EP) system, again with a strong balance between high efficiency and performance. It's a remarkable achievement for a processor this new to make it to a top-ten spot, and I think it begins to show us what the future looks like.

 

 

[Table: top ten Exascalar ranking, November 2011]

 

Overall, six of the former Exascalar top ten remained on the list as compared to last spring. Although the top of the list didn’t change, the tenth system improved from 3.75 to 3.65, a significant improvement in performance and efficiency (recall Exascalar is logarithmic).

 

The most significant moves were by systems with very strong efficiency and those achieving that delicate balance between high efficiency and performance. Systems that pushed performance over efficiency moved relatively down in the ranking this time. This could be a trend that will continue to define the future of supercomputing, though only time will tell for certain.

 

Supercomputing is first and foremost about performance, but is also increasingly constrained by power. Looking at both performance and efficiency combined may give us better insight into how the race to Exascale is shaping up, and ultimately who will win.

 

In reviewing this with my friend and colleague Mike Patterson, he asked me a very interesting question, “what information is contained in the slope of the line to Exascale?” I have an idea, but am interested in your thoughts. What, if anything, does the slope of the line to Exascale tell us?

 

And of course, any additional thoughts, comments or insights are welcome. Is the focus shifting? Does this provide insight? What do you predict will happen in the future?

Computerworld has just released their Top Green IT Users and Vendors lists. I was of course proud that Intel made the list – we spend a lot of effort on efficiency and it’s nice to see it recognized. What I wanted to write about, though, is the success of Kaiser Permanente and their lessons learned.

 

Kaiser Permanente, in Oakland, CA, reduced their data center energy use by 6% through template use of the Organize, Modernize, and Optimize imperative. They Organized by ensuring their data center facilities team was part of IT, perhaps the single most powerful organizational mandate a CIO can make. They Modernized by virtualizing workloads on efficient IT equipment to maximize the effectiveness of their IT resources. And they Optimized by isolating cold and hot air in their data center and by running a fluid dynamics model of their data center continuously to look for opportunity.

 

The prescription is one almost any IT organization can follow.

 

Organize:

One of the biggest problems (still!) in the data center is getting started. Making just a simple change to organizational responsibility, so that data center owners are responsible for the site energy consumption, is probably the most impactful long-term change a CEO or CIO can make. Beyond that, measuring costs, power consumption, and data center productivity are about all you need to start making the right decisions.
 
Modernize:

The next step is to make sure the IT equipment in the data center is as efficient as possible. In many cases, this in and of itself can make a tremendous difference in energy consumption. For instance, Television Suisse Romande just reduced its number of servers by about 50% through consolidation.
 
Optimize:

The last big step is making the entire data center run as efficiently as possible. Why do this last? Well, you can’t really optimize it if you can’t measure it, so you need to get organized first. And if you have missed the opportunity to reduce the number of servers by 50%, why take 10% off your PUE when, in a few months, once you do replace those servers, you’ll just have to do the work again? For example, in the Datacenter2020 collaboration the results indicated cooling did not need to be as high as initially anticipated.

 

A very impressive job!

You can’t beat the energy generated by assembling some of the best minds in the industry, as was done at the recent Open Compute Project Summit in New York City. The venue was amazing, held on a rooftop with views of the Empire State Building, but the content was what impressed me the most.

 

Among the most important announcements from the Open Compute Project was the creation of an Industry Board of Directors.

The Industry Board of Directors consists of:

 

 

Along with talks by the board members were several interesting talks in the opening sessions. Here are just a few tidbits I found interesting:

 

Andy Bechtolsheim explained how open standards had always served to accelerate innovation. I liked it when he emphasized what Open Compute is not: it is not a standards group -- it is there to complement. It is not a customer panel -- it is individuals who will make the work happen. And it is not a marketing organization -- it will do things.

 

James Hamilton of Amazon Web Services talked about how data center TCO drives decision making. He mentioned that Amazon’s business success hinges on the efficiency of its data center and computing infrastructure. According to James, “any workload worth more than the marginal cost of power is worth running on a server.”

 

Jason Waxman, Director of High Density Computing at Intel, talked about Intel’s focus on high-efficiency motherboards, Open Rack standards, optimized high-temperature designs, microservers, scalable lightweight systems management, and vendor enabling and innovation. With the already announced alliance of the Open Compute Project and the Open Data Center Alliance, we can expect a strong linkage between usage-based demand and engineering-driven solutions.

 

Jimmy Pike, Dell DCS, emphasized his company’s commitment to remove accumulated overdesign across the infrastructure stream using standard form, fit, interfaces and technology elements. The idea of eliminating “gratuitous differentiation” to the benefit of customer value was very well received by the audience.

 

Open Compute has already achieved a lot, from 480VAC power distribution, to reducing fan power to 6W per server, to sharing architectures that eliminate extra transformers and power conversions, all for efficiency’s sake. With the industry now starting to pull together, we can certainly expect a lot more.
