
The Server Room Blog

5 Posts tagged with the scalability tag

I have been around the supercomputing market for over 25 years and have had an opportunity to see some interesting ideas come and go.  Let me share two that I experienced firsthand. 

·         CDC’s Cyber 205 or a Cray 1S.  The Cray 1S and the CDC Cyber 205 both offered effective vector processing; however, code conversion between them may have required some significant algorithmic changes. Cray of course won the HPC race at that time.  Note, the Cyber 205 was a tremendous performer when you could keep its extremely long vector pipeline busy. However, one branch or gap in the vector processing pipeline would cause a flush of the vector unit, and whatever performance advantage you appeared to have vs. a Cray 1S was quickly erased.

·         An early-day accelerator was Floating Point Systems.  In particular the FPS 164 was an awesome “off load” system where the needs of a few users were satisfied with better throughput than the Cray X-MP and Y-MP of the day. Convex had a better idea.  It was better at serving the needs of more users than an FPS 164, and it was simpler to develop, maintain and scale software to next generation systems.

So what are the lessons from history? Perhaps it is that there is a tight connection between applications, architectures and algorithms, and that it is extremely important to maintain a level of application flexibility and versatility in order to adopt new architectures as they become available in the market.  The old adage still remains true: software will outlive the useful life of hardware.  So it is important to be able to quickly adapt to new shifts.

The same questions probably still apply today as they did when Cray, CDC and FPS were around.

When does an accelerator computing strategy work best?

The easiest answer is if your application is extremely data parallel in nature, then it may be well suited for an accelerator strategy. The word extremely is the critical part. 

If your application only performs some level of data parallelism and includes task, thread and cluster level parallelism or contains a small fraction of branching or is host to irregular data sizes, then perhaps an accelerator may not be the best fit.

How much real performance will an accelerator strategy deliver? 

Often times we hear claims of 10X, 20X or even greater than 30X. 

These are great headlines, but as many have noted, you need to understand an accelerator's impact on the total execution time of your application.  What may have been 10X to 30X or more on a kernel of the application may only deliver a mere 2X to 3X or even less in terms of total application performance improvement.
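The gap between kernel speedup and whole-application speedup can be sketched with Amdahl's law. The snippet below is only an illustration: the function name and the assumption that the accelerated kernel accounts for 60% of the original run time are hypothetical, not figures from any measured application.

```python
def overall_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-application speedup when only the kernel is accelerated.

    kernel_fraction: share of the original run time spent in the kernel (0..1).
    kernel_speedup:  how many times faster the kernel runs on the accelerator.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = (1.0 - kernel_fraction) + kernel_fraction / kernel_speedup
    return 1.0 / remaining


# Hypothetical application: 60% of the run time is in the accelerated kernel.
for claimed in (10, 20, 30):
    total = overall_speedup(0.6, claimed)
    print(f"kernel {claimed}X -> application {total:.2f}X")
```

Under that assumed 60% share, even a 30X kernel leaves the application a little over 2X faster, because the untouched 40% of the run time now dominates.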

Of course the real question is what are we really comparing performance speed ups to?

I have seen well-tuned software on accelerators compared to “baseline” code running on one core of an old processor.  However, when you use available software technology, turn compiler flags on, and add in a math kernel library call, the performance on multi-core solutions can jump by over 10X and in some cases can exceed 30X for total execution time.  This standards-based accelerated software will scale forward as newer microarchitectures are made available from Intel.

Why is the difference between the promise and the actual performance so great? 

Always a good question :)

The promise deals with a small part or a kernel of the software that is data parallel and can potentially scale linearly as more compute resources are added.  Again if the application is extremely data parallel, then an accelerator strategy may be the correct approach.

However, when the actual performance result, or total application performance, is significantly different it is often because of several things. 

·         One common reason is that the headline number may come from comparing unoptimized software on a multi-core system to optimized software on an accelerator.  When I compare similarly optimized software on a multi-core system, I see that the 20X to 30X difference often fades to less than 2X, and in most cases the multi-core system performs better than the hardware accelerator.  This is because optimized software on a multi-core solution accelerates all components of the application.

·         Another situation is a bandwidth imbalance at the attach points of the accelerators. Typically the attach speeds do not match the memory bandwidth or the ALU speed on the accelerators, so the theoretical peak flops are tough to achieve.  Sometimes, for larger workloads, performance also deteriorates because of the limited amount of memory on the accelerator card.

·         Another situation may be that your application depends on different forms of parallelism, which include task, thread or cluster level parallelism, and even in some cases sequential portions of your software.
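The attach-point imbalance in the list above can be illustrated with a rough cost model: the time on the accelerator is the accelerated compute time plus the time to move data across the attach link. This is a simplified sketch that assumes transfers and compute do not overlap; the function name and all numbers are hypothetical.

```python
def offload_speedup(host_seconds: float,
                    kernel_speedup: float,
                    bytes_moved: float,
                    link_bytes_per_second: float) -> float:
    """Speedup of an offloaded kernel once data movement is counted.

    Assumes data transfer and accelerator compute happen one after the other.
    """
    compute_seconds = host_seconds / kernel_speedup
    transfer_seconds = bytes_moved / link_bytes_per_second
    return host_seconds / (compute_seconds + transfer_seconds)


# Hypothetical kernel: 1 second on the host, 20X faster on the accelerator,
# but 2 GB must cross a link that sustains 4 GB/s.
print(f"{offload_speedup(1.0, 20.0, 2e9, 4e9):.2f}X")
```

With these assumed figures the half second spent on the link outweighs the 0.05 seconds of accelerated compute, so the 20X kernel delivers less than 2X once the transfer is included.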

So back to the differences in performance between the Cray 1 and CDC Cyber 205.

While the Cyber 205 was great at the edges of science, the Cray proved to be the workhorse of high performance computing.  It offered better system balance than the Cyber 205.  Here is an example: if you take great care to optimize your software for a particular architecture, you will no doubt see tremendous performance gains.  However, as with the Cyber 205, if you break that pipeline you need to pay the overhead of restarting the long vector pipeline.  Often that startup cost reduced what appeared to be stellar performance gains on the Cyber 205 to being no better than, or sometimes even slower than, the Cray 1, and the same can happen with today’s accelerators.  There were of course examples with the Cyber 205, as there are today with accelerators, where select sciences see tremendous advantages over traditional computing solutions.
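The startup-cost argument can be sketched with a simple pipeline model in which the achieved rate is `peak * n / (n + n_half)`, where `n` is the vector length and `n_half` is the length at which half of peak is reached (a characterization often attributed to Hockney). The two machines below are hypothetical and are not measurements of the Cyber 205 or the Cray 1; they only contrast a long pipeline with a high peak against a shorter, better balanced one.

```python
def vector_rate(n: int, peak_rate: float, n_half: float) -> float:
    """Achieved rate for vectors of length n under a startup-cost model.

    peak_rate: asymptotic rate for very long vectors (arbitrary units).
    n_half:    vector length at which half of peak_rate is achieved;
               a larger value means a more expensive pipeline startup.
    """
    return peak_rate * n / (n + n_half)


# Hypothetical machines (illustrative numbers only).
LONG_PIPE = {"peak_rate": 200.0, "n_half": 100.0}   # high peak, costly startup
BALANCED = {"peak_rate": 80.0, "n_half": 10.0}      # lower peak, cheap startup

for n in (10, 50, 1000):
    a = vector_rate(n, **LONG_PIPE)
    b = vector_rate(n, **BALANCED)
    print(f"n={n:5d}  long pipeline={a:6.1f}  balanced={b:6.1f}")
```

In this sketch the long-pipeline machine only pulls ahead once vectors exceed about 50 elements; every branch or gap that shortens the effective vector length pushes it back toward, or below, the balanced machine.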

What other considerations may weigh in your decision to adopt an accelerator strategy?

Are you constantly refining your software?
Many researchers would probably answer yes.  They are constantly refining their software to improve the results, the performance, or both.

As I mentioned at the beginning of the blog, the old adage still remains true: software will outlive the useful life of hardware.  So it is important to be able to quickly adapt to new shifts.  One way to simplify these moves is to use standards-based tools, compilers, and libraries, which can give you the flexibility to create applications that use the multiple types of parallelism mentioned above.  Standards-based tools also give you the versatility you need to scale your software across multiple architectures, e.g. large, many and heterogeneous cores.

The caveat with respect to using non-standard tools is that you become locked into a specific architecture.  If that architecture changes, even when the new one comes from the same vendor, you may be required to make some significant changes (e.g. tuning to grain sizes).

Do you want to maintain, support and update multiple code bases?
I don’t.  I want to invest in the development of parallel algorithms.  The old adage that software will far outlive any hardware implementation still applies, and I need the flexibility and versatility to adopt new architectures as quickly and painlessly as possible as they are made available.  I do not want to invest in maintaining, supporting and updating an ever-increasing set of code streams as newer architectures are made available.

Our team goal at Intel is to develop software tools and hardware technology that can help you scale forward your application performance to future platforms without requiring a massive rebuild: just drop in a new runtime that is optimized for the new platform to experience the improvement. This is akin to the printer/display driver model, where you buy a new printer or display, install the respective driver, and your system enjoys the improved benefits.  That is the goal.

If you want to learn more about what we are doing to deliver high performing HPC solutions that are both flexible and versatile please visit www.intel.com/go/hpc

1 Comments Permalink

Today Intel provided a server product update for the upcoming Nehalem-EX processor and the expandable platforms based on it.  Here’s a recap of some of the interesting messages communicated to the press:

 

  • Nehalem Architecture and QuickPath Architecture are coming to the EX (MP) segment, 4 Socket Servers and above.
  • EX Servers are ideal for server consolidation / virtualized applications, data demanding enterprise applications and technical computing environments.  Both Itanium and Xeon processor-based systems represent an attractive alternative to more expensive, proprietary RISC processor-based systems.
  • EX Servers are designed for the high end.  Compared with 2 Socket Servers, they offer more of the capabilities (e.g. memory, RAS, cores/threads, sockets) that IT managers require for business drivers such as large scale server consolidation, high data demands, virtualization, and scalability.
  • Up to eight cores / 16 threads and a whopping 24MB of cache.
  • Up to 9x the memory bandwidth vs. today’s 4-Socket Xeon 7400.  The performance will be dramatic – the highest-ever jump from a previous generation processor. 
  • 2x the memory capacity with up to 16 memory slots per socket (that’s 64 DIMMs on a 4 Socket Server), and four high-bandwidth QuickPath Interconnect links.
  • New levels of scalability: from large memory 2 socket systems through 8 socket systems, and even more with OEM node controllers.  As a matter of fact, there are currently over 15 8-Socket+ designs from 8 OEMs.
  • IBM showed their 8S Nehalem-EX server design running 128 threads (8 sockets x 8 cores x 2 threads with Hyper-Threading)…an industry first.
  • New RAS features traditionally found on Itanium, such as Machine Check Architecture (MCA) Recovery which detects CPU, memory, and I/O errors, works with the OS to correct, and helps recover from otherwise fatal system errors. 
  • Nehalem-EX is scheduled for production in the second half of 2009, with OEM systems in early 2010.

 


Stay tuned over the next few days – we’ll post a video from the event.  Also look for some informative blogs over the next 1-2 weeks that will offer more of an in depth view of Nehalem-EX’s 4 Socket capabilities, performance, scalability, RAS, and Virtualization.

3 Comments Permalink

With 2S Xeon processors delivering outstanding energy efficient performance, I get many questions from customers on “what about the Dunnington-based Servers”, “would it make sense for me to continue to use 4S servers in my IT/Enterprise”, “am I making the right choices with Dunnington” and so on.

The answer is yes, even considering our 2S Nehalem-EP based servers expected in Q1 2009.

Customers make choices based on their business requirements – whether the new platform would be able to meet their IT needs. In making such a decision in today’s economic context, there is a temptation to use a 2S server because it costs less. However, the choice of server platform should not be dictated by price alone; there are other, more important considerations to look into before deciding on the right choice.

One of our customers in banking had a very interesting problem – the system had to be capable of delivering a user-acceptable response time while maintaining headroom for growth. The customer was also keen to use virtualization to run his workloads. With these key parameters (among others) in mind, we sized both 2S and 4S servers for the customer. It turned out that both the 2S and 4S servers were able to meet the response time, but only the 4S server could deliver the “headroom for growth”. The headroom was based on the bank’s projected business growth over the next 12-24 months, their existing datacenter facility (space, power, cooling) and the opportunity to further consolidate applications in a virtualized environment. The bank could easily have settled on multiple 2S servers running in a virtual pool, but when you factor in “headroom for growth” considerations, the 4S+ Xeon 7400 servers still deliver the performance scalability and expandability that is needed today and tomorrow.

The bank finally settled on ten 4S servers and ten 8S servers. These servers had the unique capability to scale (4S to 8S to 16S) within the same OS footprint (while keeping costs under control) and also deliver the performance headroom for the bank’s future needs.

In doing the above exercise for the bank, simple guidelines emerged – a) select the right architecture that scales in the future, b) look at established OS/App technologies such as virtualization to consolidate the environment, c) make decisions based on “your workload” running on the new servers and treat published benchmarks as indicative only.

Let me know what you think – share your views.

2 Comments Permalink

 

One of the common questions that I get from customers is whether their applications will be able to take advantage of so many cores in their server. And it's not just about running the application without changes, but also about being able to scale in performance. I would like to address this concern in three parts:

  • What do many cores on a server bring to me? Applications that run on x86 servers today have been written to take advantage of multiple processors (SMP) since the 90s. These applications have been threaded over time and fine-tuned to deliver optimum performance. When such applications are run on a many-core platform (such as the 4-way Intel® Xeon® 7400 processor-based server), they show an immediate performance gain. Database applications such as Oracle 10g, IBM DB2, Microsoft SQL Server 2005, and Microsoft SQL Server 2008 have shown significant performance gains when run on a 6-core Xeon 7400 processor-based server. As an example, check out the TPC-C performance of the 4P platform. We have seen SQL Server 2005 performance gains of up to 68% compared to the previous generation processor, the highest 4P database performance on a Windows Server platform today. We have also seen the highest DB2 database performance on Linux. Similarly, using the Oracle OASB benchmark with Oracle 10g R2 DB and Oracle E-Business Suite v12, the IBM x3850M2 delivered unparalleled processing of a 10,000 employee payroll batch update in 5.37 seconds (wall clock duration). So clearly, the benefits of using multi-core processors such as the 6-core Xeon 7400 processor are immense. It's performance and more performance all the way for enterprise workloads.

  • Do I have to get the latest version of my application to get optimum performance? In most cases for enterprise workloads, the application gets a performance boost as more cores are added to the system. However, the performance may not be optimum. As newer tools emerge to take advantage of Intel® Core® Architecture in a parallel environment, the ISVs may make changes to their software to give their applications an additional performance boost. Tools such as compilers and libraries from Intel, Microsoft, Oracle, Sun and others are constantly updated to provide optimum scaling and performance on Intel Architecture based servers.

  • Does my application cost more on a 6-core Xeon 7400 processor-based server? In most cases, NO. You need to make an assessment of the software licensing model currently used on your servers and then decide if the price/performance is worth moving to the Xeon 7400 processor-based platform. From what we have seen over the past many years, the performance of a multi-core platform far outweighs the price of the platform. In short, YOU PAY LESS TO GET MORE. Today multi-core processing has become the norm for enterprise applications. ISVs are constantly evaluating their application licensing models to run in an SMP multi-core environment. ISVs such as Oracle, Microsoft, SAP, and VMware have made their applications multi-core friendly, giving you more for less.

 

 

 

Another area that you MUST consider is Virtualization or Server Consolidation, where multi-core servers have been known to provide the optimum use of compute resources in your environment. You can read about Virtualization benefits in blogs from Sudip Chahal, dave_hill, RK_Hiremane, and K_Lloyd.

 

 

 

 

To summarize, enterprise applications running on the 6-core Intel Xeon 7400 processor-based servers will see performance scaling as the number of cores increases. And it WILL get better over time. I hope you have enjoyed reading this. Let me know what you think.

 

 

0 Comments Permalink

Day 1: I'm live from VMworld this week experiencing the virtualization event of the year. I'll be updating this blog with happenings from the Intel booth and around the show floor. Expect some really cool video interviews with Intel partners who are making a big impact in the virtualization world and giving IT managers real advantages over previous generation solutions. Here's a video showing the Xeon 7400's near perfect scalability from 8 to 24 to 48 cores. Wow, 48 cores, that's cool!

 

 

 

 

If you liked the first video, check out this one where Jon Markee talks about Intel Virtualization Technology (VT) and how FlexPriority improves performance and reduces boot time in your virtualized environment.

 

 

 

 

 

Day 2: Here's another video from the Intel Booth showing more examples of Intel Virtualization technology.

 

 

0 Comments Permalink
