Home > Intel Communities > Open Port IT Community > The Server Room > Blog > 2009 > July > 08

The need to write scalable applications has been important for programmers in the HPC community for years. With the proliferation of multi-core and many-core processors, developing scalable software is now a top priority for many programmers.

Andrew S. Tanenbaum stated at the USENIX ’08 conference last year that “sequential programming is really hard” and that the difficulty is that “parallel programming is a step beyond that.”

He is right, but let’s illustrate why it is just a small step.

Here is the point: parallel architectures will continue to proliferate, and we will need to develop and refine parallel algorithms that exploit that parallelism. While developing and refining parallel algorithms is difficult, the actual programming of these new algorithms does not need to be hard.  However, if the developer is required to know the intimate details of the hardware, then developing and refining parallel algorithms can be very difficult and very time consuming.

One approach provided by Intel software developer tools is to abstract away the details of the hardware.  This allows developers to focus on their algorithms and applications, and rely on Intel software developer tools to provide the best optimizations for current and future platforms.  While you may give up some performance by being abstracted away from the hardware, what you lose in performance is repaid by your ability to work through more iterations of your parallelization ideas in less time.  You may find yourself designing and developing better approaches to parallelism because you were able to test more hypotheses.

An additional by-product of being abstracted away from the intricacies of the hardware is that your software will be highly adaptable to future platforms.  You will see tremendous improvements on multi-core solutions and will be in a great position to scale your application performance forward as newer architectures are made available.

To learn more about Intel software tools and the benefits of optimizing your software on multi-core based solutions, first visit http://software.intel.com/en-us/intel-sdp-home/


Your most valuable employee is the one who creates tomorrow’s successes.  Providing these employees with tools that help them do that faster will help your organization create new products or optimize old ones more rapidly.  The benefit to the organization is more opportunities to win the customer’s attention through new products or responsiveness to their requests; the employee gets to brag about what he or she just helped bring to market.

Before we get too far, let’s look at Intel’s mission with respect to workstations.  We are laser-focused on supplying technology that provides users with an uncompromised experience in transforming their ideas into reality.  With that in mind, we look at how users create; we try to understand their obstacles and work with the ecosystem of hardware and software providers to deliver solutions to real problems that may be inhibiting their opportunity to innovate.

One technology that is helping users innovate faster is virtualization. 

No, we are not looking to remove the workstation from the user’s desk or share his or her workstation with peers who also need a workstation.  We are using virtualization to deliver the performance users need to innovate faster.

The Observation

We saw workstation users’ innovation slow as they juggled tasks, some of them not even theirs.  The involuntary tasks included deploying IT security patches, updates, and system backups, to name a few.  We also saw that users were no longer just doing Computer Aided Design (CAD); they were doing CAD, using productivity tools, meshing, surfing the web for supporting facts, collaborating via video and Instant Messaging (IM) tools, digital whiteboarding and trying to do analysis-driven design.  They were very busy people who could not afford any downtime or slow time.

In some cases we noticed that some users actually had not one, but two or more workstations running in completely different environments, many times with different OSs.

The Problem

What the above really led to is the conclusion that too many tasks were going after too few resources and that the experience we had hoped the user would encounter was not happening.  In fact, the reverse was happening: interactive creative tasks were slowing and system sluggishness was at an all-time high.  The “uncompromised experience in transforming their ideas into reality” we wanted for a workstation user was not there, and any innovation that was possible was slowed to a crawl.

A Potential Solution

Intel® Virtualization Technology for Directed I/O (Intel VT-d), once thought of as just a server technology, actually has a place in the workstation market.

This technology provides an important step toward enabling a significant set of emerging usage models in the workstation. VT-d support on Intel platforms provides the capability to ensure improved isolation of I/O resources for greater reliability, security, and availability.  That is a mouthful, so let’s see it in action.

There are two key requirements that are common across workstation usage models.

1.       The first requirement is protected access to I/O resources from a given virtual machine (VM), such that it cannot interfere with the operation of another VM on the same platform. This isolation between VMs is essential for achieving availability, reliability, and trust. This helps you get the performance you want from your workstation.

2.       The second major requirement is the ability to share I/O resources among multiple VMs. In many cases, it is not practical or cost-effective to replicate I/O resources (such as storage or network controllers) for each VM on a given platform.

In the case of the workstation, virtualization can be used to create a self-contained operating environment, or "virtual software appliance," that is dedicated to capabilities such as manageability or security. These capabilities generally need protected and secure access to a network device to communicate with down-the-wire management agents and to monitor network traffic for security threats. For example, a security agent within a VM requires protected access to the actual network controller hardware. This agent can then intelligently examine network traffic for malicious payloads or suspected intrusion attempts before the network packets are passed to the guest OS, where user applications might be affected. Workstations can also use this technique for management, security, content protection, and a wide variety of other dedicated services. The type of service deployed may dictate that various types of I/O resources (graphics, network, and storage devices) be isolated from the OS where the user's applications are running.

The Result

In collaborating with Parallels, a virtualization and automation leader, on its Parallels Workstation Extreme solution, we identified two impediments to workstation user productivity.  The first was the general resource overhead that afflicts a traditional virtualized workstation when there are insufficient resources to handle the overload of requests. The second was the more complex problem of a single workstation that needs to support multiple OSs and display visualization programs at near or full performance within virtual machines.

The first issue was more straightforward: create VMs, partition resources, and the user now has a very resilient workstation that is capable of delivering the intended experience.  IT can have their VMs, the user has his or her workstation back, and the concept of digital prototyping, creating and exploring a complete product before it is built, becomes a reality.  The creative innovator in the company can now iterate through more ideas in less time, and your company’s opportunities to catch the customer’s attention just went through the roof.

The second issue offered a more complex challenge.  We identified certain industries, such as the oil and gas exploration space, where users actually had two or more physical workstations: one running Windows, the other running Linux. Both workstations had visual display requirements by the end user, and both computers acted on the same reservoir data with applications that, while similar in many ways, were still different in their functionalities and purpose.  In oil drilling projects that typically involve millions of dollars in capital investment, the confirmation of expected end results is an asset that far outweighs the costs of a few workstations. Nevertheless, in today’s economic setting, the ability to get the same functionalities at a lower cost is one of many key drivers in helping companies achieve healthy bottom lines.

The Proof Point For Virtualization In A Workstation

Engineers from Schlumberger, a leading oil field service provider, run performance-demanding applications such as GeoFrame* and Petrel*.  These applications serve to analyze complex geological and geophysical data and determine the viability of potential reservoirs, or to optimize production at existing sites. With GeoFrame running on Linux* and Petrel on Microsoft Windows*, Schlumberger engineers have been using these applications on two separate physical workstations, driving IT spending higher, pushing down user productivity and increasing both power consumption and IT maintenance costs.

A New Paradigm For A New Day

With the availability of Intel Xeon processor 5500 series-based workstations, game-changing workstation virtualization software such as Parallels Workstation Extreme has opened up new horizons with breakthrough graphics performance with Intel’s latest processor technology. Parallels Workstation Extreme is built on top of the Parallels FastLane Architecture that effectively leverages the full potential of hardware resources such as graphics and networking cards to offer optimal workstation performance.

In comparison testing, Schlumberger compared the concurrent performance of applications running side by side on a virtualized Intel Xeon processor 5400 series-based workstation with the same setup on the newer Intel Xeon processor 5500 series-based machine. The results were astounding. The first machine, with the older processor and no Intel VT-d support, ran Petrel on the host OS at full native speed, but performance for GeoFrame in a VM slowed enormously. While Petrel refreshed its graphics at a rate of 30 frames per second, GeoFrame crawled along at a graphics refresh rate of just one frame every 19 seconds.

When the group tested the same applications on the newer Xeon 5500 series workstation with Intel VT-d support, the results were striking: both applications, Petrel running on the host OS and GeoFrame in a guest OS in a VM, ran at full native speed, and both were able to refresh graphics at nearly 30 frames per second. For GeoFrame in the VM, that is a 570 times improvement over the first workstation (30 frames per second versus one frame every 19 seconds).
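
The 570 times figure follows directly from the two GeoFrame refresh rates quoted above. A minimal sketch of the arithmetic in Python (the function name is only for illustration, and "near 30 frames per second" is rounded to exactly 30):

```python
def refresh_speedup(old_seconds_per_frame: float, new_frames_per_second: float) -> float:
    """Ratio of the new refresh rate to the old one.

    The old rate is given as seconds per frame, so the old frames-per-second
    value is 1 / old_seconds_per_frame. Dividing by that is the same as
    multiplying by old_seconds_per_frame.
    """
    return new_frames_per_second * old_seconds_per_frame


# GeoFrame in a VM: one frame every 19 seconds before, about 30 frames per second after.
geoframe_speedup = refresh_speedup(old_seconds_per_frame=19.0, new_frames_per_second=30.0)
print(f"GeoFrame refresh speedup: {geoframe_speedup:.0f}x")
```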

Russ Sagert, Schlumberger’s Geoscience Technical Advisor for North America said “our engineers were blown away by the performance. We hammered these machines with extreme workloads that stressed every aspect of the system. Amazingly, the new workstation based on the Intel Xeon processor 5500 series provided performance enabling this multiple OS, multiple application environment for the first time.”

The key element in Schlumberger’s new environment is Intel Xeon processor 5500 series-based workstations with Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d).  Together, these technologies enable direct assignment of graphics and network cards to virtual machines, enabling the machine to circumvent the interrupt and exit loop and clearing the previous performance problems.

Running in conjunction with Parallels Workstation Extreme, which effectively leverages Intel Virtualization Technology, including VT-d, the solution revolutionizes virtualization for high-end users. “High-performance virtualization on Intel Xeon processor 5500 series-based workstations is a game-changing capability,” says Sagert. “We can allocate multiple cores, up to 64 GB of memory and a dedicated graphics card to each machine. The results are spectacular.”

In the final analysis, moving to the Intel Xeon Processor 5500 series of next-generation workstations does far more than cut costs. It impacts the way that work gets done. If you have clients running the kind of resource-intensive, graphics-rich applications that traditionally slow to a crawl in a virtualized environment, consider the benefits of finally moving beyond the I/O barrier.

A fully configured Intel Xeon Processor 5500 series-based workstation running Parallels Workstation Extreme delivers the performance level that makes a virtualized workstation a leading contender for users with multi-workstation requirements. A streamlined work interface, reduced office noise and clutter, access to the same data repository and significant performance gains are the benefits on the user side. The IT organization also benefits from lower capital, management, support, provisioning, data protection, space, energy and cooling costs.

Moreover, the IT team can now standardize on a single OS image while addressing alternative requirements.

Learn More

Intel Workstation Processors http://www.intel.com/products/workstation/processors/index.htm

Parallels Workstation Extreme

http://www.parallels.com/products/extreme



I have been around the supercomputing market for over 25 years and have had an opportunity to see some interesting ideas come and go.  Let me share two that I experienced firsthand. 

·         CDC’s Cyber 205 versus the Cray 1S.  The Cray 1S and the CDC Cyber 205 both offered effective vector processing; however, code conversion between them may have required some significant algorithmic changes. Cray of course won the HPC race at that time.  Note that the Cyber 205 was a tremendous performer when you could keep its extremely long vector pipeline busy. However, one branch or gap in the vector processing pipeline would cause a flush of the vector unit, and whatever performance advantage you appeared to have over a Cray 1S was quickly erased.

·         An early accelerator came from Floating Point Systems.  In particular, the FPS 164 was an awesome “off load” system where the needs of a few users were satisfied with better throughput than the Cray X-MP and Y-MP of the day. Convex had a better idea.  It was better at serving the needs of more users than an FPS 164, and it was simpler to develop, maintain and scale software on it to next-generation systems.

So what are the lessons from history? Perhaps it is that there is a tight connection between applications, architectures and algorithms, and that it is extremely important to maintain a level of application flexibility and versatility in order to adopt new architectures as they become available in the market.  The old adage still remains true: software will outlive the useful life of hardware.  So it is important to be able to quickly adapt to new shifts.

The same questions probably still apply today as they did when Cray, CDC and FPS were around.

When does an accelerator computing strategy work best?

The easiest answer is if your application is extremely data parallel in nature, then it may be well suited for an accelerator strategy. The word extremely is the critical part. 

If your application only performs some level of data parallelism and includes task, thread and cluster level parallelism or contains a small fraction of branching or is host to irregular data sizes, then perhaps an accelerator may not be the best fit.

How much real performance will an accelerator strategy deliver? 

Often times we hear claims of 10X, 20X or even greater than 30X. 

These are great headlines, but as many have noted, you need to understand an accelerator’s impact on the total execution time of your application.  What may have been 10X to 30X or more on a kernel of the application may deliver a mere 2X to 3X, or even less, in terms of total application performance improvement.
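
To see how a big kernel number shrinks at the application level, here is a minimal sketch using Amdahl's law. The 60% kernel share is an assumption chosen purely for illustration, not a measurement from any real application, and the function name is my own:

```python
def overall_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-application speedup when only one part gets faster.

    kernel_fraction is the share of the original run time spent in the
    accelerated kernel (between 0 and 1); the rest runs at its old speed.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining_time = (1.0 - kernel_fraction) + kernel_fraction / kernel_speedup
    return 1.0 / remaining_time


# Assumed for illustration: the kernel accounts for 60% of the original run time.
for kernel_x in (10.0, 20.0, 30.0):
    print(f"{kernel_x:.0f}X kernel -> {overall_speedup(0.6, kernel_x):.2f}X application")
```

Under that assumption, every one of these kernel speedups lands a little above 2X for the whole application, because the 40% of the run time that was not accelerated quickly becomes the dominant cost.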

Of course the real question is what are we really comparing performance speed ups to?

I have seen well-tuned software on accelerators compared to “baseline” code running on one core of an old processor.  However, when you use available software technology, turn compiler flags on and add in a math kernel library call, the performance on multi-core solutions can jump by over 10X, and in some cases by more than 30X, for total execution time.  This standards-based accelerated software will scale forward as newer microarchitectures are made available from Intel.

Why is the difference between the promise and the actual performance so great? 

Always a good question :)

The promise deals with a small part or a kernel of the software that is data parallel and can potentially scale linearly as more compute resources are added.  Again if the application is extremely data parallel, then an accelerator strategy may be the correct approach.

However, when the actual performance result, or total application performance, is significantly different it is often because of several things. 

·         One common reason is that you may be comparing unoptimized software on multi-core systems to optimized software on an accelerator.  When I compare similarly optimized software on a multi-core system, I see that the 20X to 30X difference often fades to less than 2X, and in most cases the multi-core system does better than the hardware accelerator.  This is because optimized software on a multi-core solution accelerates all components of the application.

·         Another situation is the bandwidth imbalance at the attach points of the accelerators. Typically the attach speeds do not match the memory bandwidth or the ALU speed on the accelerators, so the theoretical peak flops are tough to achieve.  Sometimes, for larger workloads, performance deteriorates because of the limited amount of memory on the accelerator card.

·         Another situation may be that your application depends on different forms of parallelism, including task, thread or cluster level parallelism, and in some cases even on sequential sections of your software.
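
The attach-point effect in the second item can be sketched with a very simple model. All of the numbers below are hypothetical, and the model assumes data transfers are not overlapped with computation, which real systems may partly hide:

```python
def effective_kernel_speedup(
    host_kernel_seconds: float,
    accelerator_speedup: float,
    bytes_moved: float,
    link_bytes_per_second: float,
) -> float:
    """Kernel speedup once the time to move data over the attach link is counted.

    Simplifying assumption: the transfer and the computation happen one
    after the other, with no overlap.
    """
    compute_seconds = host_kernel_seconds / accelerator_speedup
    transfer_seconds = bytes_moved / link_bytes_per_second
    return host_kernel_seconds / (compute_seconds + transfer_seconds)


# Hypothetical case: a 10 second kernel that runs 20X faster on the accelerator,
# but needs 8 GB moved over a link that sustains 4 GB per second.
speedup = effective_kernel_speedup(
    host_kernel_seconds=10.0,
    accelerator_speedup=20.0,
    bytes_moved=8e9,
    link_bytes_per_second=4e9,
)
print(f"Effective kernel speedup: {speedup:.1f}X")
```

In this made-up case the transfer takes four times longer than the accelerated computation itself, so most of the headline kernel speedup is spent waiting on the link.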

So back to the differences in performance between the Cray 1 and CDC Cyber 205.

While the Cyber 205 was great at the edges of science, the Cray proved to be the workhorse of high performance computing.  It offered better system balance than the Cyber 205.  Here is an example: if you take great care to optimize your software for a particular architecture, you will no doubt see tremendous performance gains.  However, as with the Cyber 205, if you break that pipeline you need to pay the overhead to restart the long vector pipeline.  Oftentimes, even with today’s accelerators, that startup cost reduces what appear to be stellar performance gains, as it did for the Cyber 205, to being no better than, or sometimes even slower than, the Cray 1.  There were of course examples with the Cyber 205, as there are today with accelerators, where select sciences can see tremendous advantages over traditional computing solutions.

What other considerations may weigh in your decision to adopt an accelerator strategy?

Are you constantly refining your software?
Many researchers would probably answer yes.  They are constantly refining their software to improve the results, the performance, or both.

As I mentioned at the beginning of the blog, the old adage still remains true, software will outlive the useful life of hardware.  So it is important to be able to quickly adapt to new shifts.  One way to simplify these moves is to use standards based tools which can give you the flexibility to create applications that can use the multiple types of parallelism mentioned above via tools, compilers, and libraries.  You may also want to use standards based tools to acquire the versatility you need in order to scale your software across multiple architectures – e.g. large, many and heterogeneous cores. 

The caveat with respect to using non standard tools is that you become locked into a specific architecture.  If that architecture from the same vendor would happen to change, you may be required to make some significant changes (e.g. tuning to grain sizes).

Do you want to maintain, support and update multiple code bases?
I don’t.  I want to invest in the development of parallel algorithms.  The old adage that software will far outlive any hardware implementation still applies, and I need the flexibility and versatility to adopt new architectures quickly and as painlessly as possible as they are made available.  I do not want to invest in maintaining, supporting and updating an ever increasing set of code streams as newer architectures are made available.

Our team goal at Intel is to develop software tools and hardware technology that can help you scale forward your application performance to future platforms without requiring a massive rebuild: just drop in a new runtime that is optimized for the new platform to experience the improvement.  This is akin to the printer or display driver model: buy a new printer or display, install the respective driver, and your system enjoys the improved benefits.  That is the goal.

If you want to learn more about what we are doing to deliver high performing HPC solutions that are both flexible and versatile please visit www.intel.com/go/hpc


Yes.

 

I had the recent opportunity to work on this case study published jointly by Intel, Dell and Motion Computing that reviewed how information technology investment by Correctional Health Services Corporation in Puerto Rico drove a transformation of their health services in their prison system.

 

There are tons of case studies out in the market and on the web, but to me this one stood out for its dramatic impacts: improved efficiency of employees and workers at the prison, improved health care for inmates, the ability to meet minimum documentation standards, and lower costs to manage the IT infrastructure.

 

If you read one case study this year, this one is recommended.  Definitely a feel-good story all around. http://www.intel.com/references/pdfs/Correctional_Health_casestudy_LRs.pdf

 

Chris

http://www.twitter.com/chris_p_intel
