
The Server Room Blog

5 Posts tagged with the performance_benchmark tag

The need to write scalable applications has been important to programmers in the HPC community for years. Now, with the proliferation of multi-core and many-core processors, developing scalable software is a top priority for many programmers.

Andrew S. Tanenbaum stated at the USENIX ’08 conference last year that “sequential programming is really hard” … and that the difficulty is “parallel programming is a step beyond that.”

He is right, but let’s illustrate why it is just a small step.

Here is the point: parallel architectures will continue proliferating, and we will need to develop and refine parallel algorithms that exploit them. While developing and refining parallel algorithms is difficult, the actual programming of these new algorithms does not need to be hard. However, if the developer is required to know the intimate details of the hardware, then developing and refining parallel algorithms can be very difficult and very time consuming.

One approach provided by Intel software developer tools is to abstract away the details of the hardware. This allows developers to focus on their algorithms and applications, and rely on the tools to provide the best optimizations for current and future platforms. While you may give up some performance by being abstracted away from the hardware, that loss is offset by your ability to iterate through more of your parallelization ideas in less time. You may find yourself designing and developing better approaches to parallelism because you were able to test more hypotheses.
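As a rough illustration of the idea (in plain Python, not with the Intel tools themselves), the sketch below expresses a computation as independent tasks and lets the library decide how to schedule them on whatever machine it runs on. The names `work` and `run_parallel` are made up for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def work(n):
    """Stand-in for one independent unit of an algorithm (hypothetical)."""
    return sum(i * i for i in range(n))


def run_parallel(sizes, max_workers=None):
    """Map `work` over `sizes` without hard-coding any hardware details.

    With max_workers=None the executor picks a pool size based on the
    machine it runs on, so the same code adapts to different core counts.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `sizes`.
        return list(pool.map(work, sizes))


if __name__ == "__main__":
    print("logical CPUs reported by the OS:", os.cpu_count())
    print(run_parallel([10, 100, 1000]))
```

Note that in CPython, threads do not speed up CPU-bound pure-Python code because of the global interpreter lock. `ProcessPoolExecutor` offers the same `map` interface, which is the point of the abstraction: the scheduling strategy can change without rewriting the algorithm.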

An additional by-product of not having to know the intricacies of the hardware is that your software will be highly adaptable to future platforms. You will see tremendous improvements on multi-core solutions and will be in a great position to scale your application performance forward as newer architectures become available.

To learn more about Intel Software Tools and the benefits of optimizing your software on multi-core based solutions, first visit http://software.intel.com/en-us/intel-sdp-home/


This blog post is meant to discuss some of the considerations for performance tuning your Intel® Xeon® Processor 5500 (“Nehalem-EP”) series based server. I’d like to do this by discussing the un-boxing process.

Step 1. Place the box on the floor

Step 2. Open the box

Step 3. Carefully remove the server.

Step 4. Plug the server into a keyboard, mouse, and monitor.

Step 5. Plug the server into the wall socket.

Step 6. Power on the server.

There. You are done tuning your Nehalem-EP based server for performance. “Really?” you ask. Well, mostly. There are some considerations, and I’ll discuss them. I can speak to this subject as I was asked to tune this class of system using the TPC-C and TPC-E benchmarks.

BIOS / Firmware / Drivers

It is very important to remember to update your system's BIOS, firmware, and OS drivers before you do any deep performance tuning. I cannot overstate the importance of this step. Your system's manufacturer should be able to provide the latest BIOS and firmware associated with your server. OS drivers are available through many sources these days. Typically these can be downloaded from OS vendors, hardware vendors, the Linux open source community, or the platform's manufacturer.

A good example of this is the SATA driver associated with the ICH10. The ICH10 is part of the chipset that supports Nehalem-EP. I recommend going to Intel’s website and using the Intel Matrix Storage Manager driver for the SATA controller.

Understand your system

Last year, Nehalem launched for the desktop market segment. Now it is time for the server market. The Nehalem-EP processor is meant to be used in dual-processor (DP) socket systems. Nehalem-EP is the follow-on to the Intel Xeon Processor 5400 (“Harpertown”) series. However, Nehalem-EP is really very different from Harpertown. It is based on the Intel Core i7 and inherits the same architectural features. Once you understand these features, you can better tune your system for performance.

L3 Cache:

Nehalem-EP uses a level 3 cache. Depending on which SKU you are using it can be 4MB or 8MB in size. If you are interested in performance, then I would encourage you to pick the larger cache size SKU.

Hyper Threading Technology:

If some threads are good, then more threads are better. This is where Hyper-Threading Technology comes into play. Nehalem-EP provides this technology out of the box. So, on a typical DP server with quad-core processors, this will give your system 16 threads of processing goodness.
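The 16-thread figure is simple multiplication. Here is a small sketch of the arithmetic, assuming quad-core processors with two hardware threads per core (the function name is just for illustration):

```python
def logical_threads(sockets, cores_per_socket, threads_per_core):
    """Number of logical processors the OS will see."""
    return sockets * cores_per_socket * threads_per_core


# Dual-socket server, quad-core parts (assumed), Hyper-Threading on.
print(logical_threads(sockets=2, cores_per_socket=4, threads_per_core=2))

# Same server with Hyper-Threading disabled in the BIOS.
print(logical_threads(sockets=2, cores_per_socket=4, threads_per_core=1))
```

The first call gives the 16 threads mentioned above; the second shows what is left if the feature is turned off.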

Intel Quick Path Interconnect:

Nehalem-EP supports a CPU interconnect known as Intel Quick Path Interconnect (QPI). This interconnect is the replacement for the Front Side Bus of old. QPI provides a point-to-point link to each of the processors and the Intel X58 chipset. The Nehalem-EP supports QPI speeds of up to 6.4 GT/s. This provides a theoretical bandwidth of 25.6 GB/s per link, counting both directions. This is a welcome shift for Intel’s designs for the future.
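For readers wondering where 25.6 GB/s comes from, here is a back-of-the-envelope sketch. It assumes each transfer carries 2 bytes of data per direction and that the link runs in both directions at once; those two parameters are my assumptions about the link, not figures stated in this post.

```python
def qpi_bandwidth_gb_s(transfers_gt_s, bytes_per_transfer=2, directions=2):
    """Theoretical QPI link bandwidth in GB/s.

    bytes_per_transfer and directions are assumed link parameters.
    """
    return transfers_gt_s * bytes_per_transfer * directions


one_way = qpi_bandwidth_gb_s(6.4, directions=1)
both_ways = qpi_bandwidth_gb_s(6.4)
print(f"{one_way:.1f} GB/s per direction, {both_ways:.1f} GB/s per link")
```

Keep in mind this is a theoretical peak; protocol overhead means real traffic will see less.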

Turbo Boost Technology:

As with the desktop SKUs of Nehalem, the Nehalem-EP supports Turbo Boost Technology. This technology will run the CPU at a higher frequency than its rating. It will increase the frequency in steps of 133 MHz until it reaches its upper thermal and electrical design limits. Turbo Boost Technology is dynamic. In other words, the processor will decrease its core frequency again if the temperature is too high. If your application is sensitive to core frequency and does not fully utilize all cores, then it may benefit from this technology.
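To make the 133 MHz steps concrete, here is a tiny sketch. The base frequency and the number of steps below are hypothetical values chosen for illustration; how many steps a given SKU can actually take depends on the part and on how many cores are active.

```python
STEP_MHZ = 133  # step size quoted above


def turbo_frequency_mhz(base_mhz, steps):
    """Core frequency after `steps` Turbo Boost increments."""
    return base_mhz + steps * STEP_MHZ


base = 2933  # hypothetical rated frequency in MHz
for steps in range(3):
    print(steps, "step(s):", turbo_frequency_mhz(base, steps), "MHz")
```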

Integrated Memory Controller:

Another key feature of Nehalem-based processors is that the memory controller is integrated into the processor. This allows for much lower memory latencies. The Nehalem-EP supports three channels of DDR3 memory at 800, 1067, and 1333 MT/s, so it is important to talk about DIMM population on Nehalem-EP based servers.

The supported speed depends on how many DIMMs are populated per channel. For instance, 1333 MT/s is supported only with a single DIMM per channel. 1067 MT/s is supported with one or two DIMMs per channel. 800 MT/s is supported in all configurations. These speeds are based on dual-ranked DIMMs. If you plan on filling all the memory slots with as many DIMMs as possible, you will end up running at 800 MT/s.

So, here is the consideration. Does your application need all that memory, or could it use less memory running at a higher speed? If it could, then perhaps running two DIMMs per channel at 1067 MT/s is the best configuration.
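The population rules can be written down as a small lookup, which also makes the capacity-versus-speed trade-off easy to see. This is only a sketch: it assumes a board with up to three DIMM slots per channel and a 64-bit (8-byte) data path per channel, and the function names are mine.

```python
# DIMMs per channel -> highest supported speed in MT/s (dual-ranked DIMMs),
# following the rules described above. Three slots per channel is assumed.
MAX_SPEED_MT_S = {1: 1333, 2: 1067, 3: 800}

CHANNELS_PER_SOCKET = 3
BYTES_PER_TRANSFER = 8  # assumed 64-bit DDR3 channel


def max_speed_mt_s(dimms_per_channel):
    """Highest memory speed supported for a given population."""
    if dimms_per_channel not in MAX_SPEED_MT_S:
        raise ValueError("expected 1, 2, or 3 DIMMs per channel")
    return MAX_SPEED_MT_S[dimms_per_channel]


def peak_bandwidth_gb_s(dimms_per_channel):
    """Theoretical peak memory bandwidth per socket in GB/s."""
    speed = max_speed_mt_s(dimms_per_channel)
    return speed * BYTES_PER_TRANSFER * CHANNELS_PER_SOCKET / 1000


for dpc in (1, 2, 3):
    print(dpc, "DIMM(s) per channel:",
          max_speed_mt_s(dpc), "MT/s,",
          round(peak_bandwidth_gb_s(dpc), 1), "GB/s peak per socket")
```

These are theoretical peaks. The numbers that matter are the ones you measure with your own application.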

To wrap things up here, we have looked at the new Nehalem architecture, the importance of BIOS/firmware/OS drivers, and memory population. Your application's performance will vary, but I hope I have given you some things to narrow down your performance testing. Thanks for taking the time to read this blog post. For more great performance methodology tips, please check out Shannon Cepeda’s blog posts on performance tuning.


The following are some considerations prior to tuning your MP Xeon 7400 series server. I can speak to this subject as I was asked to tune this system using the TPC-C and TPC-E benchmarks for internal measurements at Intel. While you may not be setting up thousands of hard disk spindles for your performance work, this blog post attempts to capture some of the key tuning considerations of this Xeon-based server.

 

Understand your system

The key to tuning any system, whether it is a Formula One race car (I promise to stay away from silly car performance analogies) or a server, is to understand it. Identify which components have an effect on performance and which don't. This will narrow down your tuning efforts.

 

Architecture

Like all of Intel's platforms, an MP Xeon 7400 series server is made of several ingredients. Of course, since I work for Intel, I need to start the ingredient list with the central processor. Our website has a good description of this processor here. The MP Xeon 7400 processor is made up of three Core 2 Duo T5000/T7000 series processors. This provides six (yes, six) cores for processing goodness. Each of the Core 2 Duo T5000/T7000 series processors provides two 32KB level 1 caches per core (one for data and one for code) and a 3MB level 2 unified cache. In addition to these two levels of cache, the MP Xeon 7400 processor provides a 16MB level 3 unified cache. The other major ingredient of this platform is the Intel® 7300 Chipset. This chipset provides four independent front side bus links to the four CPU sockets. In addition, this chipset provides a snoop filter and four channels of FBD memory.

 

If some is good, then more is better:

The key thing to take away here is that an MP Xeon 7400 system fully populated with top bin processors will provide a whopping 24 cores of processing power in a four socket system. This is great for the enterprise benchmarks I use for performance testing as those applications are multithreaded and designed for multi-core processors. The same may not be true for your application, so please keep that in mind.

 

Another thing to remember is that the MP Xeon 7400 processor's design follows a growing pattern in the Xeon processor family. Specifically, I am referring to the addition of the level 3 cache (L3). This is also known as the last level cache (LLC). This follows the design of the Potomac (Xeon MP 64-bit) and Tulsa (7100-series) processors. The value of the large LLC is that it reduces the number of cache misses that would require the machine to go to FBD memory for the latest copy of a cache line. This additional level of on-chip cache comes at a price, though: higher latency. While the latency penalty is relatively low compared to the latency to memory, it is important to mention it here. Again, the LLC greatly benefits the enterprise benchmarks I use for performance testing, as they have a large memory footprint. The same may not be true for your application.
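A simple way to reason about this trade-off is the average latency seen by a request that misses in the L2. With an LLC, every such request pays the LLC lookup, and only the ones that also miss the LLC pay the trip to memory. The sketch below uses made-up latency numbers purely for illustration; they are not measurements of this processor.

```python
def average_latency_ns(llc_latency_ns, memory_latency_ns, llc_hit_rate):
    """Average latency of an L2 miss when an LLC sits in front of memory.

    Every request pays the LLC lookup; LLC misses also pay memory latency.
    """
    return llc_latency_ns + (1.0 - llc_hit_rate) * memory_latency_ns


LLC_NS = 15.0      # hypothetical LLC latency
MEMORY_NS = 100.0  # hypothetical memory latency

for hit_rate in (0.0, 0.5, 0.8):
    avg = average_latency_ns(LLC_NS, MEMORY_NS, hit_rate)
    print(f"LLC hit rate {hit_rate:.0%}: {avg:.1f} ns on average "
          f"(memory alone: {MEMORY_NS:.1f} ns)")
```

With a large, cache-friendly footprint the hit rate is high and the LLC is a clear win. If your application almost never hits in the LLC, the extra lookup is pure overhead, which is why the same may not be true for your application.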

 

 

BIOS / Firmware / Drivers

It is very important to remember to update your system's BIOS, firmware, and OS drivers before you do any deep performance tuning. I cannot overstate the importance of this step. Your system's manufacturer should be able to provide the latest BIOS and firmware associated with your server. OS drivers are available through many sources these days. Typically these can be downloaded from OS vendors, hardware vendors, the Linux open source community, or the platform's manufacturer.

 

Prefetchers

Intel processors have traditionally provided four prefetchers. These are accessible via the model-specific register IA32_MISC_ENABLE and sometimes via your OEM's BIOS. These features are meant to help the processor load data in a predictive manner to keep the cache hierarchy filled with the most pertinent cache lines. This is great if the application uses data in a somewhat predictable way. If your application uses cache lines in a random fashion, then the prefetchers may negatively impact performance. My best advice is to test your application with the prefetchers enabled and disabled. Table B-3 (MSR 0x1A0) in this link covers the prefetchers I am referring to.
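If you want to check the current prefetcher settings from software, a sketch along these lines may help. It assumes a Linux system with the `msr` kernel module loaded and root privileges. The bit positions in `PREFETCHER_DISABLE_BITS` are my reading of Intel's MSR tables for processors of this generation, so treat them as assumptions and verify them against the table for your exact CPU before relying on them.

```python
import os
import struct

IA32_MISC_ENABLE = 0x1A0

# name -> "disable" bit position (assumed; verify for your processor).
PREFETCHER_DISABLE_BITS = {
    "hardware_prefetcher": 9,
    "adjacent_cache_line_prefetch": 19,
    "dcu_prefetcher": 37,
    "ip_prefetcher": 39,
}


def decode_prefetchers(msr_value):
    """Return {name: True if enabled}. A set bit means 'disabled'."""
    return {
        name: ((msr_value >> bit) & 1) == 0
        for name, bit in PREFETCHER_DISABLE_BITS.items()
    }


def read_msr(register, cpu=0):
    """Read a 64-bit MSR via the Linux msr driver (root, `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, register))[0]
    finally:
        os.close(fd)


# Decoding a made-up register value in which only bit 9 is set:
print(decode_prefetchers(1 << 9))
```

On a real system you would call `decode_prefetchers(read_msr(IA32_MISC_ENABLE))`. Changing the bits is best done through the BIOS options where they exist, and always re-measure after each change.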

 

Memory Population

As mentioned before, an MP Xeon 7400 series server provides four channels of FBD memory. There are a couple of considerations here. First, latency to memory increases for every DIMM added to the system. This is important to note because you can keep memory latency to a minimum by using fewer, higher-capacity DIMMs. Second, be sure to distribute the DIMMs evenly across all the channels. In other words, don't fill up all the slots on one channel and then lightly populate the rest.
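As a quick sanity check on a planned configuration, the even-distribution rule can be sketched like this. The helper names are made up, and slot limits per channel are ignored for simplicity.

```python
def distribute_dimms(total_dimms, channels=4):
    """Spread DIMMs across channels as evenly as possible."""
    base, extra = divmod(total_dimms, channels)
    # The first `extra` channels get one additional DIMM.
    return [base + (1 if i < extra else 0) for i in range(channels)]


def is_balanced(per_channel):
    """True when every channel holds the same number of DIMMs."""
    return len(set(per_channel)) == 1


for total in (8, 10):
    layout = distribute_dimms(total)
    print(total, "DIMMs ->", layout, "balanced" if is_balanced(layout) else "uneven")
```

A total that is a multiple of four gives a perfectly balanced layout; anything else leaves some channels carrying more DIMMs than others.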

 

An External Factor that may affect performance

Like many Intel designs, an MP Xeon 7400 series server will choose dishonor over death. I am referring to how it deals with high temperatures. The FBD memory inside an MP Xeon 7400 series server makes use of a thermal monitor on each DIMM. If the memory becomes too hot, the chipset will begin to throttle memory bandwidth in an effort to reduce the temperature of the system. This will have a drastic negative impact on performance. So, keep your server room nice and cool.

 

 

 

 

To wrap things up here, we have looked at the architecture, the importance of BIOS/firmware/OS drivers, the prefetchers, memory population, and the effects of high temperatures. Your application's performance will vary, but I hope I have given you some things to narrow down your testing. So, by now you might be asking, "Where do I start?" Well, not to be too self-serving, but I would check out more of our blog posts here. A great place to start for performance methodologies would be Shannon Cepeda's blog. This series is a great resource for anyone interested in computer performance methodologies.


Here's the 2nd follow-up post in my 10 Habits of Great Server Performance Tuners series. This one focuses on the second habit: Start at the top.

 

Let me start by relating a true (although simplified) story. My team at Intel has built up years of expertise running a particular benchmark. So when the time came to start running a new, similar benchmark, we thought: "No problem." We began running tests while the benchmark was still in development. Immediately we had an issue: the type of problem that would normally indicate our hardware environment wasn't set up properly. We checked everything that we had seen cause the issue in the past, and we couldn't find anything. So, we blamed the new benchmark. After all, we were experts and we had been setting up these environments for years! We knew what we were doing. You can probably guess where this story is going: after weeks of doing things to work around the "benchmark issue", we figured out that we had mis-configured the environment, resulting in a bottleneck on one part of our testbed. We didn't thoroughly test that part of the environment because it had never caused us problems with the old benchmark. And of course, on the new benchmark it was critical. We had broken one of the most important rules of performance tuning: Start at the Top.

 

 

So now you know how easy it can be to not Start at the Top. Even seasoned performance engineers can get overconfident and forget this rule. But the consequences can be dire:

 

  1. You have to eat major crow when you realize your mistake. I'm just now getting over the humiliation.

  2. You might have put tunings in place to address issues that weren't really there. This is at best wasted work and at worst something that you have to painstakingly undo when you fix the real issue.

 

So... how do you avoid this situation? Simple: use the Top-Down Performance Tuning process. This means you start by tuning your hardware. Then you move to the application/workload, then to the micro-architecture (if possible). What you are looking for at each level are bottlenecks: situations where one component of the environment or workload is limiting the performance of the whole system. Your goal is to find any system-level bottlenecks before you move down to the next level. For example, you may find that your network bandwidth is bottlenecked and you need to add another NIC to your server. Or that you need to add another drive to your RAID array, or that your CPU load is being distributed unevenly. Any bottleneck involving your server system hardware (processors, memory, network, HBAs, etc.), attached clients, or attached storage is a system-level bottleneck. Find these by using system-level tools (which I will touch on in the future blog for Habit #8), remove them, then proceed to the application/workload level and repeat the process.
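To show what "looking for system-level bottlenecks" can mean in practice, here is a minimal sketch. It assumes you have already collected utilization figures (as fractions of capacity) with your system-level tools; the sample numbers and the 90% threshold are invented for illustration.

```python
def find_bottlenecks(utilization, threshold=0.90):
    """Return components at or above the threshold, busiest first."""
    hot = [name for name, used in utilization.items() if used >= threshold]
    return sorted(hot, key=lambda name: utilization[name], reverse=True)


# Hypothetical measurements from one test run.
sample = {"cpu": 0.45, "memory": 0.60, "network": 0.97, "storage": 0.92}

print(find_bottlenecks(sample))
```

With these sample numbers the network and storage stand out while the CPUs sit half idle, which is exactly the kind of system-level problem to fix before touching the application.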

 

 

 

Being vigilant about using the top-down process will ensure you don't waste time tuning a non-representative system. And it just may save you some embarrassment!

 

 

Always measure your bottlenecks!

 

 

Keep watching The Server Room for information on the other 8 habits in the coming weeks.

 

 

 


Take a look at the chart below ... it's telling you something... isn't it?

It's more than performance numbers and marketing, it's data... REAL data!

But what does it mean - and ultimately - how can you relate to it?

 

 

If you're really into high-powered computing, you're probably quite familiar with common benchmark data. With every new CPU release, there are tons of new statistics, models, and ways to test the increased performance of the newer technology - in this case, the 45nm-based CPUs launched this month. But what exactly does all this data amount to? Reading benchmarks is more than just seeing a bar chart - there's a science to digging into the data...

 

First, let's take a step back for those of you who may not fully understand what benchmarking is for. Benchmarks help to provide a common ground for comparing the performance of various systems across different CPU/system architectures. A common set of instructions (or programs) is set up to run under regulated guidelines to ensure the testing is performed equally across the competing platforms or architectures. It is very much like sports: if you have two different runners, they run the same course, e.g. the 100 yard dash. This creates the comparative benchmark.
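To make that idea concrete, here is a minimal sketch of a benchmark harness: the same workload is run several times under the same conditions, and the timings are summarized so they can be compared across systems. The workload `sum_of_squares` is just a stand-in, and this is nowhere near as rigorous as an industry-standard benchmark.

```python
import statistics
import time


def benchmark(workload, runs=5):
    """Time `workload` several times and summarize the results."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "best": min(timings),
        "median": statistics.median(timings),
    }


def sum_of_squares():
    """Stand-in CPU-bound workload (hypothetical)."""
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    result = benchmark(sum_of_squares)
    print(f"best {result['best']:.6f} s, median {result['median']:.6f} s "
          f"over {result['runs']} runs")
```

Repeating the run and reporting more than one number is what makes a result repeatable. A single timing tells you very little.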

 

So let's get back to the latest hot stuff - the Intel Xeon 5400 Series and Core 2 Extreme QX9650 quad-core processors. In the past 18 months, computing models have taken a giant leap forward by adding more cores per socket, thereby increasing the thread density of your platform. In dual-socket systems, where you used to have two threads, you now have four or even eight! And in quad-socket systems the count can go up to 16! You're increasing your capacity to process data by a factor of 3 or 4, depending on the platform. This has made a tremendous change in how benchmarks have to be set up and run, and we have to evaluate the testing methods to ensure we're making full use of each platform.

 

There are a few key steps to take before you consider benchmarking your system:

  1. identify your problem area (processing power, network bandwidth, memory utilization, etc)

  2. identify your competing products

  3. evaluate the 'leaders' in your problem area

  4. survey for available benchmarking tools

  5. evaluate 'best practices' for testing (e.g. lower idle power based processors won't really help much if you're only doing high-end computing)

  6. and then - implement your findings in your chosen architecture(s)

 

In the high-end server space you usually see more vendor-specific data than end-user testing, primarily because of the finite set of data that server administrators are looking for. Many of these 'industry standards' are monitored for efficiency and assure the end-user that the testing was properly performed and the results are repeatable:

 

Industry Standard Benchmarks

 

Intel uses many of these standards for benchmarking - as you can see here in the Xeon 5000 Series based Processors Benchmark Page

 

Even if you're a server admin, you most likely interact with clients for day to day performance as well. If you search the web for CPU benchmarks the most commonly viewed benchmarks are performed on the client side of computing, mainly because of a few factors:

 

  1. clients are usually cheaper and more abundant to test with

  2. visuals in client computing are usually more fun to watch than seeing SQL data fly across the screen (hey - just being honest here!)

  3. and servers in general are built for more specific reasons, whether it's application, storage, modeling or other specialties

 

Many of you have probably heard of benchmark sites such as Anandtech, Tom's Hardware, FiringSquad, HardOCP and many others (respond with your favorites please!). Each of these sites uses common tools/applications to benchmark the latest and greatest hardware against each other. What you're looking to do with your hardware really determines what and how you want to benchmark your system (or which published reviews of your configuration to look for). After all, a machine that can run the latest games at over 60 frames per second may not be the best SQL server for your datacenter - right?

 

If you're looking for quick 'brute force' computational tools to try your hand at CPU benchmarking, try something simple like BOINC or Super PI, or get more elaborate with methods described by CNET that use Cinebench or SiSoftware Sandra. Once you've figured out some of the basics - and can repeat these simpler tests - you can jump into those Industry Standards and get into some serious work!

 

So in closing, there are many variables to account for when looking to validate the performance of a given system. Processor speeds, I/O subsystem configuration, memory latencies, network bandwidth, power utilization, etc... the permutations are nearly endless. So you have to be diligent in first identifying your key problem(s), and then attack the solution by benchmarking with the best known methods. Also, when reading benchmark information, BE SURE to read the configurations of the systems in question. Are they truly comparable? Are the components running at spec level or overclocked? Are the speed differences negligible, or substantial in real-world evaluation? And finally, focus on what's important to you and your computing requirements - after all, you need to be sure you've picked the correct system for your needs.
