
The Server Room Blog

27 Posts tagged with the performance_tuning tag

Here’s the 8th follow-up post in my 10 Habits of Great Server Performance Tuners series. This one focuses on the eighth habit: Use the Right Tool for the Job.


There are many different reasons why people undertake performance analysis projects. You could be looking to fine-tune your compiler-generated assembly code for a particular CPU, trying to find I/O bottlenecks on a distributed server application, or trying to optimize power performance on a virtual server, just to name a few. As I discussed in habit 2, there are also different levels where you can focus your investigation – mainly the system, application, and macro or micro-architecture levels.

It can be overwhelming thinking of all the different ways to collect and analyze data and trying to figure out which methods apply to your particular situation. Luckily there are tools out there to fill most needs. Here are some of the things you should consider when trying to find the tool(s) that are right for you.

 

  1. Environment – Many tools work only in specific environments. Think about your needs – are you going to be performing analysis in a Windows or Linux environment, or both? If you are analyzing a particular application, is it compiled code, Java*, or .NET* based? Is the application parallel? Are you running in a virtual environment?
  2. Layer – Will you be analyzing at the system, application, or micro-architecture level, or all three? At the system level, you are focusing primarily on things external to the processor – disk drives, networks, memory, etc. At the application level you are normally focused on optimizing a particular application. At the micro-architecture level you are interested in tuning how code is executed on a particular processor’s pipeline. Each of these necessitates a different approach.
  3. Software/Hardware Focus – Finally consider whether you will mainly be tuning the software or the hardware (platform and peripherals) or both. If you plan to do code optimization, you will need a tool with a development focus.
  4. Sampling/Instrumentation - For software optimization tools in particular, there are 2 main methods used to collect data. Sampling tools periodically gather information from the O/S or the processor on particular events. Sampling tools generally have low overhead, meaning they don’t significantly increase the runtime of the application(s) being analyzed. Instrumentation tools add code to a binary in order to monitor things like function calls, time spent in particular routines, synchronization primitives used, objects accessed, etc. Instrumentation has a higher overhead, but can generally tell you more about the internals of your application.
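
To make the instrumentation idea in item 4 concrete, here is a minimal Python sketch. It is not how any of the tools below work internally; it only shows the general technique of wrapping a function with extra code that counts calls and accumulates time. The names `instrumented` and `parse_record` are hypothetical.

```python
import functools
import time
from collections import defaultdict

# Per-function statistics gathered by the instrumentation wrapper.
call_counts = defaultdict(int)
total_seconds = defaultdict(float)

def instrumented(func):
    """Wrap func so that every call is counted and timed."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # This bookkeeping is the extra overhead instrumentation adds.
            call_counts[func.__name__] += 1
            total_seconds[func.__name__] += time.perf_counter() - start
    return wrapper

@instrumented
def parse_record(line):
    # Hypothetical unit of work standing in for real application code.
    return line.split(",")

for row in ["a,b", "c,d", "e,f"]:
    parse_record(row)

print(call_counts["parse_record"], total_seconds["parse_record"])
```

The wrapper runs on every call, which is why instrumentation tends to cost more than periodic sampling but can report exact call counts.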

After determining your specific needs, take a look at the tools out there (you might start with the lists available on Wikipedia or at HP’s Multicore Toolkit). Of course I recommend you also check out the Intel® Software Development Products. There are several specifically for performance analysis:

  • Intel® VTune™ Performance Analyzer works in both Windows* and Linux* environments and provides both sampling and an instrumented call graph. The sampling functionality can be used to perform analysis at all levels – system, application, and micro-architecture. It is multi-core aware, supports Java* and .NET* and also allows developers to identify hot functions and pin-point lines of code causing issues.
  • Intel® Thread Profiler is supported on Windows*. It is a developer-focused tool that uses instrumentation to profile a threaded application. It supports C, C++, and Fortran applications using native threading, OpenMP*, or Intel® Threading Building Blocks. Intel® Thread Profiler can show you concurrency information for your application and help you pinpoint the causes of thread-related overhead.
  • Intel® Parallel Amplifier Beta plugs into Microsoft* Visual Studio and allows C++ developers to analyze the performance of applications using native Windows* threads. It uses sampling and a low-overhead form of instrumentation to show you your application's hot functions, concurrency level, and synchronization issues.

Finding the right tool for your situation can greatly reduce frustration and the time needed to complete your project.  Good luck, and keep watching The Server Room for information on the last 2 habits in the coming months.


Here’s the 7th follow-up post in my 10 Habits of Great Server Performance Tuners series. This one focuses on the seventh habit: Document and Archive.


I hope the reason why you need to document and retain data for any performance project is understood, so I won’t go into it. Nor will I recommend particular documentation solutions – just find a database or filing solution you like that gets the job done. What I will do is list what needs to be documented.

Normally, performance tuning consists of iterating through experiments. So, for each experiment, it is important to document:

  • What changes were made – hopefully you weren’t trying too many things at once!
  • The purpose – why you tried this particular thing (including who requested it, if appropriate)
  • General information – date & location of testing, person conducting the test
  • Hardware configuration:
    • Platform hardware and version, BIOS version, relevant BIOS option settings
    • CPU model used, number of physical processors, number of cores per processor, frequency, cache size information, whether Hyper-Threading was used (cpu-z can help document all this)
    • Memory configuration – number of DIMMs and capacity per DIMM, model number of DIMMs used
    • I/O interfaces – model number of all add-in cards, slot number for all add-in cards, driver version for all devices (on Windows*, msinfo can help with this; on Linux*, lspci)
    • Any other relevant hardware information, such as NIC settings, external storage configuration, external clients used, etc if it affects your workload
  • Software configuration:
    • Operating System used, version, and service pack/update information (use msinfo on Windows systems, uname on Linux systems)
    • Version information for all applications relevant to your workload
    • Compiler version and flags used to build your application (if you are doing software optimization)
    • Any other relevant software information, such as third-party libraries, O/S power utilization settings, pagefile size, etc if it affects your workload
  • Workload configuration:
    • Anything relevant to how your experiment/application was run, for example, your application’s startup flags, your virtualization configuration, benchmark information, etc
  • Results and data - naturally you would store all the above information along with the results and data that accompany your experiment
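
As a rough illustration of the checklist above, the following Python sketch gathers a few of the listed items into one record that can be archived next to the results. It only covers what the standard `platform` module can report; items such as BIOS settings or DIMM models would need other sources. The function name, field names, and sample values are all hypothetical.

```python
import json
import platform
from datetime import date

def build_experiment_record(change, purpose, tester, results):
    """Collect one experiment's documentation into a plain dictionary."""
    return {
        "change": change,
        "purpose": purpose,
        "general": {"date": date.today().isoformat(), "tester": tester},
        "software": {
            "os": platform.system(),
            "os_release": platform.release(),
            "python": platform.python_version(),
        },
        "hardware": {
            "machine": platform.machine(),
            # platform.processor() can return an empty string on some systems.
            "processor": platform.processor(),
        },
        "results": results,
    }

record = build_experiment_record(
    change="Disabled hardware prefetcher",
    purpose="Check effect on a random-access workload",
    tester="example-user",
    results={"transactions_per_second": 1234.5},  # placeholder value, not real data
)
print(json.dumps(record, indent=2))
```

Writing the record as JSON keeps it easy to store in whatever database or filing solution you prefer.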

This blog entry is also the appropriate place for me to mention the role of automation in your tuning efforts. If you are going to be doing a significant number of experiments, invest the energy needed to set up an automation infrastructure – a way to run your tests and collect the appropriate data without human attention. I included links to automated ways to gather the above data where appropriate.

Keep watching The Server Room for information on the other 3 habits in the coming months.


The following are some considerations prior to tuning your MP Xeon 7400 series server. I can speak to this subject as I was asked to tune this system using the TPC-C and TPC-E benchmarks for internal measurements at Intel. While you may not be setting up thousands of hard disk spindles for your performance work, this blog post attempts to capture some of the key tuning considerations of this Xeon-based server.

 

Understand your system

The key to tuning any system, whether it is a Formula One race car (I promise to stay away from silly car performance analogies) or a server, is to understand it. Identify what components have an effect on performance and what components don't. This will narrow down your tuning efforts.

 

Architecture

Like all of Intel's platforms, an MP Xeon 7400 series server is made of several ingredients. Of course since I work for Intel I need to start out the ingredient list with the central processor. Our website has a good description of this processor here. The MP Xeon 7400 processor is made up of three Core 2 Duo T5000/T7000 series processors. This provides six (yes six) cores for processing goodness. Each of the Core 2 Duo T5000/T7000 series processors provides two 32KB level 1 caches (one for data and one for code) and a 3MB level 2 unified cache. In addition to these two levels of cache the MP Xeon 7400 processor provides a 16MB level 3 unified cache. The other major ingredient to this platform is the Intel® 7300 Chipset. This chipset provides four independent front side bus links to the four CPU sockets. In addition, this chipset provides a snoop filter and four channels of FBD memory.

 

If some is good, then more is better:

The key thing to take away here is that an MP Xeon 7400 system fully populated with top bin processors will provide a whopping 24 cores of processing power in a four socket system. This is great for the enterprise benchmarks I use for performance testing as those applications are multithreaded and designed for multi-core processors. The same may not be true for your application, so please keep that in mind.

 

Another thing to remember is that an MP Xeon 7400 processor's design follows a growing pattern in the Xeon processor family. Specifically, I am referring to the addition of the level 3 cache (L3). This is also known as the last level cache (LLC). This follows the design of the Potomac (Xeon MP 64-bit) and Tulsa (7100-series) processors. The value of the large LLC is that it reduces the number of cache misses that would require the machine to go to FBD memory for the latest copy of a cache line. This additional level of on-chip cache comes at a price, though: higher latency. While the latency penalty is relatively low when compared to the latency to memory it is important to mention it here. Again, the LLC greatly benefits enterprise benchmarks I use for performance testing as they have a large memory footprint. The same may not be true for your application.

 

 

BIOS / Firmware / Drivers

It is very important to remember to update your system's BIOS, firmware, and OS drivers before you do any deep performance tuning. I cannot overstate the importance of this step. Your system's manufacturer should be able to provide the latest BIOS and firmware associated with your server. OS drivers are available through many sources these days. Typically these can be downloaded from OS vendors, hardware vendors, the Linux open source community, or the platform's manufacturer.

 

Prefetchers

Intel processors have traditionally provided four prefetchers. These are accessible via model specific register IA32_MISC_ENABLE and sometimes via your OEM's BIOS. These features are meant to help the processor load data in a predictive manner to keep the cache hierarchy filled with the most pertinent cache lines. This is great if the application uses data in a somewhat predictable way. If your application uses cache lines in a random fashion, then the prefetchers may negatively impact performance. My best advice for you is to test your application with the prefetchers enabled and disabled. Table B-3 (MSR 0x1A0) in this link covers the prefetchers I am referring to.

 

Memory Population

As mentioned before, an MP Xeon 7400 series server will provide four channels of FBD memory. There are a couple of considerations here. First, latency to memory increases for every DIMM added to the system. This is important to note because you can keep the memory latency to a minimum by adding fewer high capacity DIMMs. Second, be sure to evenly distribute the DIMMs across all the channels. In other words, don't fill up all the slots on one channel and then lightly populate the rest.
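
The "evenly distribute" advice can be sketched in a few lines of Python. This is only an illustration of spreading a DIMM count across four channels as evenly as possible; it ignores slot ordering rules, which your platform's documentation defines. The function name is hypothetical.

```python
def distribute_dimms(dimm_count, channels=4):
    """Spread DIMMs across channels so no channel has more than one extra."""
    base, extra = divmod(dimm_count, channels)
    return [base + (1 if index < extra else 0) for index in range(channels)]

print(distribute_dimms(8))  # two DIMMs on each of the four channels
print(distribute_dimms(6))  # two channels get two DIMMs, two channels get one
```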

 

An External Factor that may affect performance

Like many Intel designs, an MP Xeon 7400 series server will choose dishonor over death. I am referring to how it deals with high temperatures. The FBD memory inside an MP Xeon 7400 series server makes use of a thermal monitor on each DIMM. If the memory becomes too hot the chipset will begin to throttle memory bandwidth in an effort to reduce the temperature of the system. This will have a drastic negative impact on performance. So, keep your server room nice and cool.

To wrap things up here, we have looked at the architecture, the importance of BIOS/firmware/OS drivers, the prefetchers, memory population, and the effects of high temperatures. Your application's performance will vary, but I hope I have given you some things to narrow down your testing. So, by now you might be asking, "Where do I start?" Well, not to be too self-serving, but I would check out more of our blog posts here. A great place to start for performance methodologies would be Shannon Cepeda's blog. This series is a great resource for anyone interested in computer performance methodologies.


If there is one thing that has stayed consistent in the computing industry over time, it's that performance doesn't stand still. As our computing platform processing, I/O, and memory speeds continue to accelerate, it is important to remember a little thing called latency.

 

Often in the Ethernet world throughput is the first and last performance metric of choice. 1 Gigabit and 10 Gigabit are the numbers that inspire thoughts of increased performance and improved computing power. However, it's important to note that, in many applications, the transaction latency over the wire is really the key to unlocking high performance at the system level. One of the primary reasons that some organizations have turned to Infiniband and other I/O technologies for HPC and clustering in the past has to do with their desire to achieve very low latencies, not necessarily increased throughput. If you look at a historical standard Gigabit Ethernet connection, you may see latencies that are around 125μs. This may have been OK in the past, but as improvements at the application level as well as in the system hardware and CPU take hold, legacy Ethernet won't be good enough for HPC and clustering environments.

 

 

The interesting and often overlooked fact with Ethernet is that the latency characteristics are improving as the industry moves from 1 Gigabit to 10 Gigabit. The faster throughput on the wire comes along with lower latency to some extent, but in addition, there have been several improvements in interrupt handling that drastically improve overall latencies when comparing legacy 1 Gigabit to 10 Gigabit. With a basic 1st generation Intel® 10 Gigabit CX4 card you can now see latencies approach 25μs without any special tuning.

 

 

What's even better is that Intel's 10 Gigabit networking silicon also has further enhancements for improving latency by introducing some new specialized Low Latency Interrupt (LLI) filters in the silicon. These filters provide the hardware with a quicker reaction time to network packets that meet certain customizable criteria. The filters can be tuned to have a rapid response to certain packet and traffic types. With these kinds of LLI filters in place, latencies can be reduced further by another ~50% to ~14μs.
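
To see why latency matters independently of throughput, consider a strictly serialized exchange in which each transaction must finish before the next begins. The sketch below is a simplification: it assumes the quoted latency is the whole per-transaction cost and ignores processing time. It turns the three latency figures mentioned so far into an upper bound on transactions per second for a single such stream.

```python
def serialized_transactions_per_second(latency_us):
    """Upper bound when each transaction waits for the previous one to finish."""
    return 1_000_000 / latency_us

for label, latency_us in [
    ("legacy 1 Gigabit", 125),
    ("10 Gigabit CX4", 25),
    ("10 Gigabit with LLI filters", 14),
]:
    rate = serialized_transactions_per_second(latency_us)
    print(f"{label}: {rate:,.0f} transactions per second")
```

Under that assumption, the bound rises from 8,000 per second at 125μs to 40,000 at 25μs and roughly 71,000 at 14μs, regardless of how much bandwidth the link offers.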

 

 

Going forward with 10 Gigabit there are new technologies and designs that can help push latency even lower to the sub-10μs threshold to keep Ethernet very competitive as a fabric not only from a cost and throughput perspective, but also from the perspective of latency.

 

 

And while lower latency is certainly important, the last piece that was really missing from the Ethernet performance puzzle was not just low latency, but deterministically low latency. The key is that the worst case packet latencies for many applications are relevant and very important. By application thread affinitization, the individual data thread can be piped directly between a network queue and a CPU core. By more evenly distributing the networking workload between CPU cores in a predictable fashion, you get a deterministic kind of latency that does not stray far from the average assuming CPU cores do not get oversubscribed. Average latency of ~14μs is good, but the fact that you can get this with reasonable determinism is a key for many applications and usages.
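
The difference between a low average and a deterministically low latency shows up when you look at the tail of the distribution. The sketch below uses two made-up sample sets (they are illustrations, not measurements) that share the same 14μs average but have very different worst cases.

```python
import statistics

def latency_summary(samples_us):
    """Report mean, nearest-rank 99th percentile, and worst-case latency."""
    ordered = sorted(samples_us)
    # Integer ceiling of 0.99 * n, to avoid floating-point rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return {
        "mean": statistics.mean(ordered),
        "p99": ordered[rank - 1],
        "worst": ordered[-1],
    }

# Hypothetical samples in microseconds.
steady = [14] * 100             # every packet takes 14 us
spiky = [10] * 98 + [210] * 2   # mostly faster, with two slow outliers

print(latency_summary(steady))
print(latency_summary(spiky))
```

Both sets average 14μs, but the second has a 99th percentile and worst case of 210μs, which is the kind of behavior an application sensitive to worst-case latency would notice.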

 

 

Now, lower, deterministic latency is not just a theoretical benefit for certain niche applications. Decreasing latency and improving overall latency characteristics while increasing throughput directly benefits the transaction rates that can be achieved with real world applications. One example of the improved performance comes from the latest Reuters Market Data System (RMDS) benchmarks done by STACResearch on the 4-way Intel® Xeon E7450 (Dunnington) using the Intel® 82598EB 10 Gigabit AT Dual Port networking adapter. The testing showed the highest point-to-point server throughput to date on a single server in testing done by STAC, and total updates per second reached over 15 million. Financial Service industry administrators: I can see you drooling...

 

 

Latency and throughput numbers are great to talk about, but at the end of the day, real world application performance on real systems is the key. While there will always be a small subset of the high end server market that needs the absolute lowest latencies provided by Infiniband, 10 Gigabit Ethernet is gaining ground while maintaining its place as the default fabric of choice for multiple applications and traffic types. I believe the best is yet to come as newer, faster, and more responsive technologies continue to roll out.

 

 

Ben Hacker


Here's the 6th follow-up post in my 10 Habits of Great Server Performance Tuners series. This one focuses on the sixth habit: Try 1 Thing at a Time.

 

 

Like habit 2, Start at the Top, this habit looks easy to understand and to keep. But, due to the constant desire for productivity, I and most others I know in the performance community have broken it many times. Sometimes I even get away with it. But trying to keep this habit is important, because when I don't get away with it, breaking this rule results in even more work than I was trying to save.

 

 

The concept behind this habit is simple - when you are optimizing your platform or your code, make only one change at a time. This allows you to measure the effect of each change, and only accumulate the positive changes (however small) into your workload. I have seen instances, for example, where 2 small changes applied at the same time to a workload cancelled each other out: one caused a small decrease in performance and the other a small increase. If these changes weren't tested individually, we would have missed out on that performance gain.
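
Here is a small Python sketch of that cancelling-out scenario. The baseline and the effect of each change are invented numbers, and `measure` simply multiplies them together; in a real project it would run your workload and return the measured metric.

```python
BASELINE_TPS = 1000.0  # hypothetical baseline, transactions per second

# Hypothetical multiplicative effect of each change on the metric.
effects = {"change_a": 0.97, "change_b": 1.03}

def measure(applied_changes):
    """Stand-in for running the workload with the given changes applied."""
    score = BASELINE_TPS
    for name in applied_changes:
        score *= effects[name]
    return score

# Applied together, the two changes nearly cancel out.
both = measure(["change_a", "change_b"])

# Tested one at a time, only the beneficial change is kept.
kept = [name for name in effects if measure([name]) > BASELINE_TPS]

print(f"both together: {both:.1f} tps")
print(f"kept after individual tests: {kept}")
```

Testing both at once shows about 999 tps and looks like no change, while the individual tests reveal that one change is worth keeping.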

 

 

Another thing that can happen in a complex workload is that two changes that seem independent can interact with each other. As many developers know from fixing bugs, changing one thing may affect something else. Keeping all your changes separate can help you identify these interactions more easily.

 

 

You may be wondering when it is acceptable to break this habit. I think of performance methodology, and this rule in particular, as similar to the scientific method we learned in school. It's always good to follow it - doing so will help you quantify your successes and failures, stay organized, and defend your conclusions - but, you can still make a big breakthrough without it. In some cases, like when you are making small local changes to source code in completely different modules, or when you are changing two things you are certain won't interact, the habit can be broken. But the advice I give, especially to those involved in long-term optimization projects, is to follow it.

 

 

What has your experience been? Please share your "changing multiple things at one time" stories.

 

 

Keep watching The Server Room for information on the other 4 habits in the coming weeks.


I am currently sitting in PDX (Portland, Oregon) waiting for my flight to Dallas, then on to Sao Paulo, Brazil, then on to Porto Seguro, Brazil.

 

This is where our PR team is running an event called Intel Editor's Day (IED). This is the 3rd such event this year and I have had the pleasure of presenting Server Benchmarking at each event. The first was in Mexico and the second was in Costa Rica a few weeks ago. IED offers regional journalists a chance to get product information and demonstrations from Intel so they can be ready to report what they see, review, and evaluate when looking at these products in the market.

 

It's a team effort with these events. I am 'the server guy', yet I am carrying a few MIDs and even an Atom demo card for my brethren (and sisteren?). I even have a wafer with me... something of the 45nm type. ;o)

 

The reason I am going is to talk with the 30+ journalists about distinguishing between client and server benchmarks. If you (the reader) don't know that there is a difference, there definitely is and you should let me know so we can offer some information for you to read about it.

 

My goal at IED is to show a few benchmarks, talk about a few more (SPEC, TPC, etc.), and ultimately learn about how they might run these benchmarks. Education is the goal, but it goes both ways.

 

I'll add more when I get a chance... I have to grab a bite before getting on the plane.


Here's the 5th follow-up post in my 10 Habits of Great Server Performance Tuners series. This one focuses on the fifth habit: Know Your Workload.

 

Spend some time getting to know your workload.

 

 

 

 

 

The idea of a "workload" is integral to the concept of performance. The workload is the set of software and tests that you run on the server in order to measure its performance. Also part of the workload is the concept of the "metric", meaning the number you will use to quantify performance. You should understand as much as you can about your workload in order to characterize and interpret your system's execution.

 

 

Let's look at the real-life example of a car's fuel economy. The EPA measures fuel economy using 2 workloads: city and highway. Each workload tests different aspects of the car's performance, and the metric used to quantify that performance is miles per gallon (MPG). Like the EPA's fuel economy test, a good workload for server performance tuning should have the following three characteristics:

 

  • Measurable - There is a quantifiable metric.

  • Reproducible - Measurements are repeatable and consistent.

  • Representative - The workload should be typical of normal operating conditions and should stress the parts of the system (including code) where performance is most critical.

 

Depending on the usage model for the server(s) you are tuning, some example appropriate workloads might be: loading websites, processing XML, encoding/decoding MP3s, responding to database queries, rendering frames, etc. Metrics could be time to run, number of users serviced, transactions processed per second, etc. If your metric is time, take special care that you are measuring it accurately.

 

After choosing or creating a suitable workload, spend some time getting to know it. Measure the variance between runs. Use O/S and processor-level tools (to be discussed in the blog for habit #8) to sample the workload's characteristics at various points during its execution.
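
One simple way to measure the variance between runs is to time several runs and compute the coefficient of variation (standard deviation divided by mean). The sketch below assumes a time-based metric and uses a hypothetical `sample_workload` function as a stand-in for your real workload.

```python
import statistics
import time

def time_runs(workload, runs=5):
    """Run the workload several times and return each duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution timer
        workload()
        durations.append(time.perf_counter() - start)
    return durations

def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean (0.05 means 5%)."""
    return statistics.stdev(values) / statistics.mean(values)

def sample_workload():
    # Hypothetical stand-in for a real workload.
    sum(i * i for i in range(50_000))

durations = time_runs(sample_workload)
print(f"mean={statistics.mean(durations):.6f}s "
      f"cv={coefficient_of_variation(durations):.1%}")
```

A workload whose coefficient of variation is larger than the improvements you hope to measure will make small gains hard to confirm.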

 

 

One thing to remember about sampling is that you want to make your sample interval at least as long as the amount of time it takes to complete a unit of work in your workload. For example, suppose your workload is a stream of web page requests and you are measuring response time. If the longest response time you see is about 2 seconds, then you want to make sure you take samples over 2 seconds in length. It's best to use a multiple of your longest operation time, so 4 or 6 seconds in this case. This way you can be sure your samples include one complete operation in the workload. Then try to determine if the workload is stable - meaning, do the characteristics vary at different times during execution? (If so, you will need to sample more often to understand the workload or possibly split it into phases). Use the data to get an idea of your workload's CPU, memory, network, and I/O usage.
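
The sample-length rule above is simple arithmetic, sketched below together with one possible stability check. The check (comparing the mean of the first half of the samples with the mean of the second half) and its 10% tolerance are assumptions made for illustration, not an established method. It needs at least two samples.

```python
import statistics

def sample_interval(longest_operation_s, multiple=2):
    """Sample length as a whole multiple of the longest operation time."""
    if multiple < 1:
        raise ValueError("multiple must be at least 1")
    return longest_operation_s * multiple

def looks_stable(samples, tolerance=0.10):
    """Assumed heuristic: first-half and second-half means are within tolerance."""
    half = len(samples) // 2
    first = statistics.mean(samples[:half])
    second = statistics.mean(samples[half:])
    largest = max(abs(first), abs(second))
    return largest == 0 or abs(first - second) / largest <= tolerance

# Longest response time of 2 seconds gives 4- or 6-second samples.
print(sample_interval(2), sample_interval(2, multiple=3))

# Hypothetical CPU utilization samples (percent) from two workloads.
print(looks_stable([50, 52, 51, 49]), looks_stable([20, 22, 80, 85]))
```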

 

 

At the application level, become familiar with the software stack you will use. How is the workload generated (user, clients, test files, etc)? Understand the major operations that occur - what components of the O/S are needed? What device drivers are used? And finally, study the application(s). Know whether the application(s) being tested are single- or multi-threaded and as much as you can about the internals.

 

 

Choosing (or developing) an appropriate workload is necessary for correct performance measurement and tuning. Being as familiar as you can with the workload will help you to interpret your performance data and identify areas for optimization.

 

 

Keep watching The Server Room for information on the other 5 habits in the coming weeks.


As part of the Sun Microsystems and Intel alliance, the two companies have collaborated to bring open source Threading Building Blocks (TBB) support to the Solaris Operating System (OS) and Sun Studio software toolchain. Check out the SUN Blog for additional information. Click the video below for a short interview with Deepanker Bairagi, Principal Engineer for the Sun Studio.

 

 

 

Software parallelism can unleash the processing power that the newer multi-core architectures provide, including the Quad-Core Intel® Xeon® processors. For developers, multithreading offers a software parallelism model, but many existing solutions require a lot of low-level coding. Threading Building Blocks offers a rich approach to expressing parallelism in a C++ program by offering higher-level, task-based parallelism that abstracts platform details and threading mechanisms for performance and scalability.

 

The Solaris OS is able to take advantage of multicore architectures, including the Intel Architecture, with features such as lightweight processes (LWPs), load-balancing across cores, and processor affinities. Sun Studio software offers a complete integrated toolchain for Solaris and Linux platforms, including parallelizing compilers, performance and thread analysis tools, memory and code debuggers, a NetBeans-based Integrated Development Environment, and more.

 

Combined with Threading Building Blocks, developers for the Solaris platform now have a fully loaded toolbox that simplifies the development of optimized multithreaded applications for multi-core Intel processors. Click here to learn more about Threading Building Blocks and optimizing performance for multi-core processors.

 

I would like to hear from the community on how you see this impacting the next generation of software development for Solaris running on Intel Architecture.


 

Here's the 4th follow-up post in my 10 Habits of Great Server Performance Tuners series. This one focuses on the fourth habit: Know Your BIOS.

 

 

 

My last blog talked about beginning your system tuning by consulting a block diagram. The other thing you should always look at is your system's BIOS. Many server BIOSes these days allow you to configure options that affect performance. Like everything in the performance world, which set of BIOS options will be best will depend on your workload!

 

 

First things first, how do you find this "BIOS"? Most servers have a menu called "Setup" (or something similar) that you can access while the system is booting, before it starts loading the operating system. This "Setup" menu allows you to access your system's BIOS. Changes that you make here will affect how the operating system can utilize your hardware, and in some cases how the hardware works. If you change something here, you usually have to reboot and then the change will "stick" through all future reboots (until you change it again). As platforms grow increasingly sophisticated, they are offering a widening array of user-configurable options in Setup. So a good practice is to examine all the menu options available whenever you get a new platform. Here are some of the most common options on Intel platforms that could affect performance:

 

 

  • Power Management - Intel's power management technology is designed to deliver lower power at idle and better performance/watt (without significantly lowering overall performance) in most circumstances. There are 2 types - P-States, which attempt to manage power while the processor is active, and C-States which work while the processor is idle. In some BIOSes, both of these features are combined into one option which you should enable. In other cases they are separated. If they are separate, here's what to look for:

    • Intel EIST (or "Enhanced Intel Speedstep" or "Intel Speedstep" or "GV3" on older platforms) - This is the P-State power management that works while the processor is active. Leave it enabled unless directed to change it by an Intel representative.

    • Intel C-States - If you have this option or something similar, it is referring to the power management used when the processor is idle. Enable all C-States unless directed otherwise by an Intel representative.

  • Hardware Prefetch or Adjacent Sector Prefetch - These options try to lower overall latencies in your platform by bringing data into the caches from memory before it is needed (so the application does not have to wait for the data to be read). In many situations the prefetchers increase performance, but there are some cases where they may not. If you don't have time to test these options, then go with the default. Intel tests the prefetch options on a variety of server workloads with each new processor and makes a recommendation to our platform partners on how they should be set. If, however, you are tuning and you have the time to experiment, try measuring performance using each of the prefetch setting combinations.
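
If you do have time to experiment, it helps to list every combination of prefetch settings up front so none is skipped. The sketch below only enumerates a test matrix; the option names are taken from the bullet above, and actually applying each combination still means changing it in Setup and rebooting.

```python
from itertools import product

# Option names follow the BIOS options described above; values are assumed.
prefetch_options = {
    "hardware_prefetch": ["enabled", "disabled"],
    "adjacent_sector_prefetch": ["enabled", "disabled"],
}

def setting_combinations(options):
    """Return one dictionary per combination of option values."""
    names = list(options)
    value_lists = [options[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

combos = setting_combinations(prefetch_options)
for number, combo in enumerate(combos, start=1):
    print(number, combo)
```

Two options with two values each give four runs, and each added two-way option doubles that count, which is a useful reminder to budget test time before starting.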

 

 

 

 

There are several other options that might affect performance on specific platforms. Some examples might be a snoop filter enable/disable switch, a setting to emphasize either bandwidth or latency for memory transactions, or a setting to enable or disable multi-threading. In these cases, if you don't have time to test, use your Intel or OEM representative's suggestion or go with the default setting.

 

 

Being familiar with how your system's BIOS is configured is another basic component of system tuning.

 

 

Keep watching The Server Room for information on the other 6 habits in the coming weeks.


 

Here's the 3rd follow-up post in my 10 Habits of Great Server Performance Tuners series. This one focuses on the third habit: Know Your Platform.

 

 

 

As we learned in my last blog, we should start our server performance tuning by looking for system-level bottlenecks. This involves understanding exactly how data flows into and out of your platform - and to do this, you need a block diagram. A block diagram shows the major components on the server's motherboard and the paths between them. From a good block diagram you can derive the maximum data transfer rate (aka bandwidth or throughput) achievable as data flows along those paths.

 

 

I usually look at my block diagram before beginning system tuning in order to identify potential bottlenecks. But some people use them in parallel: they measure the bandwidth of various parts of the system and then confirm what they see using the block diagram. You can determine if various parts of your system are heavily stressed, bottlenecked, or lightly utilized. In general you want to trace the path from where data enters your server (NIC, HBA, etc) up to the processor and back to memory or out of the server. The paths connecting one component to another are commonly known as buses. For each bus, multiply the speed by the width to determine the maximum potential bandwidth.

 

 

Let's use the block diagram for the Intel S5400SF server board as an example. It has 2 FSBs, each capable of 1333 or 1600 Mega-Transfers/second (MT/s). Each transfer on the FSB is 64 bits (8 bytes), so at 1600 MT/s, 8 bytes * 1,600,000,000 transfers per second gives a maximum theoretical bandwidth of 12.8 GB/s per FSB segment. Keep in mind though that in reality a bus will not achieve its theoretical maximum bandwidth - depending on the type of bus it will probably realize 66-80% of the possible throughput.
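The speed-times-width arithmetic is easy to script if you want to repeat it for every bus on your diagram. Here is a minimal sketch using the FSB figures above; the function names are invented for illustration, and the 66-80% range is the rough rule of thumb from this post, not a measured value for any particular bus.

```python
def peak_bandwidth_gbs(transfers_per_sec, bytes_per_transfer):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return transfers_per_sec * bytes_per_transfer / 1e9


def realistic_range_gbs(peak_gbs, low=0.66, high=0.80):
    """Rough achievable range, assuming 66-80% of the theoretical peak."""
    return peak_gbs * low, peak_gbs * high


# One FSB segment at 1600 MT/s, 64 bits (8 bytes) per transfer.
fsb_peak = peak_bandwidth_gbs(1_600_000_000, 8)
fsb_low, fsb_high = realistic_range_gbs(fsb_peak)

print(f"Theoretical peak: {fsb_peak:.1f} GB/s")
print(f"Likely in practice: {fsb_low:.1f} to {fsb_high:.1f} GB/s")
```

For the 1600 MT/s case this works out to a 12.8 GB/s peak and roughly 8.4 to 10.2 GB/s in practice. Swap in the speed and width of any other bus from your block diagram to get its numbers.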

 

 

 

 

So, where do you find these diagrams? If you are using an Intel server platform, the block diagrams can usually be found in the technical product specification for each board. If you purchase a platform from one of our OEM partners, ask your salesperson where to get it.

 

 

Look at the maximum bandwidth achievable on each link your data will travel over to gain a deeper understanding of how your workload will run on your platform.

 

 

Keep watching The Server Room for information on the other 7 habits in the coming weeks.

 

 

 


Here's the 2nd follow-up post in my 10 Habits of Great Server Performance Tuners series. This one focuses on the second habit: Start at the top.

 

Let me start by relating a true (although simplified) story. My team at Intel has built up years of expertise running a particular benchmark. So when the time came to start running a new, similar benchmark, we thought: "No problem." We began running tests while the benchmark was still in development. Immediately we had an issue: the type of problem that would normally indicate our hardware environment wasn't set up properly. We checked everything that we had seen cause the issue in the past, and we couldn't find anything. So, we blamed the new benchmark. After all, we were experts and we had been setting up these environments for years! We knew what we were doing. You can probably guess where this story is going: after weeks of doing things to work around the "benchmark issue", we figured out that we had mis-configured the environment, resulting in a bottleneck on one part of our testbed. We didn't thoroughly test that part of the environment because it had never caused us problems with the old benchmark. And of course, on the new benchmark it was critical. We had broken one of the most important rules of performance tuning: Start at the Top.

 

 

So now you know how easy it can be to not Start at the Top. Even seasoned performance engineers can get overconfident and forget this rule. But the consequences can be dire:

 

  • You have to eat major crow when you realize your mistake. I'm just now getting over the humiliation.

  • You might have put tunings in place to address issues that weren't really there. This is at best wasted work and at worst something that you have to painstakingly undo when you fix the real issue.

 

So...how do you avoid this situation? Simple: use the Top-Down Performance Tuning process. This means you start by tuning your hardware. Then you move to the application/workload, then to the micro-architecture (if possible). What you are looking for at each level are bottlenecks: situations where one component of the environment or workload is limiting the performance of the whole system. Your goal is to find any system-level bottlenecks before you move down to the next level. For example, you may find that your network bandwidth is bottlenecked and you need to add another NIC to your server. Or that you need to add another drive to your RAID array, or that your CPU load is being distributed unevenly. Any bottleneck involving your server system hardware (processors, memory, network, HBAs, etc.), attached clients, or attached storage is a system-level bottleneck. Find these by using system-level tools (which I will touch on in a future post for Habit #8), remove them, then proceed to the application/workload level and repeat the process.
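One simple way to make the system-level pass concrete is to compare what you measured on each component against what that component can deliver, and flag anything running close to its limit. The sketch below does that; the component names, the sample numbers, and the 80% threshold are all invented for illustration, so substitute your own measurements and pick a threshold that suits your hardware.

```python
def find_bottlenecks(measurements, threshold=0.8):
    """Flag components whose utilization is at or above the threshold.

    measurements maps a component name to (measured, capacity), both in the
    same units. Returns (name, utilization) pairs, most heavily used first.
    """
    flagged = []
    for name, (measured, capacity) in measurements.items():
        utilization = measured / capacity
        if utilization >= threshold:
            flagged.append((name, utilization))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up sample data: (measured, capacity) for each component.
sample = {
    "network (Gb/s)": (0.95, 1.0),
    "disk array (MB/s)": (120.0, 400.0),
    "cpu (%)": (55.0, 100.0),
}

for name, utilization in find_bottlenecks(sample):
    print(f"{name}: {utilization:.0%} of capacity")
```

With the sample data only the network link is flagged, which would point toward the "add another NIC" kind of fix before any application tuning starts.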

 

 

 

Being vigilant about using the top-down process will ensure you don't waste time tuning a non-representative system. And it just may save you some embarrassment!

 

 

Always measure your bottlenecks!

 

 

Keep watching The Server Room for information on the other 8 habits in the coming weeks.

 

 

 


I have been working as a full-time performance engineer at Intel for 6 years. I started by benchmarking server products for performance validation and now I focus on the TPC-C and TPC-E OLTP server benchmarks. I have used a variety of workloads in this job and spent time optimizing each level of the performance hierarchy: application, system, and processor. I, like many of you, have learned the "tricks of the trade" the hard way: by trial, error, and success. I'm sharing now, so you can all benefit from the things I've picked up along the way.

 

Let's start with some general methodologies to follow when tuning performance, whether you do it full-time, as a hobby, or just in your spare cycles after getting your "regular work" done. I will follow up with a more detailed post on each habit individually.

 

 

 

 

 

1. Ask the right question: Why are you tuning your platform? What level of performance are you hoping to achieve? What do you (or your users) care most about: raw performance, cost/performance, performance/watt, or something else?

 

 

2. Start at the top: The first and easiest part of your application server to tune is the hardware itself. Move on to the software and workload only after you feel confident that you have removed any system-level bottlenecks.

 

 

3. Know your Platform: This should be where you begin your system (hardware) tuning. The first thing, which I can't stress enough, is to get a block diagram of your platform. Then study it!

 

 

4. Know your BIOS: Server BIOSes these days come with more and more options. Be sure to give your new platform's BIOS a once-over. Pay particular attention to options relating to performance and power.

 

 

5. Know your Workload: To quantify performance, you need a workload! Some examples: web server response time, boot time, frames rendered per second, simultaneous connections supported, etc. Understand as much as possible about how the work gets done.

 

 

6. Try one thing at a time: Little changes that seem harmless can significantly alter the behavior of your system. Or worse, they can interact with each other to wreak havoc. Always try one change at a time, and for goodness' sake, do habit number 7.

 

 

7. Document and Archive: When you change something, log it! For each experiment you do, store your hardware and software configuration, performance level, and any collected data.

 

 

8. Use the right tool for the job: There are free data collection tools out there for various levels of the tuning process. System-level tools include Performance Monitor for Windows and sar for Linux. Application-level tools include Intel® VTune™ for both Windows and Linux.

 

 

9. Don't break the law: Amdahl's Law, that is. Amdahl's Law tells us the maximum amount of performance improvement we will get from a particular enhancement. Amdahl can help you set your expectations properly and clue you in to when you should be suspicious.

 

 

10. Compare apples to apples: Todd Christ reminds us of this habit in the last paragraph of this post. Don't compare the performance of mismatched systems. If you must do it, know exactly what the differences are: the processor, memory type/speed/vendor, a software component, chipset, etc. Dig into the configuration details!
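Habit 9 is the one on this list that comes with a formula, so here is a minimal sketch of it. It uses the standard form of Amdahl's Law: if a fraction p of the run time benefits from an enhancement that speeds that part up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The function names and the example numbers are made up for illustration.

```python
def amdahl_speedup(fraction_improved, local_speedup):
    """Overall speedup when a fraction of run time is sped up by local_speedup."""
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)


def amdahl_limit(fraction_improved):
    """Best possible overall speedup, however fast the improved part gets."""
    if fraction_improved >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction_improved)


# Example: 40% of the run time is affected and that part becomes 2x faster.
print(f"Expected speedup: {amdahl_speedup(0.4, 2.0):.2f}x")
print(f"Upper bound:      {amdahl_limit(0.4):.2f}x")
```

In this example the expected gain is 1.25x and the ceiling is about 1.67x. If a change that touches 40% of the run time appears to double overall performance, that is the cue to be suspicious of the measurement.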

 

 

So now you have the high-level list! Stay tuned to The Server Room for more information about each habit in the coming weeks.
