
The keynote slides from the recent Reimagine the Data Center event show Intel announcing some of the products from its “Low Power Product Direction”, with impressive figures for how low the power consumption now goes for both Xeon and Atom processors. Nevertheless, in mission-critical environments some Oracle users ask for maximum performance configurations at the expense of some additional power usage. In this scenario users will select a BIOS option such as “Maximum Performance” and assume the system will give the best possible performance all of the time.


However, they may be surprised to find that the “Maximum Performance” option is not necessarily the best choice and could actually result in lower performance in some scenarios. With Oracle, for example, a busy OLTP environment may need high performance across all cores at the same time, whereas a DBA building indexes or gathering statistics may prefer to boost the performance of a single thread or process for that one job. Power and performance are closely intertwined, and Intel processors give you fine-grained control over exactly how you balance the system between saving power and delivering high performance. In this post I will look at some of the CPU technologies involved, namely Enhanced Intel SpeedStep Technology (EIST), Intel® Turbo Boost Technology, C-states and P-states, together with the Linux In-Kernel Governors and cpuidle drivers. In a subsequent post I will run some tests with an Oracle database on Linux to try out some of the available configurations.


The best place to start is to find your CPU model in /proc/cpuinfo, as shown here, and look up its specifications. This is the example from my test system:


# cat /proc/cpuinfo | grep "model name"
model name  : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
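On a multi-core system /proc/cpuinfo repeats the model name once per logical CPU, so the grep above normally prints many identical lines. The helper below is a minimal sketch that prints it once. cpu_model is a hypothetical name, and the helper assumes the usual x86 "model name : ..." layout. The optional path argument exists only so the function can be tried on a saved copy of the file.

```shell
# Hypothetical helper: print the CPU model name once.
# Assumes the x86 /proc/cpuinfo layout ("model name<TAB>: <value>").
# Usage: cpu_model [CPUINFO_FILE]
cpu_model() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    # Strip everything up to the first colon, print the value, stop at the first match.
    awk '/^model name/ { sub(/^[^:]*:[ \t]*/, ""); print; exit }' "$cpuinfo"
}
```

On the test system this would print the single string Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz, ready to use when looking up the specifications.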


The first thing we can observe in the specifications is that the Clock Speed and the Max Turbo Frequency are different (2.70 GHz and 3.50 GHz respectively for the E5-2680). The Max Turbo Frequency is considerably higher than the Clock Speed, so we need to look at the technologies surrounding Turbo Boost to see whether selecting “Maximum Performance” really is the best option.


Introductory information on Turbo Boost tells us two important things at the highest level. Firstly, Turbo Boost is activated when the OS requests the highest processor performance state (P0). Secondly, the frequency actually reached depends on algorithms that evaluate the number of active cores and the processor's current, power and temperature. In turn, this tells us that we need to look into performance states, or P-states. It also tells us that the performance of one core depends partly on the settings and workload of the other cores on the same processor, which means we also need to look at the processor operating states, or C-states.


P-states are voltage/frequency pairings controlled through Enhanced Intel SpeedStep Technology (EIST). Therefore, if you disable EIST in the BIOS, you should also find that the BIOS disables any options to use Turbo Boost. With our example processor the “Max Turbo Frequency” is state P0, the “Clock Speed” is P1, and the other pairings are defined in the range P1 to Pn. The range P1 to Pn is controlled by the Linux operating system, and the active setting can be seen in /proc/cpuinfo. For example, the extract below shows two cores on the test system at different P-states.


[root@sandep1 ~]# cat /proc/cpuinfo | grep -i mhz
cpu MHz           : 1200.000
cpu MHz           : 2701.000
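The "cpu MHz" lines in /proc/cpuinfo do not say which CPU they belong to. The helper below is a small sketch that pairs each value with its processor number. cpu_mhz is a hypothetical name, and the helper assumes that each block in /proc/cpuinfo starts with a "processor" line, as on x86. As the next paragraph explains, these values cover only the OS-controlled range P1 to Pn, not any Turbo frequency.

```shell
# Hypothetical helper: list "cpuN <MHz>" for every logical CPU.
# Assumes each /proc/cpuinfo block starts with a "processor : N" line (x86 layout).
# Usage: cpu_mhz [CPUINFO_FILE]
cpu_mhz() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    awk -F: '
        /^processor/ { cpu = $2 + 0 }      # remember which CPU this block describes
        /^cpu MHz/   { sub(/^[ \t]+/, "", $2); printf "cpu%d %s\n", cpu, $2 }
    ' "$cpuinfo"
}
```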



In contrast, the step from P1 to P0 is hardware-controlled. The frequency a core achieves depends on the factors mentioned previously, up to the Max Turbo Frequency, and this frequency is not visible through the /proc/cpuinfo interface. In a Linux environment P-states are controlled by the In-Kernel Governors. By default the ondemand governor is active, and the powertop utility shows the available and active frequencies across the system.


PowerTOP version 1.11      (C) 2007 Intel Corporation

Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        ( 8.7%)         2.71 Ghz    25.0%
polling           0.0ms ( 0.0%)         2.50 Ghz     0.0%
C1 mwait          0.2ms ( 0.1%)         2.00 Ghz     0.6%
C2 mwait          0.6ms ( 0.1%)         1500 Mhz     0.0%
C3 mwait          0.0ms ( 0.0%)         1200 Mhz    74.4%
C4 mwait         12.5ms (91.0%)


You can read and set the governor on a per-CPU basis as follows:


cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo "ondemand" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor


or across the whole system with a script. Alternatively, the cpufrequtils package on the install media provides the cpufreq-info and cpufreq-set commands. If you use these utilities, however, I recommend checking manually that you get the settings you have chosen.
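The "script" mentioned above can be a simple loop over the per-CPU cpufreq directories in sysfs. The helper below is a minimal sketch, not a tested tool:

* set_governor_all is a hypothetical name.
* It assumes the scaling_governor and scaling_available_governors files are present under each cpuN/cpufreq directory.
* It must run as root.
* The optional second argument, an alternative sysfs directory, exists only so the helper can be tried on a copy of the tree.

Following the advice above, it reads each value back after writing it.

```shell
# Hypothetical helper: set one cpufreq governor on every CPU, then verify it.
# Assumes the cpufreq sysfs layout (cpuN/cpufreq/scaling_governor etc.); run as root.
# Usage: set_governor_all GOVERNOR [SYSFS_CPU_DIR]
set_governor_all() {
    local gov="$1" base="${2:-/sys/devices/system/cpu}" dir rc=0
    for dir in "$base"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue
        # Skip CPUs whose driver does not offer the requested governor.
        if ! grep -qw -- "$gov" "$dir/scaling_available_governors"; then
            echo "${dir%/cpufreq}: governor '$gov' not available" >&2
            rc=1
            continue
        fi
        echo "$gov" > "$dir/scaling_governor"
        # Manual check: report any CPU where the setting did not take effect.
        if [ "$(cat "$dir/scaling_governor")" != "$gov" ]; then
            echo "${dir%/cpufreq}: governor is $(cat "$dir/scaling_governor")" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

For example, running set_governor_all performance as root would switch every CPU to the performance governor. A non-zero exit status means at least one CPU did not end up with the requested governor.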


In addition to ondemand there is the powersave governor, which sets the frequency to the lowest available, in this case 1200 MHz. In an Oracle environment it is highly unlikely that you would want this. More importantly, there is also the performance governor. With the performance governor set for all CPUs, /proc/cpuinfo in this example shows 2701.000 MHz for every CPU, and powertop shows 100% residency in a state labelled Turbo Mode.


PowerTOP version 1.11      (C) 2007 Intel Corporation

Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        ( 0.0%)       Turbo Mode   100.0%
polling           0.5ms ( 0.0%)         2.71 Ghz     0.0%
C1 mwait          0.2ms ( 0.1%)         2.50 Ghz     0.0%
C2 mwait          0.4ms ( 0.1%)         2.40 Ghz     0.0%
C3 mwait          0.0ms ( 0.0%)         2.31 Ghz     0.0%
C4 mwait         14.7ms (110.1%)


In other words, the operating system requests state P0 for all CPUs and passes control to the hardware. The hardware then activates Turbo Boost across its full range, depending on the factors noted previously. Those wishing to achieve “Maximum Performance” with minimum latency when switching between P-states can choose the performance governor. Those who also want to save power can stay with the default ondemand governor.
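To check the governor and P-state range per CPU without powertop, you can read the cpufreq sysfs files directly. They report frequencies in kHz. The helper below is a sketch; freq_report is a hypothetical name, and it assumes the scaling_driver, scaling_governor, cpuinfo_min_freq, cpuinfo_max_freq and scaling_cur_freq files are present.

With the acpi-cpufreq driver, the top advertised frequency is conventionally the Clock Speed plus 1 MHz (the 2701 MHz seen on the test system). That entry stands for the P0 Turbo request. How scaling_cur_freq is derived differs between drivers and kernel versions, so treat it as an indication of the requested P-state rather than a measurement of the Turbo frequency actually reached. For the actual frequency, use turbostat.

```shell
# Hypothetical helper: print driver, governor and P-state range per CPU (kHz -> MHz).
# Assumes the cpufreq sysfs files named below exist for your driver.
# Usage: freq_report [SYSFS_CPU_DIR]
freq_report() {
    local base="${1:-/sys/devices/system/cpu}" dir cpu
    for dir in "$base"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue
        cpu="${dir%/cpufreq}"
        cpu="${cpu##*/}"                   # e.g. "cpu0"
        printf '%s %s %s min=%d max=%d cur=%d MHz\n' "$cpu" \
            "$(cat "$dir/scaling_driver")" "$(cat "$dir/scaling_governor")" \
            $(( $(cat "$dir/cpuinfo_min_freq") / 1000 )) \
            $(( $(cat "$dir/cpuinfo_max_freq") / 1000 )) \
            $(( $(cat "$dir/scaling_cur_freq") / 1000 ))
    done
}
```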


The powertop output also shows additional CPU operating states called C-states. C-state C0 is the state in which the CPU is running, so all P-states operate within C0. The other C-states, C1 to Cn, record when the CPU is idle and not executing instructions. To save energy, parts of the CPU can be powered down. A deeper C-state saves more power but takes longer to return to C0. As with EIST, you can set C-states in the system BIOS. Interestingly, the manual for the system with the “Maximum Performance” BIOS option has this entry:


Minimum Processor Idle Power State = No C-states


However, as we have seen, the hardware-controlled Turbo Frequency in P0 depends partly on the number of “active” cores. With the further detail we now have, that means it depends on the C-state residency of those cores. Choosing “No C-states” could therefore limit the ability to reach the full Max Turbo Frequency and reduce performance in some scenarios.


We need more information, however. Even if you have chosen “No C-states” in the BIOS, an up-to-date Linux OS on Intel processors will go ahead and use them anyway unless you add further kernel parameters. This is because by default the cpuidle driver that manages processor idle states is intel_idle, which uses C-states irrespective of the BIOS settings. You should see messages in your dmesg output such as:


using mwait in idle threads
ACPI: acpi_idle yielding to intel_idle
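Instead of searching dmesg, you can also read the active cpuidle driver from sysfs. The helper below is a minimal sketch. It assumes the kernel provides /sys/devices/system/cpu/cpuidle/current_driver. cpuidle_driver is a hypothetical name, and its optional argument is an alternative sysfs directory used only for testing.

```shell
# Hypothetical helper: report which cpuidle driver manages idle states.
# Assumes /sys/devices/system/cpu/cpuidle/current_driver exists on this kernel.
# Usage: cpuidle_driver [SYSFS_CPU_DIR]
cpuidle_driver() {
    local f="${1:-/sys/devices/system/cpu}/cpuidle/current_driver"
    if [ -r "$f" ]; then
        cat "$f"                           # e.g. intel_idle or acpi_idle
    else
        echo "none"                        # no driver registered, or file not provided
        return 1
    fi
}
```

On a system configured like the test system, this would be expected to print intel_idle. A value of acpi_idle would indicate the fallback driver.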


Using the turbostat utility from the pmtools power management package, you can observe the C-state residency of each core. For simplicity, I have truncated the output to remove the package C-states and show only the core C-states.


pkg core CPU   %c0   GHz  TSC   %c1    %c3    %c6   %c7
               0.04 2.31 2.69  20.19   0.11   0.00  79.66
   0   0   0   0.11 2.42 2.69   5.39   0.31   0.00  94.20
   0   1   1   0.10 2.72 2.69   1.64   0.84   0.00  97.41
   0   2   2   0.02 2.93 2.69   0.03   0.00   0.00  99.95
   0   3   3   0.00 2.41 2.69   4.24   0.00   0.00  95.76
   0   4   4   0.00 2.30 2.69  12.34   0.00   0.00  87.66
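With many cores the turbostat table gets long. The helper below is a sketch that extracts one column per CPU from saved turbostat output; turbostat_column is a hypothetical name. It assumes the column layout shown above, where the summary row lacks the pkg, core and CPU fields, so it counts columns from the right-hand end of each line. Headings differ between turbostat versions, so check them against your own output.

```shell
# Hypothetical helper: print one turbostat column (e.g. %c7) per CPU.
# Assumes a header line starting with "pkg" and a summary row without pkg/core/CPU.
# Usage: turbostat_column COLUMN < saved-turbostat-output
turbostat_column() {
    awk -v want="$1" '
        $1 == "pkg" {                      # header line: locate the wanted column
            ncols = NF; idx = 0
            for (i = 1; i <= NF; i++) if ($i == want) idx = i
            next
        }
        ncols && idx && NF > 0 {
            j = NF - (ncols - idx)         # same column, counted from the right
            if (j < 1) next                # summary row has no pkg/core/CPU value
            label = (NF == ncols) ? "cpu" $3 : "all"
            print label, $j
        }
    '
}
```

For example, feeding the output above to turbostat_column %c7 would list the C7 residency of each CPU, with the summary row labelled all.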


You can see that, although the performance governor has been set, by default all of the cores enter the deep C-state C7 when idle. If you want to control which C-states are entered, rather than relying on the BIOS setting, you can use the kernel parameter intel_idle.max_cstate. Note that this parameter counts C-states in ACPI-style numbering rather than by the hardware names reported by turbostat. With the example processor, intel_idle.max_cstate=2 therefore ensures that the processor goes no deeper than C-state 3 (%c3 in turbostat), and intel_idle.max_cstate=3 ensures no deeper than C-state 6 (%c6). You can compare the exit latencies of the C-states, measured in microseconds, under /sys/devices/system/cpu/cpu0/cpuidle:


[root@sandep1 cpuidle]# cat ./state2/latency


[root@sandep1 cpuidle]# cat ./state3/latency
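Rather than reading each state in turn, a loop can print every idle state of a CPU with its name, exit latency and accumulated residency time, both in microseconds. The helper below is a sketch; cstate_table is a hypothetical name. It assumes the standard cpuidle sysfs layout of stateN directories containing name, latency and time files. The state names vary with the driver and kernel version.

```shell
# Hypothetical helper: list each idle state of one CPU with exit latency and residency.
# Assumes cpuN/cpuidle/stateN/{name,latency,time} in sysfs (both values in microseconds).
# Usage: cstate_table [CPU_NUMBER] [SYSFS_CPU_DIR]
cstate_table() {
    local dir="${2:-/sys/devices/system/cpu}/cpu${1:-0}/cpuidle" s
    for s in "$dir"/state[0-9]*; do
        [ -d "$s" ] || continue
        printf '%s %s latency=%sus time=%sus\n' "${s##*/}" \
            "$(cat "$s/name")" "$(cat "$s/latency")" "$(cat "$s/time")"
    done
}
```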



As we saw previously, the mwait instruction is used in idle threads, which means that C-state C1 is entered whenever the core is idle, irrespective of other settings. Setting intel_idle.max_cstate=0, however, disables intel_idle rather than restricting the C-states to C0. The kernel then falls back to acpi_idle, which is not aware of individual core C-states, only those at the package level. When mwait is used, C1 will be entered even with intel_idle disabled. Only an additional parameter such as “idle=poll” keeps the CPU in C0 at all times, and it uses considerably more power to do so.


pkg core CPU   %c0   GHz  TSC   %c1    %c3    %c6    %c7
               6.15 3.04 2.69  93.85   0.00   0.00   0.00
   0   0   0   0.61 3.07 2.69  99.39   0.00   0.00   0.00
   0   1   1   1.12 3.09 2.69  98.88   0.00   0.00   0.00
   0   2   2  92.89 3.09 2.69   7.11   0.00   0.00   0.00


The disadvantage, however, is that setting intel_idle.max_cstate to any value lower than 2 prevents idle cores from entering a C-state deep enough to let another core in the same package reach the full turbo frequency.
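To confirm what is actually in effect after a reboot, you can read the intel_idle limit from /sys/module/intel_idle/parameters/max_cstate and any idle= override from /proc/cmdline. The helper below is a sketch; idle_settings is a hypothetical name. It assumes intel_idle is built in or loaded, since otherwise the parameter file is absent. Its optional argument is an alternative root directory used only for testing.

```shell
# Hypothetical helper: show the intel_idle C-state limit and any idle= override.
# Assumes /sys/module/intel_idle/parameters/max_cstate exists when intel_idle is present.
# Usage: idle_settings [ROOT]   (ROOT defaults to the live system)
idle_settings() {
    local root="${1:-}" param idle
    param="$root/sys/module/intel_idle/parameters/max_cstate"
    if [ -r "$param" ]; then
        echo "intel_idle.max_cstate=$(cat "$param")"
    else
        echo "intel_idle.max_cstate=unavailable"   # intel_idle not present
    fi
    # /proc/cmdline is one line of space-separated arguments; keep the last idle=.
    idle=$(tr ' ' '\n' < "$root/proc/cmdline" | sed -n 's/^idle=//p' | tail -n 1)
    echo "idle=${idle:-default}"
}
```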


A further CPU setting is the Energy/Performance Bias. Red Hat and Oracle Linux users should note that its default changed in the Linux kernel between Red Hat/Oracle Linux 5 and Red Hat/Oracle Linux 6. (Some system BIOSes include an option to prevent the OS from changing this value.) In release 5, Linux did not set a value, so it remained at 0, a bias towards performance. In release 6 this behaviour changed: the kernel sets a mid-range value that moves the bias towards conserving energy. (Remember that the same Linux kernel runs on ultrabooks as well as servers; on my ultrabook I use powertop and the other Linux tools and configurations discussed here to maximise battery life.) The kernel reports the change in the dmesg output at boot:


ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)


One effect of this setting can be on how long a process runs before the full Turbo Boost frequency is used. You can change it with the x86_energy_perf_policy tool, which can also read the current energy performance policy:


[root@sandep1 x86_energy_perf_policy_tool]# ./x86_energy_perf_policy -r
cpu0: 0x0000000000000006
cpu1: 0x0000000000000006
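The value read back is the IA32_ENERGY_PERF_BIAS register (MSR 0x1b0, as the tool's verbose output shows). Only its low four bits carry the hint, which runs from 0 (highest performance) to 15 (maximum energy saving). The policy names used by x86_energy_perf_policy correspond to 0 (performance), 6 (normal) and 15 (powersave). The helper below is a small sketch that decodes such a value; epb_name is a hypothetical name.

```shell
# Hypothetical helper: decode an IA32_ENERGY_PERF_BIAS value such as 0x0000000000000006.
# Assumes the hint is in bits 3:0 (0 = favour performance, 15 = favour energy saving).
# Usage: epb_name HEX_VALUE
epb_name() {
    local hint=$(( $1 & 0xf ))
    case "$hint" in
        0)  echo "performance ($hint)" ;;
        6)  echo "normal ($hint)" ;;
        15) echo "powersave ($hint)" ;;
        *)  echo "custom ($hint)" ;;
    esac
}
```

For example, piping the -r output above through `while read -r cpu val; do echo "$cpu $(epb_name "$val")"; done` would label both CPUs as normal (6).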


You can also use the tool to set a lower value, moving the bias entirely towards performance (the default release 5 behaviour):


[root@sandep1 x86_energy_perf_policy_tool]# ./x86_energy_perf_policy -v performance
CPUID.06H.ECX: 0x9
cpu0 msr0x1b0 0x0000000000000006 -> 0x0000000000000000
cpu1 msr0x1b0 0x0000000000000006 -> 0x0000000000000000
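You can also check the write without the tool. The Linux msr driver exposes each CPU's MSRs as /dev/cpu/N/msr, with the MSR address used as the file offset. The helper below is a sketch; read_epb is a hypothetical name. It assumes the msr module is loaded (modprobe msr), root access and a little-endian host. The rdmsr command from the msr-tools package is an alternative. After the change above, the helper would be expected to print 0x0000000000000000.

```shell
# Hypothetical helper: read IA32_ENERGY_PERF_BIAS (MSR 0x1b0) through the msr driver.
# The MSR address is the file offset; 0x1b0 (432) is a multiple of 8, so skipping
# 54 blocks of 8 bytes lands on it. Assumes a little-endian host for od -t x8.
# Usage: read_epb [CPU_NUMBER] [MSR_FILE]
read_epb() {
    local dev="${2:-/dev/cpu/${1:-0}/msr}" val
    val=$(dd if="$dev" bs=8 count=1 skip=$(( 0x1b0 / 8 )) 2>/dev/null | od -An -t x8 | tr -d ' \n')
    [ -n "$val" ] || { echo "cannot read MSR 0x1b0 from $dev" >&2; return 1; }
    printf '0x%s\n' "$val"
}
```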


In summary, if you want a “Maximum Performance” configuration for Oracle, your first step is to find out the potential of the processors you are using. Then ask your vendor which parameters the high-level BIOS settings actually set, as different vendors may enable different parameters. If you wish to use Turbo Boost, do not disable EIST or C-states in the BIOS, and do not disable the intel_idle cpuidle driver. You can use the intel_idle.max_cstate kernel parameter if you wish to control C-state behaviour. Always test and observe before making changes, bearing in mind that letting some cores use deeper C-states may allow other cores to reach higher Turbo Frequencies for longer. Note also that the default system and Linux settings are often a good starting point for a balanced configuration. On Linux you can use turbostat, powertop and x86_energy_perf_policy to observe and modify your system's behaviour. In the next post I will run some simple tests to observe the configurations discussed here.

