1 2 Previous Next

The Data Stack

25 Posts authored by: steveshaw

Writing this blog post on St Patrick’s Day I am reminded through the St Patrick’s Day: Google Doodle that amongst other things St Patrick’s Day is a day that brings Myths and Legends to the fore and is therefore an ideal day to look at Oracle redo on SSDs – surely one technology topic that provides a great deal of conflicting opinion.  For example the Oracle support document Troubleshooting: "log file sync" Waits (Doc ID 1376916.1) describes the following issue:

 

If the proportion of the 'log file sync' time spent on 'log file parallel write' times is high, then most of the wait time is due to IO … As a rule of thumb, an average time for 'log file parallel write' over 20 milliseconds suggests a problem with IO subsystem.

 

and includes the following as one of the recommendations:

 

  • Do not put redo logs on Solid State Disk (SSD)
  • Although generally, Solid State Disks write performance is good on average, they may endure write peaks which will highly increase waits on 'log file sync'

 

This seems unequivocal, firstly if your log file parallel write time is on average (and not peak) greater than 20 milliseconds you should tune your IO, this I can agree with, however “Do not put redo logs on Solid State Disk (SSD)”  is worth investigating to see if it is really based on established fact.  After all I have used SSDs exclusively for Oracle workloads for the last 5 years, starting with the Intel® SSD X25-E Series in 2009 and currently with the Intel® SSD 910 Series and DC S3700 Series. I have also worked with other storage for Oracle but always an SSD based solution.  Not only do SSDs have excellent latency characteristics, and the modern endurance of Intel SSDs measured in DWpD essentially eliminates concerns over endurance, also modern SSD’s are very, very good at data path protection exactly what you need for high intensity redo.  Nevertheless Oracle Support are specific on the issue they have identified that SSD write performance may be good on average, however it is “write peaks” that impact log file sync times and using the rule of thumb of an average time for 'log file parallel write' over 20 milliseconds we would expect these “write peaks” to exceed 20 milliseconds to cause these concerns.  This is something we can test for to see if there is truth behind the myths and legends that always surround disruptive technologies.

 

Log File Sync and Log File Parallel Write

 

Stepping back a bit it is worth clarifying the relationship between ’log file sync’ and ‘log file parallel write.  ’Log file sync’ is the time elapsed by a foreground session after waiting for the redo it has put in the memory resident redo log buffer to be flushed to disk when it issues a COMMIT (or ROLLBACK) to make the transaction permanent and ‘log file parallel write’ which is the time it takes for the redo to be written to disk by the background process.  Prior to 12c this background process doing the writing was the log writer (LGWR) however at 12c the’ log file parallel writes’ are performed by a number of log writer workers (LGnn) with LGWR time elapsed under the ‘target log write size’ event.   Historically the log writer and foreground processes have communicated with a post/wait mechanism using semaphores (semctl) however a more recent parameter _use_adaptive_log_file_sync  being set to true by default since 11g indicates that polling may be used by the foreground processes instead with the method used dynamically selected. What this means is that whereas ‘log file parallel write’ is the actual write to disk, ’log file sync’ also includes the scheduling of and communication between the foreground and background processes and therefore on a busy system most of the time spent in ’log file sync’ by the foreground process may not be waiting for the ‘log file parallel write’ by the background process. However if there are significant “write peaks” then there may be a number of foreground processes waiting in ’log file sync’ for that write to complete exacerbating the elapsed time.  Consequently what we want to do is capture the time spent on ‘log file parallel write’ by the background processes on SSD storage to observe whether it  will highly increase waits on 'log file sync'.

 

Capturing Log File Parallel Write

To do this I used a 4 socket E7 v2 system to drive a significant level of throughput through Oracle 12c running on Oracle Linux 6.5.  It is worth noting that the E7 v2 includes the Intel® Integrated I/O  and Intel® Data Direct I/O Technology features and therefore the CPU latency aspect of I/O is further minimised.  For the SSD storage I used 2 x Intel® SSD 910 Series configured with Oracle ASM as per the post referenced here. Naturally I used HammerDB for the workload and configured the redo logs of a size to ensure log file switch checkpoint activity during the test.

 

I could list the Log Writer and Log Writer worker processes as follows:

 

oracle   127307      1  0 10:23 ?        00:00:00 ora_lgwr_IVYEXDB1

oracle   127312      1  0 10:23 ?        00:00:01 ora_lg00_IVYEXDB1

oracle   127316      1  0 10:23 ?        00:00:00 ora_lg01_IVYEXDB1

oracle   127320      1  0 10:23 ?        00:00:00 ora_lg02_IVYEXDB1

oracle   127322      1  0 10:23 ?        00:00:00 ora_lg03_IVYEXDB1

 

and begin a trace of all of the Log Writer Workers as follows with the example for process ora_lg03_IVYEXDB1 above.

 

[oracle@ivyex1 ~]$ sqlplus sys/oracle as sysdba

SQL> oradebug setospid  127322;

Oracle pid: 21, Unix process pid: 127322, image: oracle@ivyex1.example.com (LG03)

SQL> oradebug event 10046 trace name context forever, level 8;

Statement processed.

 

I ran a test for 10 minutes with a number of virtual users to run the system at a high CPU utilisation generating a significant number of transactions and then when complete stopped the trace as follows before collecting the trace files from the trace directory:

 

SQL> oradebug event 10046 trace name context off;

Statement processed.

 

Looking in the trace file it shows the timing of the event 'log file parallel write' that we are interested in as shown below:

 

WAIT #0: nam='log file parallel write' ela= 501 files=1 blocks=58

WAIT #0: nam='LGWR worker group idle' ela= 39

WAIT #0: nam='log file parallel write' ela= 466 files=1 blocks=52

WAIT #0: nam='LGWR worker group idle' ela= 33

WAIT #0: nam='log file parallel write' ela= 368 files=1 blocks=54

 

Of course on of the advantages of HammerDB is that with a scripted interface you are not restricted to simply running the pre-built workloads, you can run any workload you choose. Therefore in the Script Editor window of HammerDB I entered the following to process the trace files and ran it to extract the elapsed time into a CSV format.

 

#tclsh

set filename "lg00.trc"

set filename2 "output00.csv"

set fid [open $filename r]

set fid2 [open $filename2 w]

set elapsed 0

set maxelapsed 0

set count 0

set overmilli 0

set over10milli 0

set over20milli 0

while {[gets $fid line] != -1} {

if {([string match {*log\ file\ parallel\ write*} $line])} {

incr count

regexp {ela=\ ([0-9]+)\ } $line all elapsed

puts $fid2 "$count,$elapsed"

if { $elapsed > 1000 } {

incr overmilli

if {  [ expr $elapsed > $maxelapsed ] } { set maxelapsed $elapsed }

if { $elapsed > 10000 } {

incr over10milli

if { $elapsed > 20000 } {

incr over20milli

                 }

             }

         }

     }

}

puts "max elapsed was [ format "%.2f" [ expr {double($maxelapsed)} /1000 ]] millisecond"

puts "[ format "%.2f" [ expr {double($overmilli) / $count} * 100 ]]% over 1 millisecond"

puts "[ format "%.2f" [ expr {double($over10milli) / $count} * 100 ]]% over 10 millisecond"

puts "[ format "%.2f" [ expr {double($over20milli) / $count} * 100 ]]% over 20 millisecond"

close $fid

close $fid2

 

The output from the script was summarised as follows (Note that as shown below nearly all of the workload went through workers LG00 and LG01 and LG02 and LG03 were mostly idle) :

 

Log Writer Worker

Over  1ms

Over 10ms

Over 20ms

Max Elapsed

LG00

0.56%

0.02%

0.00%

13.66ms

LG01

0.39%

0.01%

0.00%

13.44ms

LG02

7.96%

0.07%

0.00%

13.05ms

LG03

7.07%

0.08%

0.00%

13.00ms

 

And  I verified that the output from the trace files corresponded with the output from v$event_histogram that over 99% of the redo writes were completed in less than 1 millisecond and the maximum elapsed write time was consistent around 13 milliseconds but also was an extremely small proportion of all the writes.

 

SQL> select wait_time_milli, wait_count from v$event_histogram where event = 'log file parallel write';

 

WAIT_TIME_MILLI WAIT_COUNT

--------------- ----------

            1    2460371

            2       6605

            4       1702

            8       1774

           16        726

 

Of course though how busy was the system?  The AWR report shows a load average around 75%

Host CPU

CPUs

Cores

Sockets

Load Average Begin

Load Average End

%User

%System

%WIO

%Idle

120

60

4

75.13

114.68

73.1

5.1

0.5

21.2

 

And a fairly busy redo rate of almost 340MB/sec.  Given the redo per transaction and the redo per second it is clear that this system is processing tens of thousands of transactions a second and millions of transactions of a minute.


Load Profile

Per Second

Per Transaction

DB Time(s):

160.0

0.0

DB CPU(s):

93.2

0.0

Redo size (bytes):

339,512,040.8

5,330.1

 

And the total time spent waiting for ‘log file sync’


Foreground Wait Events

Event

Waits

%Time -outs

Total Wait Time (s)

Avg wait (ms)

Waits /txn

% DB time

log file sync

26,850,803

38,253

1

0.70

39.68

 

was considerably greater than the ‘log file parallel write’ component and therefore consistent with high performance IO with most ‘log file sync’ time spent in communication between foreground and background processes.


Background Wait Events

Event

Waits

%Time -outs

Total Wait Time (s)

Avg wait (ms)

Waits /txn

% bg time

log file parallel write

1,958,168

0

763

0

0.05

125.03

target log write size

977,630

0

574

1

0.03

94.05

 


I then loaded the CSV file into a spreadsheet, highlighted the data and selected scatter plot. Although they don’t overlay perfectly on a time basis the data is close enough to combine for LG00 and LG01 and LG02 and LG03 respectively, noting that LG00 and LG01 were considerably busier workers than LG02 and LG03. The data for LG00 and LG01 is here:


lg0001.PNG


and the number of blocks written for LG00 only to indicate write latency against time as follows:


lg00blk.PNG


And to complete the picture the latency data for LG02 and LG03 is here:


lg0203.PNG

 

Clearly the latency is as we would expect strongly related to the amount of redo written and typically and on average the response time for redo writes is sub-millisecond even though we are running at a throughput of millions of transactions a minute and generating over 20GB of redo per minute.  Where there are longer latencies these are typically in the order of 12 to 13ms for 0.01 to 0.02% of writes so  certainly not a factor to highly increase waits.  Furthermore, cross-referencing with iostat data even at this rate of redo the average disk utilization was still only around 10%.

 

Running a 3 Hour Workload

 

Of course maybe a 10 minute test is not sufficient, so I then ran the same configuration for 3 hours generating as shown 342MB/sec of redo with 4.7 Terabytes in total.

 

Function Name

Writes: Data

Reqs per sec

Data per sec

Waits: Count

Avg Tm(ms)

LGWR

4.7T

11574.37

342.071M

49M

0.01

DBWR

1.5T

7746.17

110.681M

0

 

And the waits on ‘log file parallel write’?   proportionally the same as before.

 

 

SQL> select wait_time_milli, wait_count from v$event_histogram where event = 'log file parallel write'

 

WAIT_TIME_MILLI WAIT_COUNT

--------------- ----------

            1   51521618

            2     165210

            4      36313

            8      36242

           16      18478

 

 

COMMIT_WRITE and COMMIT_LOGGING

 

There is one further test we can do.  By setting the following parameter as follows:

 

commit_logging

BATCH

commit_wait

NOWAIT

 

redo will continue to be generated however with commit_logging set to ‘batch’ the foreground process will not notify the log writer to write its redo and with commit_wait set to ‘nowait’ the foreground process will also not wait for the log writer to notify it that the redo has been written. In other words these settings remove the scheduling aspect of ‘log file sync’ and therefore if the disk is not a bottleneck we should see throughput increase while noting with batches that the writes will be larger.  Sure enough this gives a significant increase in throughput with Oracle now writing over 400MB/sec of redo and CPU utilisation at 97%.

 

Per Second

Per Transaction

Redo size (bytes):

431,002,118.4

5,336.1

 

In other words removing the waits for scheduling meant that the SSDs could easily cope with higher levels of throughput.

 

V$LGWRIO_OUTLIER

 

In on my previous posts on Redo on SSD I mentioned a new performance view in 12c called V$LGWRIO_OUTLIER. This view reports log writer writes that take over 500ms. I touched on this in passing but didn’t investigate further so decided to take another look and sure enough the new  view had entries.  This is not the only place that long writes to the log file are reported. A trace file is also generated when a write to the log file takes more than 500ms for example:

 

Warning: log write elapsed time 780ms, size 4KB

 

So it would be reasonable to expect these to correlate and for a trace file to be generated for each entry in V$LGWRIO_OUTLIER.  It does not help that the documentation is not explicit about the precise time value that this view mentions however from seeing related output it seems reasonable  that the IO_LATENCY value is reported in milliseconds so we should expect entries above the value of 500. Additionally the view does not report a timestamp for each entry however the underlying x$ view does.

 

SQL> select view_definition from v$fixed_view_Definition where view_name ='GV$LGWRIO_OUTLIER';

 

So querying these ordered by timestamp shows the following result:

 

SQL> select IO_SIZE, IO_LATENCY, TIMESTAMP from X$KSFDSTLL order by timestamp;

 

   IO_SIZE IO_LATENCY  TIMESTAMP

---------- ---------- ----------

      32 2694340698 1394629089

      72 2695317938 1394629186

      28 2696135038 1394629268

      40 2696636948 1394629318

      64 2697984098 1394629454

      52 2698638788 1394629519

      68 2699724048 1394629628

      56 2699982768 1394629653

      24 2700330618 1394629688

      32 2752639988 1394634918

      72 2752946678 1394634949

      36 2753861848 1394635041

     108 2754161328 1394635071

      64 2754341738 1394635089

 

and if the value is in milliseconds then 2754341738 milliseconds is approximately equivalent to 1 month! and the latency is increasing every time. If we cross reference this against the trace data, the AWR report, event histogram and log write trace files and the only correlation appears to be with the timestamp.  Just to be certain after running the 3 hour test  the outlier data showed  396 entries (note that V$LGWRIO_OUTLIER only shows the last 180) all in ascending order.

 

SQL> select IO_SIZE, IO_LATENCY, TIMESTAMP from X$KSFDSTLL order by timestamp;

   IO_SIZE IO_LATENCY  TIMESTAMP

---------- ---------- ----------

      40 3471738158 1394706828

      44 3472415808 1394706896

      52 3472562658 1394706911

      16 3475418188 1394707196

      16 3475860828 1394707240

      40 3477258658 1394707381

      …

     128 3626810268 1394722336

     256 3627883738 1394722444

      20 3627919588 1394722447

     128 3628119698 1394722467

     320 3628620898 1394722517

     112 3629137298 1394722569

 

396 rows selected.

 

This is sufficient evidence for me to suggest that for the time being V$LGWRIO_OUTLIER on Linux should not be relied upon for measuring IO Latency or at the very least there is insufficient documentation to accurately interpret what it is meant to show.

 

Conclusion

 

So in summary should you put Oracle redo on SSD?  If you want sub-millisecond response times  coupled with high levels of data path protection  then surely the answer is yes. In fact more to the point a modern SSD can handle redo with such high throughput and low latency that the ‘log file parallel write’ component of ‘log file sync’ and therefore the SSD write performance even with systems generating redo at very high rates is not the determining factor.  If it is then I would recommend reviewing how to correctly configure Oracle redo on SSD before taking your own measurements of Oracle redo performance on SSD.

If you work with mission critical databases you won’t have missed the launch of the Intel® Xeon® Processor E7 v2 Family with up to double performance over the previous generation.   Benchmarks can be useful in comparing and contrasting systems and these show that E7 v2 has Up to 80% higher performance than IBM POWER7+ at up to 80% lower cost . Nevertheless working directly with customers evaluating database performance I often find that published benchmarks provide some but not all of the answers that database professionals are looking for including single and multi-threaded performance, power efficiency, platform cost, software cost, reliability and operating system and virtualization software choice.

 

As a consequence especially with a noticeable long term decline in the frequency of published benchmarks more and more customers have an engineering team to get ‘hands-on’ with evaluating systems for themselves.  It is therefore great to see a white paper from Principled Technologies that shows the approach to do just that illustrating how a system built on the Intel Xeon processor E7-4890 v2 is a much better value proposition than POWER7+ for running the Oracle database. The white paper shows that the Xeon E7 v2 system has 69% lower hardware purchase costs, up to 42% lower power consumption under load and 40% under idle and 16% higher performance at equivalent load with twice the headroom to grow.  All of this contributes to 5.7x performance/watt advantage for Xeon E7 v2.

 

More importantly for anyone taking the approach of ‘hands on’ evaluation the white paper includes all the details required for any database engineer to run their own equivalent in-house evaluation and not forgetting SPARC you can run equivalent tests against any system that runs  Oracle or other databases.  No-one should have to accept benchmark data prepared by competitive analysts and published without full details of system and database configuration.

 

I have to admit that as the author of HammerDB (Intel enabled engineers to develop open source projects long before it was fashionable)  the load testing tool used in the white paper I especially like the open methodology as it aligns with the intentions for developing such a tool.  Firstly being open source all of the code right down to the proprietary database drivers is published empowering the user. If you don’t like the workload you are free to change it – you can even write a whole new workload if you wish (HammerDB’s conversion of Oracle trace files makes this easy) and then contribute that workload back to the database testing community.

 

Single-Threaded Performance

 

With this approach you may see from previous blog posts that typically the first test run I run is a PL/SQL CPU routine (or T-SQL on SQL Server) to test single-threaded performance and verify the system BIOS and operating system settings.  Running this on an E7-4890 v2 gives me the following result:

 

Res = 873729.72

PL/SQL procedure successfully completed.

Elapsed: 00:00:07.88

 

And comparing this with previous results shows good incremental gains and if you look at the other results for Intel Core then there are indications of further gains from future Xeon products from Intel’s Tick-Tock model.  It would be great if you could fully test a database system in under 10 seconds however such a single-threaded test is only a first indicator, a small piece of the puzzle in evaluating database performance. Of course such as test tells us nothing of the platform scalability running multiple threads however the historical results show that the single-threaded performance is improving in step with the Tick-Tock model and you would expect better single-threaded performance from a processor that completes this test in a shorter time.

 

Multi-Threaded Performance

 

This is where HammerDB comes in. You can either run your own workload or one based on the specifications of industry standard benchmarks. The advantage of these specifications is that they are relatively straightforward to implement but more importantly they are designed to scale and have proven over time that they do scale so as long as the database software and system scales, then you have a workload that scales as well. Importantly it is not the absolute performance but the relative performance that is important and your aim should be to generate a Performance Profile.  What this means is that you should test your system at ever increasing increments of load until it is fully saturated.  HammerDB includes a CPU monitor and at the top of your performance curve with a scalable workload and sufficient I/O your CPU utilisation should look as follows:

 

orp1.png

 

With this reference data you then have the background information to measure and test a median load that you would expect from a typical production database and this is the approach that the Principled Technologies white paper takes.

 

Power Consumption

 

As the white paper shows another important metric is the power consumed by the server for a relative level of performance.  At data centre scale power consumed is one of the most important metrics for server density. Starting with the processor the power related metric is called TDP (Thermal Design Power) and indicates the power dissipated at the operating temperature called Tcase, further information is available here.  So although TDP will not give you all the information on peak CPU power requirements it remains the most useful processor metric for comparison.  If you want to know the max TDP for the E7 v2 family they are all published here showing a Max TDP ranging from 105W to 155W. Unfortunately for doing an evaluation the same data is not published for the IBM Power family, for SPARC it is published on the datasheet for the SPARC T-4 at 240W however is omitted from the equivalent datasheet for the SPARC T-5.  Consequently the Principled Technologies approach is ideal measuring the entire server power both idle and under load – a 5.7x performance/watt advantage translates into a significant advantage for server density for a data centre running E7 v2 compared to Power7+.

 

Hardware and Software Cost and Choice

 

After having measured both performance and power you can then evaluate cost. This takes the form of evaluating both the hardware acquisition cost as well as the software cost both comprising the TCO (Total Cost of Ownership).   As the white paper shows there is a 3.2X advantage for the E7 v2 system compared to the IBM system in hardware acquisition cost however a significant component of the TCO is the software license cost.  Given that the Oracle Database license is calculated per core and the IBM Power system has 32 cores compared to the 60 on the Intel  E7 v2 system it is important to reference the Oracle Processor Core Factor Table, this shows that the Power 7+ attracts a core factor of 1.0 compared to the 0.5 factor for the E7 v2 – this means that as well as offering higher performance the E7 v2 system also has a lower Oracle software license cost.  Additionally for the core based cost sensitive with the E7 v2 there are further options up to the E7-8893V2, with 6 cores running at higher frequencies under the same TDP allows choice in managing core factor based software acquisition cost.   Furthermore, choosing E7 v2 means that the same industry standard platform is available from multiple vendors supporting multiple operating systems such as Linux, Windows and VMware,  relational databases such as SQL Server, PostgreSQL and MySQL and Big Data technologies such as Hadoop, giving unparalleled technology choice all helping to drive down acquisition cost and improve manageability by standardising systems and system administrator skills.

 

Reliability

 

Of course some attributes by their nature are more difficult to measure than others. RAS (Reliability, Availability and Serviceability) by definition is harder to quantify as it is harder to measure something that is expected not to happen (like a service outage)  rather than something that does. Therefore evaluating RAS for E7 v2 requires looking at the features of Intel® Run Sure Technology and the longer term uptime data proving out reliability in the animation.

 

Conclusions

 

In this post we’ve looked at how the white paper from Principled Technologies illustrates a methodology for bringing platform evaluation in-house for comparing database performance. We’ve seen how database benchmarks published along with full configuration information can provide useful indicators however the declining frequency of these publications and the availability of free and open source tools is giving rise to the increased popularity of this in-house testing to determine the ‘best-fit’ database platform freeing information and enabling better data centre technology choices than ever before.

In my previous posts How to Maximise CPU Performance for the Oracle Database on Linux and Testing C-State Settings and Performance with the Oracle Database on Linux I described the relation of C-States and P-States to database CPU performance on Linux.  For these I use an easy Oracle PL/SQL based test to measure performance and I have been asked if I use an equivalent T-SQL approach for SQL Server. The answer is yes so in this post I cover how to quickly verify that you are getting the maximum CPU performance for SQL Server from your system.


Firstly if you are not familiar with P-States and C-States or the importance of the correct BIOS settings then the posts referenced above provide the technical details relevant to both Linux and Windows and you should be able to translate the details between the two operating systems. For example in regards to Linux I reference the Powersave, Ondemand and Performance frequency scaling governors, whereas in Windows you have the Power Saver, Balanced and High Performance Power Plans. The similarity is not coincidental and nor is the fact that the default plans are Ondemand and Balanced in Linux and Windows respectively.   Similarly for Linux I use the PowerTop utility to observe current C and P-States. In Windows if you run the command “powercfg –energy” you can observe on up to date systems an “Idle State Type of ACPI Idle (C) States” and “Performance Controls Type ACPI Performance (P)” confirming that the operating system is taking advantage of the same underlying technology.


With a common background established you should be aware that the use of Turbo Boost Technology is going to play just as important a factor on Windows as it does on Linux for performance and therefore the next place to look is ark.intel.com. Find your CPU model from your system information (such as right-clicking on “This PC”) and look up the specifications, here is the example from my test system:

 

 

ssp1.png



Even though this CPU is detailed as intel ® Core ™ i7-4770K CPU @ 3.50GHz in certain circumstances this CPU can run at 3.9GHz and therefore the next step is to identify the correct utilities that can show the active frequencies whilst noting that with multiple cores the individual cores can and will be able to run a different frequencies from the cores in the same and different sockets in the same server.  Within the updated Task Manager with Windows 8 and Windows Server 2012 the Turbo Boost frequency is shown, with previous and current versions of Windows the downloadable utility CPU-Z will also show actual frequencies therefore I will use this tool for measuring SQL Server performance.

 

Moving over to SQL we want a simple T-SQL routine that will run on a single core or thread using maximum frequency to test the CPU configuration. Such a routine written in PL/SQL is described for Oracle and therefore an equivalent port of this routine to SQL Server is shown below:

 

 

USE [tpcc]

GO

SET ANSI_NULLS ON

GO

CREATE PROCEDURE [dbo].[CPUSIMPLE] 

AS

   BEGIN

      DECLARE

         @n numeric(16,6) = 0,

         @a DATETIME,

         @b DATETIME

      DECLARE

         @f int

      SET @f = 1

      SET @a = CURRENT_TIMESTAMP

      WHILE @f <= 10000000 

         BEGIN

      SET @n = @n % 999999 + sqrt(@f)

            SET @f = @f + 1

         END

         SET @b = CURRENT_TIMESTAMP

         PRINT 'Timing = ' + ISNULL(CAST(DATEDIFF(MS, @a, @b)AS VARCHAR),'')

         PRINT 'Res = ' + ISNULL(CAST(@n AS VARCHAR),'')

   END

 

 

Being close to the PL/SQL original here the routine completes in similar times to the Oracle based original on the same system and therefore also provides a first test as an indication of the potential performance gains compared to databases running on RISC based architectures.

 

To create the procedure enter the T-SQL above in a query window specifying the database in which to create the routine (In this case [tpcc]) and click on Execute.  This creates a stored procedure called dbo.CPUSIMPLE in your database.  To run the stored procedure right click on the procedure name and select “Execute Stored Procedure …” This brings up a New Query with the following and begins to Execute.

 

 

USE [tpcc]

GO

DECLARE       @return_value int

EXEC   @return_value = [dbo].[CPUSIMPLE]

SELECT 'Return Value' = @return_value

GO

 

 

When complete, if successful under Results it shows a Return Value of 0 and under Messages it shows the result and completion time, for example:


Timing = 6500

Res = 873729.721235

(1 row(s) affected)



The Result should always be the same and it is the completion time we are interested in – in this case 6.5 seconds.  This may vary slightly but only by a narrow margin.

 

We can now modify the settings in the Power Plan (accessed under Control Panel\Hardware and Sound\Power Options) and observe the impact on SQL Server performance when running the routine by using Tasking Manager and or CPU-Z.  By default on the test system it is running the Balanced power plan and at idle the frequency (identified under Clocks (Core #0) noting that this is only measuring the first core) is lower than the nominal frequency.  From the previous posts we can identify that under the Balanced plan this is the impact of EIST (Enhanced Intel SpeedStep Technology) dynamically adjusting processor voltage and frequency.  In the example below the Reference or Base Clock (BLCK) is set at 100MHz (On K series processors as with this example this can be modified however on production based systems this  will be fixed) and the multiplier has been dynamically reduced to x 8 .0 resulting in a core frequency of 800 MHz.

 

 

ssp2.png



CPU-Z reports that the multiplier can be increased dynamically to x 39.0 which not surprisingly corresponds to a core frequency of 3.9GHz (which is the Max Turbo Frequency).  The actual multiplier will depend upon the number of active cores and in this example is x 37.0 for 4 active cores, x 38.0 for 3 active cores and x 39.0 for both 2 and 1 active cores and therefore depending on the overall processor load the peak turbo frequency will adjust dynamically above the maximum non turbo ratio which is x 35.0.


Given our T-SQL test running on a single core we can expect EIST to dynamically increase the multiplier and voltage (given power and temperature constraints) and as shown this increases up to the maximum Turbo Frequency of 3.9GHz

 

 

ssp3.png



This confirms that the current BIOS and operating system settings are correctly using Turbo Boost with a result in this case as follows:



Timing = 6406

Res = 873729.721235

(1 row(s) affected)


If we set the Power Plan to High Performance then the operating system requests that the CPU runs at the maximum frequency. The impact that this has on performance depends on the Processor, BIOS and Operating system.  Even if the full Turbo Mode frequency is reported by CPU-Z, all active states above the standard rated frequency are hardware controlled and C-States will still be entered when idle.  As this processor is based on the Haswell Microarchitecture with enhanced power management features it is not surprising that the results on this system are similar to that experienced with the Balanced Power Plan.



Timing = 6496

Res = 873729.721235

(1 row(s) affected)



Nevertheless by switching between Balanced and High Performance Power Plans you can run this routine and observe the impact on performance and determine what is best for your configuration.

 

As a further illustration in this example running the Power Saver Plan limits the processor to a maximum of 1Ghz and the stored procedure takes almost 28 seconds to complete:



Timing = 27950

Res = 873729.721235

(1 row(s) affected)



Clearly the impact on SQL Server performance is severe and therefore running in Power Saver mode is not advisable.  Instead testing between the Balanced and High Performance settings will indicate the optimal settings.  If on the other hand with this test you are not seeing the performance that you would normally expect (and typically with a modern processor this would be around 10 seconds, but could take longer depending on the CPU in question) then the next step is to troubleshoot the system BIOS settings.

 

After maximising your single threaded configuration you can then look to maximising CPU performance for multi-threaded workloads.  Typically with a scalable database such as SQL Server running a scalable database application will mean that the SQL Server workloads benefit in the same way that Oracle does from enabling Hyper-Threading  and therefore enabling Hyper-Threading should be the default option for scalability.  When enabling Hyper-Threading especially on systems with large core counts it is important to note that Windows Server 2008 and prior supported up to 64 CPUs only. Windows Server 2008 R2, 2012 and 2012 R2 supports up to 256 processors however divides these processors up into processor groups of 64 processors with handling of processor groups improving with each release of Windows.  For example on a test system with E7-4890v2 providing 120 logical processors Windows Server 2012 R2 presents these CPUs as 2 Processor Groups of 60 logical processors each.  SQL Server is able to operate across multiple Processor Groups however it is always a prudent approach with highly scalable workloads to ensure that sessions are evenly distributed across available CPU cores to ensure maximum performance.  For example the following screen taken from the metrics monitor of HammerDB while running a SQL Server workload illustrates a desirable distribution.

 

 

ssp4.png

 

 

For further troubleshooting I find these Open Source SQL Server Scripts useful.  Primarily designed to find I/O bottlenecks they are nevertheless useful to identify other resource bottlenecks as well. If you monitor a workload and these scripts report high wait events on the resource_type  of CPU under the resource_category of Threading and CPU then  it is worth investigating your Server Properties under processors to ensure that the workload is correctly balanced. Typically an automatic setting will be optimal however testing can confirm this for you.

 

 

ssp5.png



In this post I have looked at CPU single and multi-threaded performance applying some of the knowledge from Oracle on Linux workloads to ensure that SQL Server is similarly optimized to take advantage of the available CPU resources.

As the demand for database workload throughput increases along with server CPU core counts there can be an increasing challenge of maintaining scalability even with the most optimized workloads.  In fact in some scenarios following best practices can result in contention as more sessions running on more CPUs contend for access to the same SQL or data. In this blog post I will look at an extremely high throughput transactional Oracle database workload and the options available for improving concurrency and throughput without modifying the application.

 

The workload I will use is the HammerDB Oracle OLTP workload on a 4-socket Intel Xeon server - importantly the server CPU utilisation is over 95% and the throughput is already very high  so we need to be aware that contention may not "true" ie if the CPU is already fully utilised then contention may be as a result of a process waiting for CPU time as much as genuine contention on a particular object.  Nevertheless we are in the business of maximising as much as possible from our IT investments and it is interesting to see that even with full CPU utilisation there are options to reduce contention and improve performance still further.  This HammerDB workload is a good example of an optimized workload as it shares cursors, reduces parsing and makes use of bind variables. How it does this is when a workload is run, for each stored procedure a cursor is opened and as shown here in Oratcl the statement that calls the stored procedure is parsed once only:

 

set curn_sl [oraopen $lda ]

set sql_sl "BEGIN slev(:st_w_id,:st_d_id,:threshold); END;"

oraparse $curn_sl $sql_sl

return $curn_sl

 

When the stored procedure is called bind variables are used

 

orabind $curn_sl :st_w_id $w_id :st_d_id $stock_level_d_id :THRESHOLD $threshold

oraexec $curn_sl

 

using bind variables means that the same cursor can be shared between sessions, reducing the impact on the amount of memory needed for the shared pool and keeping hard parsing activity to a minimum: Running the workload and measuring the output with an AWR report shows the following top 10 foreground events:

Top 10 Foreground Events by Total Wait Time

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                            Total Wait    Wait   % DB

Event                                 Waits Time (sec) Avg(ms)   time Wait Class

------------------------------ ------------ ---------- ------- ------ ----------

DB CPU                                           34.2K           84.5

library cache: mutex X            1,443,752       2284       2    5.6 Concurrenc

cursor: pin S                     1,060,588     2004.2       2    5.0 Concurrenc

latch: In memory undo latch       1,036,240        401       0    1.0 Concurrenc

latch: enqueue hash chains          103,783      331.7       3     .8 Other

enq: TX - row lock contention       171,164      266.9       2     .7 Applicatio

db file sequential read             249,612      151.8       1     .4 User I/O

log file sync                        35,981      117.3       3     .3 Commit

db file scattered read              155,862      105.7       1     .3 User I/O

buffer busy waits                   270,967         90       0     .2 Concurrenc

 

Performance is good and CPU utilisation is maximised, however the top wait events of library cache: mutex X and cursor: pin S are related to the good practices of cursor sharing. Each of these events covers multiple mutexes and drilling down into the AWR report shows more detailed information on each.

                                                                         Wait

Mutex Type            Location                               Sleeps    Time (ms)

--------------------- -------------------------------- ------------ ------------

Cursor Pin            kksfbc [KKSCHLFSP2]                   634,827      933,897

Library Cache         kglpin1   4                           610,798      747,221

Library Cache         kglpndl1  95                          599,973      753,431

Library Cache         kglpnal1  90                          503,140      155,556

Cursor Pin            kksLockDelete [KKSCHLPIN6]            454,516      701,271

Library Cache         kglhbh1   63                                2            1

 

Here the top mutexes are related to child cursor lookup activity (kksfbc) and library cache pinning and unpinning activity (kglpin/kglpndl) - for further information on this there is an excellent presentation on Library Cache internals here.  Mutexes are extremely lightweight and use atomic instructions implemented at the CPU level with the LOCK CMPXCHG instruction, however a session sleeping waiting to acquire a mutex will do so utilising CPU time and therefore in this case the optimization has meant the high CPU and session count running the same stored procedures shows evidence of contention.

 

The tried and tested workaround is to modify the application to to artificially duplicate the SQL by changing the text. One such way to do this is by introducing 'dummy hints' that are not stripped out by the parser but are sufficient for Oracle to recognise that the statements are different and thereby reduces contention by having different sessions use different 'versions' of the same statements. This is not ideal however by modifying the application for a particular hardware platform and in some cases it is simply not an option to modify the application, instead, dbms_shared_pool.markhot can be used.  Instead of duplicating the SQL ourselves dbms_shared_pool.markhot serves this purpose introducing multiple versions of the same packages, procedures and cursors into the library cache for different sessions to use - to do this all we have to do is mark the objects in question hot.  We already know which packages and procedures are being used for this workload so can mark them hot at a SQL prompt as the SYS user as follows:

 

exec dbms_shared_pool.markhot(schema=>'SYS',objname=>'DBMS_RANDOM',NAMESPACE=>1);

exec dbms_shared_pool.markhot(schema=>'SYS',objname=>'DBMS_RANDOM',NAMESPACE=>2);

exec dbms_shared_pool.markhot(schema=>'SYS',objname=>'DBMS_OUTPUT',NAMESPACE=>1);

exec dbms_shared_pool.markhot(schema=>'SYS',objname=>'DBMS_OUTPUT',NAMESPACE=>2);

exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'NEWORD',NAMESPACE=>1);

exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'PAYMENT',NAMESPACE=>1);

exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'DELIVERY',NAMESPACE=>1);

exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'OSTAT',NAMESPACE=>1);

exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'SLEV',NAMESPACE=>1);

 

Re-running the workload shows a significant impact on the library cache: mutex X event and despite the high CPU utilisation already throughput has also improved.

 

Top 10 Foreground Events by Total Wait Time

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                            Total Wait    Wait   % DB

Event                                 Waits Time (sec) Avg(ms)   time Wait Class

------------------------------ ------------ ---------- ------- ------ ----------

DB CPU                                           34.5K           88.6

cursor: pin S                     1,071,740     2093.3       2    5.4 Concurrenc

enq: TX - row lock contention       373,472      548.9       1    1.4 Applicatio

latch: In memory undo latch         956,992      415.6       0    1.1 Concurrenc

latch: enqueue hash chains          103,312      355.7       3     .9 Other

db file sequential read             211,498      139.8       1     .4 User I/O

db file scattered read              168,610      123.5       1     .3 User I/O

log file sync                        36,418      118.7       3     .3 Commit

buffer busy waits                   257,958       95.9       0     .2 Concurrenc

library cache: mutex X               23,288       24.3       1     .1 Concurrenc

 

The detailed mutex information also shows the reduction in library cache mutex activity.

 

                                                                         Wait

Mutex Type            Location                               Sleeps    Time (ms)

--------------------- -------------------------------- ------------ ------------

Cursor Pin            kksfbc [KKSCHLFSP2]                   644,771      973,310

Cursor Pin            kksLockDelete [KKSCHLPIN6]            455,400      725,202

Library Cache         kglpin1   4                            10,879       11,115

Library Cache         kglpndl1  95                            8,330       10,836

Library Cache         kglpnal1  90                            6,959        1,857

Library Cache         kglhbh1   63                                1            0

 

Using dbms_shared_pool.markhot can be also be used for cursors and the most active ones can be seen within the view X$KGLOB -  there is a SQL statement from Andrey Nikolaev that shows the hottest objects in the library cache  here so we can modify that to find the top 4 cursors for the TPCC user and mark them hot as follows:


SET ECHO OFF

SET SERVEROUTPUT ON

DECLARE

BEGIN

FOR hash_id in (

select KGLNAHSV from (

select

case when (kglhdadr =  kglhdpar) then 'Parent' else 'Child' ||kglobt09 end,

kglhdadr ADDRESS,substr(kglnaobj,1,20),

parsing_schema_name, kglnahsh, KGLNAHSV, kglobtyd, kglobt23, kglobt24, kglhdexc, kglhdnsp

from x$kglob, v$sql

where x$kglob.kglnahsh = v$sql.hash_value

and v$sql.parsing_schema_name='TPCC'

and kglhdexc > 1

and kglobt24 > 1

order by kglobt24 desc) where rownum < 5

    ) LOOP

DBMS_OUTPUT.PUT_LINE('hash=>'|| hash_id.KGLNAHSV ||',namespace=>0');

SYS.DBMS_SHARED_POOL.MARKHOT(hash=>hash_id.KGLNAHSV,namespace=>0);

END LOOP;

END;

/

 

An example output is as follows - it is important to note that the hash value required is the full 16 byte hash value of KGLNAHSV in X$KGLOB although on some versions of Oracle no error is given if using the incorrect hash value however it has no effect so you should ensure that you are using the correct hash value as shown in the example: 

 

SQL> @markhot4sql

hash=>b45763c79b8f4f0e8439f6a3190f1b8f,namespace=>0

hash=>554b1093c9cd87db8f6f55016b20198a,namespace=>0

hash=>15e9d1f93391ce5edb7be748a2ab9f80,namespace=>0

hash=>6508e1443bf2a2948165d5ad91803582,namespace=>0

 

In this case cursor: pin S events have been reduced, however at the expense of increasing library cache mutex events and lowering throughput.

 

Top 10 Foreground Events by Total Wait Time

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                            Total Wait    Wait   % DB

Event                                 Waits Time (sec) Avg(ms)   time Wait Class

------------------------------ ------------ ---------- ------- ------ ----------

DB CPU                                           33.4K           79.7

library cache: mutex X              849,072     5296.8       6   12.6 Concurrenc

enq: TX - row lock contention       277,268      745.3       3    1.8 Applicatio

cursor: pin S                       222,045      455.4       2    1.1 Concurrenc

...

 

The detail shows however that it is a different area of library cache mutex x contention and from referring to the presentation noted previously kglhd is related to the handle structure and therefore creation of a new object.

 

Mutex Type            Location                               Sleeps    Time (ms)

--------------------- -------------------------------- ------------ ------------

Library Cache         kglhdgn1  62                        1,032,273    2,763,927

Cursor Pin            kksfbc [KKSCHLFSP2]                   202,786      338,137

Library Cache         kglhdgn2 106                          186,681      496,834

Library Cache         kglpin1   4                           127,693      190,311

Library Cache         kglpndl1  95                          126,592      200,987

Library Cache         kglpnal1  90                           83,421       19,982

Cursor Pin            kksLockDelete [KKSCHLPIN6]             56,208      100,403,


As a consequence with the system already at full CPU utilisation further optimization has proved counter

productive. In fact testing showed that even when marking packages and procedures hot only and limiting the number of copies created with the following parameter resulted in the optimal overall system performance.


SQL> show parameter kgl

 

NAME                     TYPE     VALUE

------------------------------------ ----------- ------------------------------

_kgl_hot_object_copies             integer     4


In summary what we have seen is that if experiencing contention on library cache: mutex X or cursor: pin S events then dbms_shared_pool.markhot can be a highly effective tool.  Whether looking to achieve a few extra percent on an extreme high performance workload or to achieve high scalability on systems with very high core counts Oracle on Intel makes this possible all without without needing to modify the application. 

In recent blog posts I have talked about optimizations of Oracle on Linux for both CPU and I/O with SSDs, in this post I will complete the picture and focus on configuring memory.

 

In my post how to Maximise CPU Performance for the Oracle Database on Linux I stated that the best place to start from CPU information is http://ark.intel.com.  Not surprisingly there is a similar tool to help you configure memory here http://imct.intel.com/ and this can help if you are currently sizing a system or wish to find out the capabilities of the system you already have.  To find out about your hardware as described by your BIOS you can use the command dmidecode . This well tell you information such as processors, BIOS version and system board as well as the memory that is currently populated, for example:

 

# dmidecode 2.11

SMBIOS 2.6 present.

Memory Device

      Size: 8192 MB

      Form Factor: DIMM

      Set: None

      Locator: DIMM_A1

      Bank Locator: NODE 0 CHANNEL 0 DIMM 0

      Type: DDR3

      Type Detail: Synchronous

      Speed: 1600 MHz

 

Referring to the dmidecode output and memory configuration tool and cross referring the capabilities of my processor and system board I can see that my balanced configuration of 128GB RDIMM 1600 Dual Rank (16 x 8GB) 1.5v can run at 1600 MT/s which is  PC3-12800 or a peak data rate of the module of 12.8GB/sec.  As the memory controller is on the CPU I can also reference the Memory Specifications sections from ark.intel.com to see that with the E5-2680 there is a maximum of 4 memory channels per socket making it clear from the memory specifications why the Max Memory Bandwidth per socket is given as 51.2GB/s (You can run the STREAM benchmark to test your own system).  ark.intel.com also tells us that in a 2 socket configuration with a system board with the maximum number of slots the maximum memory configuration is 768GB. Rather than having to manually size the memory configuration tool gives you the option to determine both the correct memory and possible levels of capacity and bandwidth based on the processor and system board information you have given.  Also checking on ark.intel.com for the E7 processor shows a current memory capacity of 2TB for a 4 socket configuration and 4TB for 8 sockets however there are no entries for E7 systems on imct.intel.com. In contrast to the four channel integrated memory controller on the E5 processor the E7 features two on-die memory controllers managing Scalable Memory Interconnect (SMI) channels connecting to Scalable Memory Buffers enabling up to 16 DIMMs per socket.  This architecture enables the maximum capacity to run at the maximum rated frequency which ark.intel.com has told us is 1066 MHz. Consequently for E7 the memory configuration tool is not required.

 

I’ve mentioned integrated memory controllers and memory capacity being dependent on the number of sockets and this indicates that Xeon systems have a NUMA configuration meaning that memory access times or latency is lower for a CPU accessing its local memory as opposed to the memory controlled by the remote CPU as shown.  LMbench is an example of tool that can be used to measure the different access times.


memoryconfig1.png

 

The CPUs communicate via the QPI (Quick Path Interconnect) link (on E5 2 socket there are 2 connecting the CPUs) and it is important to distinguish between the QPI links and the memory channels as shown above which can often a source of confusion.  Given the NUMA configuration of the platform this means that anyone running Oracle on Linux on Xeon needs to have some awareness of NUMA regarding their Linux and Oracle configurations.


Taking the Linux configuration first in my post on configuring redo for ASM I mentioned running the Oracle Validated/Pre-Install RPM with yum to configure the operating system.  One of the settings that this makes is to add the boot parameter “numa=off”. However as we have already seen the hardware itself is NUMA so how is it possible to turn it off?  Similarly as well as this option at the Linux level there may also be an option in the BIOS to turn NUMA off.  Whether at the BIOS or operating system at different levels what this option does is to interleave memory between the memory controllers on the system. The system gives the impression of having a single memory node however memory accesses are evenly distributed across all of the controllers in the system without awareness being required at the application level.  The reason for setting this parameter is a reflection of the fact that the difference between the latency of local and remote memory access measured in nanoseconds means that in on 2 and 4 socket systems the differences between the gains of enabling (and configuring) or disabling NUMA awareness are typically within 1% even when measured in a high performance environment. However going to 8 sockets and beyond may see gains of approximately 5% and enabling and correctly configuring NUMA awareness on all systems typically delivers the best performance so it is well worth knowing what this involves even if you decide to remain with the default configuration.

 

Having decided to enable NUMA, edited grub.conf removed “numa=off” and rebooted I can see that my 2 socket system is now running in NUMA mode:

 

[root@sandep1 ~]# dmesg | grep -i numa

NUMA: Initialized distance table, cnt=2

NUMA: Node 0 [0,c0000000) + [100000000,1040000000) -> [0,1040000000)

pci_bus 0000:00: on NUMA node 0 (pxm 0)

pci_bus 0000:80: on NUMA node 1 (pxm 1)

 

Using the numactl command I can also see how the memory is allocated now between 2 memory nodes and the CPUs associated with these nodes.

 

[root@sandep1 ~]# numactl --hardware

available: 2 nodes (0-1)

node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23

node 0 size: 65459 MB

node 0 free: 8176 MB

node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31

node 1 size: 65536 MB

node 1 free: 8447 MB

node distances:

node   0   1

  0: 10  21

  1: 21  10

 

We can now proceed to configure Oracle for NUMA however firstly it is worth looking into a feature called Huge Pages. The default memory page sized is determined by the CPU architecture and on x86 systems this page size is 4KB.  On Linux the page size is defined by  1 << PAGE_SHIFT and as PAGE_SHIFT by default  is 12,  212 gives us the expected 4KB.  Within the virtual memory addressing these lowest 12 bits of the address are used as an offset into this 4KB page.  If we change this offset to the next boundary of 21 bits used as the index into the page table for 4KB pages we can enable Huge Pages which are now 221 = 2MB. (1GB Huge Pages are also supported by current CPUs by further moving the offset). The main benefits of Huge Pages are firstly that the Translation Lookaside Buffer (TLB) that caches virtual to physical address mappings can cache the addressing of more of the Oracle SGA with larger pages and secondly the amount of memory used for page tables is reduced. Within Linux with standard pages a separate page table is maintained for each process. With a large Oracle SGA and a large number of processes this amount of memory used for page tables can be large and therefore using Huge Pages significantly reduces this overhead.  Huge Pages can be allocated with the parameter vm.nr_hugepages in /etc/sysctl.conf. For example to allocated 55000 2MB pages the following would be used and allocated at boot or by running sysctl –p to apply the changes.

 

vm.nr_hugepages=55000 

 

The allocated 55000 2MB pages are pinned in memory awaiting use by Oracle and can be viewed in /proc/meminfo.


HugePages_Total: 55000

HugePages_Free: 55000

HugePages_Rsvd:        0

HugePages_Surp:        0

Hugepagesize: 2048 kB

DirectMap4k: 6144 kB

DirectMap2M: 2058240 kB

DirectMap1G: 132120576 kB

 

The importance of Huge Pages in the context of this discussion is that the Huge Page allocation is also and separately NUMA aware.  Huge Pages are evenly distributed between the available memory nodes and as a consequence if NUMA is enabled at the operating system and Huge Pages are used then even without additional Oracle NUMA parameters the Oracle SGA will be evenly distributed between the memory nodes.  This can be thought of as being similar to setting the boot parameter numa=off but instead taking effect at a higher granularity.

When using Huge Pages you should set the Oracle parameter use_large_pages = ONLY. This ensures that the instance will start only if sufficient Huge Pages are available and report information on usage to the alert log.

 

************************ Large Pages Information *******************

Parameter use_large_pages = ONLY

Per process system memlock (soft) limit = 141 GB

Total System Global Area in large pages = 88 GB (100%)

Large pages used by this instance: 44844 (88 GB)

Large pages unused system wide = 10156 (20 GB)

Large pages configured system wide = 55000 (107 GB)

Large page size = 2048 KB

****************************************************************

 

It can be seen that the Oracle reported Huge Pages configuration corresponds to the information available from Linux.

 

[root@sandep1 ~]# cat /proc/meminfo | grep -i huge

AnonHugePages:    108544 kB

HugePages_Total:   55000

HugePages_Free:    10265

HugePages_Rsvd:      109

HugePages_Surp:        0

Hugepagesize:       2048 kB

 

The Huge Pages are evenly distributed so by looking just at node 0 we can see exactly half of the requested huge pages have been allocated from here.

 

[root@sandep1 node0]# pwd

/sys/devices/system/node/node0

[root@sandep1 node0]# cat meminfo

Node 0 AnonHugePages:     16384 kB

Node 0 HugePages_Total: 27500

Node 0 HugePages_Free:   5243

Node 0 HugePages_Surp:      0

 

After this discussion of Huge Pages you may be aware of a Linux feature called Transparent Huge Pages and be thinking that all of the benefits described previously will happen automatically by leaving this feature enabled. However at the time of writing Transparent Huge Pages do not support shared memory or in other words do not support the memory required for the Oracle SGA, for this reason for an Oracle on Linux system you should set the boot parameter transparent_hugepages=never to disable this feature and if using Huge Pages continue to configure them manually.

 

Whether using Huge or standard pages you can also enable NUMA awareness with Oracle by setting the parameter _enable_NUMA_support = TRUE.  When enabled this can be viewed in the Oracle alert log

 

NUMA node details:

  OS NUMA node 0 (16 cpus) mapped to Oracle NUMA node 0

  OS NUMA node 1 (16 cpus) mapped to Oracle NUMA node 1

 

and has the impact of binding core processes to NUMA nodes

DBW0 started with pid=12, OS id=3902 , bound to OS numa node 0

Starting background process DBW1

DBW1 started with pid=11, OS id=3904 , bound to OS numa node 1

Starting background process DBW2

DBW2 started with pid=14, OS id=3906 , bound to OS numa node 0

Starting background process DBW3

DBW3 started with pid=13, OS id=3908 , bound to OS numa node 1

Starting background process LGWR

LGWR started with pid=15, OS id=3910 , bound to OS numa node 1

 

allocating the SGA with locality considerations

 

Area #2 `NUMA pool 0' containing Subareas 3-3

  Total size 0000000a10000000 Minimum Subarea size 10000000

  Owned by: 0

   Area Subarea    Shmid      Stable Addr      Actual Addr

      2 3 25100291 0x00000080000000 0x00000080000000

               Subarea size     Segment size  

                          0000000a10000000 0000000a10000000

Area #3 `NUMA pool 1' containing Subareas 4-4

  Total size 0000000a10000000 Minimum Subarea size 10000000

  Owned by: 1

   Area Subarea    Shmid      Stable Addr      Actual Addr

      3 4 25133060 0x00000a90000000 0x00000a90000000

               Subarea size     Segment size  

 

as well as binding listeners to processors.

 

Setting  _enable_NUMA_support on a NUMA enabled system means Oracle will operate on a NUMA aware basis and as systems increase in size and socket counts deliver efficiencies that increase performance. 

 

Once the system is configured you may also want to monitor the memory  performance to ensure that everything has been configured correctly. To do so I would recommend the Intel Performance Counter Monitor , one of the tools included within this package enables you to observer your memory performance in a similar way to iostat is used to monitor I/O.


memoryconfig2.png


In the example shown above it can be seen that taking the approach described in this post and evenly distributing the Oracle SGA across the memory nodes in a system you can achieve an optimal and balanced Oracle configuration.

If you are using the Oracle Database and interested in encryption you should already using Oracle Transparent Data Encryption (TDE). I have already put 2 blog posts on how to acclerate TDE with Intel AES-NI here Real World Database Encryption Performance with Intel AES-NI Pt 1 and here Real World Database Encryption Performance with Intel AES-NI Pt 2.

To see how this works in a real world environment Dupont conducted a Proof of Concept with Oracle TDE using Intel AES-NI and in this white paper Increasing Security by Accelerating Data Encryption with Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI): Testing by DuPont and Intel Corporation you can see how Dupont improved Encryption and Decryption performance by 300%.

Oracle Openworld 2013 is only a few weeks away and this year I will be speaking at a session along with Hubert Nueckel - Technical Lead, Intel Corporation and Saraswathy Narayan - Senior Software Development Manager, Oracle on Oracle 12c performance. We will cover areas such as 

inside information on work areas leading up to Oracle Database 12c, optimal settings and tuning Oracle and Intel use to configure Intel systems and Intel and Oracle’s engagement with regard to Oracle Exadata. If you run Oracle on Intel or are considering modernizing your Oracle environment we hope to see you at the following session:

 

Session ID: CON10301
Session Title: How to Speed Up Your Oracle Database 12c Instance: Performance Engineered for Intel Technology
Venue / Room: Moscone South - 301
Date and Time: 9/25/13, 15:30 - 16:30

 

While at Oracle Openworld 2013 you can also Visit Intel in the Oracle Engineered Systems Showcase and talk to Intel experts in the Moscone North Hall Lower Lobby. I'll be there and also talking in the booths in the exhibition area of some our partners on Oracle database performance on Intel. I hope to see you there.

No self-respecting post on How to Maximise CPU Performance for the Oracle Database on Linux would be complete without also mentioning Hyper-Threading(HT). This page describes HT perfectly as a feature that “uses processor resources more efficiently, enabling multiple threads to run on each core.” And is an implementation of a feature called Simultaneous Multithreading (SMT).  In more details the figure shows 2 threads being scheduled on a processor, firstly without HT and then with HT illustrating fewer processor execution units sitting idle when HT is enabled.

 

HTBlog_pic1.png

 

From this description you can see that  HT enables 2 threads to be executing on the same processor core at the same time and thereby improves performance by improved efficiency. It does not increase the number of physical cores available and whether HT is enabled or disabled does not impact the Oracle licence cost which is calculated based on the number of cores.

 

As always the best place to start to see if you can benefit from a CPU features is http://ark.intel.com. With an

example E5-2680 we can see that HT is available.

 

HT_Blog_Pic2.png

 

Ensuring that HT is enabled at the BIOS once the Linux operating system started you can see double the number of processors in your processor listing. In this example system with 2 x E5-2680 there are 16 physical cores however the system now sees 32 logical processors.

 

[root@sandep1 ~]# cat /proc/cpuinfo | grep processor

processor   : 0

processor   : 1

processor   : 2

processor   : 3

processor   : 28

processor   : 29

processor   : 30

processor   : 31

 

You can discover some more about the configuration by using the Processor Topology Tool . Running this tool on the example system tells us that there are 2 packages, 16 processors and 32 logical processors. Also it tells us which Operating System CPU corresponds to which logical processor so core 0 and thread 0 are OS processor 0 however Core 0 and thread 1 are OS processor 16. You can see that what Linux does is number the first 16 OS processors as thread 0 on the CPUs and then thread 1 for processors 16 to 31.  Scheduling threads across the processor count ensures that the maximum number of physical resources are used first before then using HT or as using the initial physical processor as thread 0 and an additional logical processor as thread 1.

 

[root@sandep1 oracle]# ./cpu_topology64.out

Software visible enumeration in the system:

Number of logical processors visible to the OS: 32

Number of processor cores visible to this process: 16

Number of physical packages visible to this process: 2

Hierarchical counts by levels of processor topology:

# of cores in package 0 visible to this process: 8 .

# of logical processors in Core 0 visible to this process: 2 .

OScpu    

0        16|1 17|2        18|3        19|4 20|

5        21|6 22|7        23|

Core

c0_t0 c0_t1|c1_t0 c1_t1|c2_t0 c2_t1|c3_t0 c3_t1|c4_t0 c4_t1|

c5_t0 c5_t1|c6_t0 c6_t1|c7_t0 c7_t1|

 

 

Oracle is also aware of the distinction between available CPUs and physical cores as shown within an AWR report.

 

 

Host Name

Platform

CPUs

Cores

Sockets

Memory (GB)

sandep1.example.com

Linux x86 64-bit

32

16

2

  1. 62.99

 

 

As always to prove the benefits of any feature beyond doubt I always run tests. To do so with HT is especially straightforward – what we need to do is to put a system under test conditions, turn HT off and run the test and then turn it on and run the test. The difference between the 2 will be the impact of HT. For the test I will use HammerDB, as the documentation shows  the tests are fully repeatable so we can be sure that the impact between the 2 tests are attributed to HT. Reading the introduction to transactional testing guide will also help understand the results here. I took the E5-2680 system installed it with Oracle 12c and repeated the same test unattended with the autopilot feature with HT on and off. The performance profile looking from left to right showed an increased system load by increasing the number of virtual users up to a point full system utilisation. The actual performance values are not shown and the bar graph shows the point of peak performance. The red bar shows HT off and the blue line HT on.

 

HTBlogPic3.png

With HT enabled we can scale beyond the level reached by the HT off configuration and in this example with this workload on this system the performance is 1.35X or 35% higher by enabling HT. Additionally for these increased numbers of virtual users transaction response times will also be improved as we can achieve a higher level of throughput.  (You can measure these response times with HammerDB).

The important factor to note is the improvement in scalability from enabling HT equivalent at full utilisation in this case to the throughput e would expect from more than additional 5 physical cores, the test gives compelling evidence that overall HT can improve your Oracle performance.


I have heard of recommendations to turn off HT which put in context of this testing does seem unusual however it is possible to see how this could happen. With this example we already know it is highly scalable and predictable and the system is well balanced configured, in other words if we add more CPU resource we can take advantage of it and therefore can take advantage of HT. However in some scenarios there can be issues with either the configuration or the application where it cannot take advantage of more CPU resource. For example there will be a single point of contention such as Oracle latch contention and adding even more sessions requesting the same latch can cause the system to slow even further and throughput to lower. Similarly insufficient or misconfigured I/O, memory or network can also suffer by adding additional CPU resource to a workload. We have discussed transactional workloads in this post however Parallel Query is a type of workload where this type of scenario comes to mind by increasing the number of parallel query threads beyond the system capability.

In these scenarios removing physical CPUs from a server for testing is not a straightforward task however turning off HT at the BIOS is. In these circumstances it is possible to see how turning off HT could be seen to solve the problem however in reality system configuration or application scalability is likely to be the cause. Putting Oracle (and SQL Server) aside with some other databases, database scalability may also be an issue, for example Redis is a single-threaded database and therefore a single instance will not benefit from HT although enabling HT could mean that you could run more instances on a single system.

In summary, Hyper-Threading is a performance and scalability feature that can be of significant benefit in an Oracle environment. This example of a scalable application shows a performance improvement of 35% by using the same number of cores more efficiently. The actual benefit you see will depend on application scalability and overall system configuration to take advantage of CPU resource when available. In an Oracle environment if you have HT available I recommend enabling this feature and if you receive a recommendation to turn it off to spend some time investigating configuration or application issues to see why you not benefiting by adding CPU resource to the system.

In my previous post How to Maximise CPU Performance for the Oracle Database on Linux I looked at some of the settings related to CPU performance and in this post I will look at some tests that prove out the concepts discussed in the earlier post. To do that I am going to use the test introduced here Engineered for Innovation at OOW11 - are you using Turbo Boost for your Oracle Database?

This Oracle PL/SQL routine from Julian Dyke can be run in any Oracle instance on any system and reports a result that should always be 873729.72 as well as an elapsed time.


  1. SET SERVEROUTPUT ON
  2. SET TIMING ON
  3. DECLARE
  4.   n NUMBER := 0;
  5. BEGIN
  6.   FOR f IN 1..10000000
  7.   LOOP
  8.     n := MOD (n,999999) + SQRT (f);
  9.   END LOOP;
  10.   DBMS_OUTPUT.PUT_LINE ('Res = '||TO_CHAR (n,'999999.99'));
  11. END;
  12. /

 

The aim is to observe how fast the Oracle single thread performance is by achieving a lower elapsed time as possible. For example with the 2011 post  the Intel Xeon Processor 5680 completed the PL/SQL test ran in 8.89 secs with turbo boost off (at the BIOS) and 8.21 with it enabled (again at the BIOS level). Therefore we can also use the same test to observe the impact of different C-state settings on Turbo Boost for a single threaded workload. After this we can use a HammerDB workload to observer similar data on a multi-threaded workload. But firstly we will look at the PL/SQL test along with some of the turbostat data on the E5-2680 system.

 

As a baseline I turned turbo boost off by setting the EIST BIOS setting to off.  Running the PL/SQL test showed the frequency at a maximum of 2.7GHz without turbo boost frequency.

 

pkg core CPU   %c0  GHz  TSC  %c1  %c3  %c6  %c7

  1    0 24 100.00 2.69 2.69 0.00 0.00 0.00 0.00  

 

And the completion time for the baseline test was 10.45 seconds.

Res = 873729.72

PL/SQL procedure successfully completed.

Elapsed: 00:00:10.45

 

I then re-enabled EIST for all subsequent tests.

 

In the previous post I briefly mentioned that in an Oracle workload you would not wish to set the In-kernel governor  to powersave mode and the PL/SQL test shows why. In this instance the frequency for the core that the test is run on does not rise above 1.2Gz

 

pkg core CPU %c0    GHz  TSC  %c1  %c3  %c6  %c7

  1    2 10 100.00 1.20 2.69 0.00 0.00 0.00 0.00  

 

And the elapsed time is over 23 seconds close to 3X more than the performance seen in the 2011 post.

 

Res = 873729.72

PL/SQL procedure successfully completed.

Elapsed: 00:00:23.50

 

For subsequent tests I therefore set the In-kernel governor and energy performance bias settings to performance.

Firstly leaving the system settings at the default letting intel_idle manage the C-state settings we see that when idle the system makes full use of deeper C-states.

 

pkg core CPU  %c0  GHz  TSC    %c1 %c3    %c6   %c7

   0   0 0   0.11 2.42 2.69   5.39 0.31  0.00  94.20  

   0   1 1   0.10 2.72 2.69   1.64 0.84  0.00  97.41  

   0   2 2   0.02 2.93 2.69   0.03 0.00  0.00  99.95  

   0   3 3   0.00 2.41 2.69   4.24 0.00  0.00  95.76

 

And the PL/SQL routine consistently completes consistently in 8.01 seconds. (subsequent elapsed times are just shown once)

 

Res = 873729.72

PL/SQL procedure successfully completed.

Elapsed: 00:00:08.01

 

Res = 873729.72

PL/SQL procedure successfully completed.

Elapsed: 00:00:08.01

 

I then set the kernel parameter intel_idle.max_cstate=3, rebooted, started Oracle and re-ran the test which completed marginally faster.

 

Res = 873729.72

PL/SQL procedure successfully completed.

Elapsed: 00:00:07.97

 

Looking at turbostat showed turbo boost was being used and this time the deepest C-State was C6.

 

pkg core CPU %c0   GHz  TSC   %c1 %c3  %c6    %c7

   0   0 0   0.12 3.44 2.69  0.50 0.01 99.37  0.00  

   0   1 1  99.94 3.45 2.69  0.06 0.00  0.00  0.00  

   0   2 2   0.08 3.38 2.69  0.15 0.01 99.76  0.00

 

Repeating the test with  intel_idle.max_cstate=2 produces a very similar elapsed time and the deepest C-State as expected is C3.  We can see that whilst one core is in turbo boost other cores are at deeper C-States.

 

Res = 873729.72

PL/SQL procedure successfully completed.

Elapsed: 00:00:07.98

 

pkg core CPU %c0   GHz  TSC   %c1   %c3  %c6    %c7

   0   0 0   0.21 3.35 2.69  4.67  95.12 0.00   0.00  

   0   1 1  93.33 3.49 2.69  0.13   6.54 0.00   0.00  

   0   2 2   0.22 3.33 2.69  1.42  98.36 0.00   0.00  

 

I then moved  to  intel_idle.max_cstate=1 and now this produces an elapsed time that is almost 13% lower than previous.

 

Res = 873729.72

PL/SQL procedure successfully completed.

Elapsed: 00:00:09.00

 

Turbostat shows that the max turbo frequency is now 3.09GHz again almost 13% lower than previous and the deepest C-State is C1.

 

pkg core CPU %c0   GHz  TSC    %c1  %c3  %c6  %c7 

   0   0 0   0.61 3.07 2.69  99.39 0.00 0.00 0.00  

   0   1 1   1.12 3.09 2.69  98.88 0.00 0.00 0.00

   0   2 2  92.89 3.09 2.69   7.11 0.00 0.00 0.00  

 

Setting intel_idle.max_cstate=0 processor.max_cstate=0 produces almost the same result as the previous test.

 

Res = 873729.72

PL/SQL procedure successfully completed.

Elapsed: 00:00:09.00

 

pkg core CPU %c0   GHz  TSC    %c1  %c3  %c6  %c7  

   0   0 0   0.50 3.09 2.69  99.50 0.00 0.00 0.00  

   0   1 1 100.00 3.09 2.69   0.00 0.00 0.00 0.00

   0   2 2   0.63 3.09 2.69  99.37 0.00 0.00 0.00 

 

Returning back to kernel parameter intel_idle.max_cstate=3 I then ran a schema creation with HammerDB to simulate a full Oracle workload and monitored the system using turbostat. Whilst building the schema with all cores active we achieve a turbo boost frequency of 3.09GHz on all cores.

 

pkg core CPU %c0   GHz  TSC %c1  %c3  %c6  %c7 

   0   0 0  99.27 3.09 2.69 0.73 0.00 0.00 0.00  

   0   1 1  99.25 3.09 2.69 0.75 0.00 0.00 0.00  

   0   3 3  98.93 3.09 2.69 1.07 0.00 0.00 0.00  

   0   4 4  98.95 3.09 2.69 1.05 0.00 0.00 0.00  

   0   5 5  99.23 3.09 2.69 0.76 0.00 0.01 0.00  

   0   6 6  99.04 3.09 2.69 0.94 0.01 0.00 0.00  

   0   7 7  99.28 3.09 2.69 0.72 0.00 0.00 0.00  

   1   0 8  98.95 3.09 2.69 0.94 0.02 0.09 0.00  

 

And once the build is complete and the single threaded DBA workload of creating indexes and gathering statistics takes place this is boosted to the full max turbo frequency of the CPU.

 

pkg core CPU %c0   GHz  TSC  %c1  %c3  %c6   %c7

… 

   0   0 0  99.87 3.48 2.69  0.13 0.00 0.00  0.00  

   0   1 1   0.75 3.46 2.69 99.25 0.00 0.00  0.00  

   0   2 2   0.67 3.36 2.69  4.05 0.42 94.86 0.00  

 

In summary what we have observed correlates bears out the findings from the post How to Maximise CPU Performance for the Oracle Database on LinuxIf we disable or limit C-State settings we are very likely to impact the systems ability to make use of Turbo Boost Technology up to its full potential.  We also saw that Oracle makes use of both single and multi-threaded workloads and with the correct settings turbo boost provides the ideal adaptation to provide performance exactly where you need it. At the very least if you change high level BIOS or kernel parameter settings you should find out what the settings change and always test single and multi-threaded workloads such as the example shown here to ensure that you get the best CPU performance in your Oracle environment.

If you look at the keynote slides from the recent Reimagine the Data Center event, Intel announced some of the products from its “Low Power Product Direction” with impressive figures for how low the power consumption now goes for both Xeon and Atom processors. Nevertheless in mission critical environments some Oracle users ask for maximum performance configurations at the expense of some additional power usage. In this scenario the users will select a BIOS option such as shown and assume the system will give the best possible performance all of the time.


maxperf.png


However, they may be surprised to find that setting the “Maximum Performance” option may not actually be the best choice and could actually result in lower performance in some scenarios. With Oracle for example a busy OLTP environment may require high performance across all cores at the same time, whereas if DBA is building indexes or gathering statistics they may prefer to boost the performance for a single thread or process for that one job. Power and performance are closely intertwined and Intel processors give you fine grained control in exactly how you balance the system for both saving power and high levels of performance. In this post I will look at some of the CPU technologies such as Enhanced Intel SpeedStep Technology (EIST), Intel® Turbo Boost Technology, C-states and -Pstates and the Linux In-Kernel Governors and cpuidle drivers. In a subsequent post I will run some tests with an Oracle database on Linux to try out some of the configurations available.

 

The best place to start is http://ark.intel.com. Find your CPU model from /proc/cpuinfo as shown and look up the specifications, here is the example from my test system:

 

# cat /proc/cpuinfo | grep "model name"

model name  : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz


e52680.png


The first thing we can observe is that the Clock Speed and Max Turbo Frequency are different. In fact the Max Turbo Frequency is much higher than the Clock Speed so we need to look at some of the technologies surrounding Turbo Boost to see if selecting “Maximum Performance” really is the best option.

 

There is some introductory information on Turbo Boost here – at the highest level this tells us 2 important aspects Firstly Turbo Boost is activated when the OS requests the highest processor performance state (P0) and secondly the actual frequency reached is dependent on algorithms evaluating the number of active cores and the processor current, power and temperature. In turn this tells us that we need to look into Performance States or P-States and also that the performance of one core on a processor is also partly dependent on the settings and workload on the other cores on the same processor meaning that we also need to look at the processor operating states or C-states.

 

P-states are voltage/frequency pairings controlled through Enhanced Intel SpeedStep Technology (EIST) and therefore If you disable EIST at the BIOS then you should also find that your BIOS disables any options to use Turbo Boost. With our example processor the “Max Turbo Frequency” is state P0, “Clock Speed” is P1 and other pairings are defined within the range P1 to Pn.  The ranges P1 to Pn are controlled by the Linux operating system and the active configuration can be seen in /proc/cpuinfo for example the extract below shows 2 cores on the test system at different P-States.

 

[root@sandep1 ~]# cat /proc/cpuinfo | grep -i mhz

cpu MHz           : 1200.000

cpu MHz           : 2701.000

….

 

In contrast the state P1 to P0 is hardware controlled and the frequency achieved by a core dependent on the factors mentioned previously up to the Max Turbo Frequency, this frequency is not seen through the /proc/cpuinfo interface.  In a Linux environment P-States are controlled by the In-Kernel Governors. By default the ondemand governor is active and the powertop utility shows the available and active frequencies across the system.

 

PowerTOP version 1.11      (C) 2007 Intel Corporation

Cn Avg residency       P-states (frequencies)

C0 (cpu running)        ( 8.7%)         2.71 Ghz    25.0%

polling           0.0ms ( 0.0%)         2.50 Ghz     0.0%

C1 mwait          0.2ms ( 0.1%)         2.00 Ghz     0.6%

C2 mwait          0.6ms ( 0.1%)         1500 Mhz     0.0%

C3 mwait          0.0ms ( 0.0%)         1200 Mhz    74.4%

C4 mwait         12.5ms (91.0%)

 

You can choose read and set your governor on per CPU basis as follows:

 

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

echo "ondemand" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

 

or across the whole system with  a script. Alternatively the package cpufrequtils is available on the install media with the commands cpufreq-info and cpufreq-set , however  if using this utility I recommend manual checking to ensure that you get the settings you have chosen.

 

In addition to ondemand you have the powersave governor. This sets the frequency to the lowest available in this case 1200 MHz. In an Oracle environment it is highly unlikely that you would wish to do this. More importantly you also have the performance governor. By setting the performance governor for all CPUs in this looking in /proc/cpuinfo in this example will show 2701.000 MHz for all CPUs and using powertop you can see a state given as 100% Turbo Mode.

 

PowerTOP version 1.11      (C) 2007 Intel Corporation

Cn Avg residency       P-states (frequencies)

C0 (cpu running)        ( 0.0%)       Turbo Mode   100.0%

polling           0.5ms ( 0.0%)         2.71 Ghz     0.0%

C1 mwait          0.2ms ( 0.1%)         2.50 Ghz     0.0%

C2 mwait          0.4ms ( 0.1%)         2.40 Ghz     0.0%

C3 mwait          0.0ms ( 0.0%)         2.31 Ghz     0.0%

C4 mwait        14.7ms (110.1%)

 

In other words the operating system is asking for state P0 for all CPUs and passing control to the hardware to activate Turbo Boost dependent on the factors noted previously up to its full range. Those wishing to achieve “Maximum Performance” with minimum latency of switching between P-States can choose the Performance Governor, those wishing to also save power can remain with the default governor of ondemand.

 

From the Powertop output we can also see additional CPU operating states called C-States. As powertop shows C-State C0 is a state when the CPU is running and therefore all P-States are active within C-State C0, all other C-States C1-Cn show when the CPU is idle and not executing instructions. To save energy parts of the CPU can be powered down and a deeper C-state has greater power savings however there are greater latencies to then return to C0.  In similarity to EIST, you can set C-States at the system BIOS and interestingly checking the manual for the system with the “Maximum Performance” BIOS has the entry.

 

Minimum Processor Idle Power State = No C-states

 

However as we have seen our hardware controlled Turbo Frequency of P0 is partly dependent on the number of “active” cores or with the further detail we now have the C-state residency of those cores. Therefore choosing “No C-states” could reduce the ability to reach the full potential Max Turbo Frequency and therefore reduce potential performance in some scenarios. 

 

We do need further information however as even though you may have chosen “No C-states”  at the BIOS you will find that without additional kernel parameters an up to date Linux OS on Intel Processors will go ahead and use them anyway. This is because by default the cpuidle driver that manages the processor idle state is intel_idle which will utilise C-States irrespective of the BIOS settings.  You should see a message in your dmesg output such as:

 

using mwait in idle threads

ACPI: acpi_idle yielding to intel_idle

 

Using the turbostat utility from the pmtools power management package you can observe the particular C-state that is active on a core. For simplicity I have truncated that output showing the package C-States to only show the core C-States.

 

pkg core CPU   %c0   GHz  TSC   %c1    %c3    %c6   %c7  

               0.04 2.31 2.69  20.19   0.11   0.00  79.66

   0   0   0   0.11 2.42 2.69   5.39   0.31   0.00  94.20  

   0   1   1   0.10 2.72 2.69   1.64   0.84   0.00  97.41

   0   2   2   0.02 2.93 2.69   0.03   0.00   0.00  99.95

   0   3   3   0.00 2.41 2.69   4.24   0.00   0.00  95.76

   0   4   4   0.00 2.30 2.69  12.34   0.00   0.00  87.66  

 

You can see that by default when idle even though the performance governor has been set the deep C-state C7 has been entered on all of the cores when idle.  If looking to control which C-states are entered instead of the BIOS setting you can use the kernel parameter intel_idle.max_cstate. Note that the C-States are defined to the ACPI standard so with the example processor setting the parameter intel_idle.max_cstate=2 will ensure that the processor goes no deeper than C-State 3, %c3 when viewed in turbostat and similarly intel_idle.max_cstate=3 ensures no deeper than C-State 6, %c6.  You can observe the difference in latencies measured in microseconds in /sys/devices/system/cpu/cpu0/cpuidle for a particular C-State.

 

[root@sandep1 cpuidle]# cat ./state2/latency

80

[root@sandep1 cpuidle]# cat ./state3/latency

104

 

As we saw previously the mwait instruction is used in idle threads and this means C-State C1 will be entered whenever the core is idle irrespective of other settings. Setting intel_idle.max_cstate=0 however disables the use of intel_idle (it does not restrict the C-States to C0) and falls back to acpi_idle however acpi_idle is not aware of individual core C-states and only at the package level.  When using mwait even when  intel_idle is disabled C1 will be used and only adding an additional parameter such as “idle=poll” will keep the CPU in state C0 at all times using considerably more power to do so.

 

pkg core CPU   %c0   GHz  TSC   %c1  %c3    %c6    %c7

               6.15 3.04 2.69  93.85 0.00   0.00   0.00  

   0   0   0   0.61 3.07 2.69  99.39 0.00   0.00   0.00

   0   1   1   1.12 3.09 2.69  98.88 0.00   0.00   0.00

   0   2   2  92.89 3.09 2.69   7.11 0.00   0.00   0.00  

 

The disadvantage however is that setting any numerical value lower than intel_idle.max_cstate=2 will prevent some of the cores entering a C-State deep enough to then enable another core in the same package reach the full turbo frequency.

 

Another further CPU setting is the Energy/Performance Bias and Red Hat and Oracle users should note that the default setting has changed in the Linux kernel used between the releases of Red Hat/Oracle Linux 5 and Red Hat/Oracle Linux 6. (Some system BIOS options may include a setting to prevent the OS changing this value). In release 5 Linux did not set a value for this setting and therefore the value remained at 0 for a bias towards performance. In Red Hat 6 this behaviour has changed and the default sets a median range to move this bias more towards conserving energy (remember the same Linux kernel is present in both ultrabooks as well as  servers and on my ultrabook I use powertop and the other Linux tools and configurations discussed here to maximise battery life) and reports the following in the dmesg output on boot.

 

ENERGY_PERF_BIAS: Set to 'normal', was 'performance'

ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)

 

One impact of changing this setting can be the time that a process runs before the full turbo boost frequency is used. To change this setting you can use the x86_energy_perf_policy tool.  With this tool the default energy performance policy can be read.

 

[root@sandep1 x86_energy_perf_policy_tool]# ./x86_energy_perf_policy -r

cpu0: 0x0000000000000006

cpu1: 0x0000000000000006

 

You can also use the tool to set a lower value to change the bias entirely towards performance (the default release 5 behaviour).

 

[root@sandep1 x86_energy_perf_policy_tool]# ./x86_energy_perf_policy -v performance

  1. CPUID.06H.ECX: 0x9

cpu0 msr0x1b0 0x0000000000000006 -> 0x0000000000000000

cpu1 msr0x1b0 0x0000000000000006 -> 0x0000000000000000

 

In summary if you want a “Maximum Performance” configuration setting for Oracle your first step is to find out the potential of the processors that you are using and to ask your vendor what parameters the high-level BIOS settings actually set as different vendors may enable different parameters.  If you wish to utilise Turbo Boost do not disable EIST and do not disable C-States at the BIOS or disable the intel_idle cpuidle driver. You can use the intel_idle.max_cstate kernel parameter if you wish to control C-State behaviour, however always test and observe before making changes whilst being aware that enabling some cores to utilise deeper C-States may allow other cores to reach higher Turbo Frequencies for longer. You should also have noted that in fact the default system and Linux settings are often a good starting point for a balanced configuration. On Linux you can use the tools, turbostat, powertop and x86_energy_perf_policy_tool to observe and modify your system behaviour and in the next post I will look at running some simple tests to observe the configuration discussed here.

I recently blogged about configuring Oracle redo on SSD with ASM. For completeness if you are not using ASM and need a filesystem configuration, for example with TimesTen the following is the configuration that I have used

 

Firstly partition the disks exactly as done previously for Oracle:

fdisk -c -u /dev/sda.

 

Use Linux RAID to stripe accross the existing disks:
mdadm --create --verbose /dev/md1 --chunk=128 --level=stripe --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

Create an XFS filesystem specifying same stripe and number of disks as RAID array:
mkfs -t xfs -f -d su=128k -d sw=4 /dev/md1

Mount with default options:
/dev/md1        /timesten       xfs  defaults,noatime,nodiratime     0 0

 

Make the configuration permament:
mdadm --examine --scan --config=mdadm.conf > /etc/mdadm.conf

To undo the configuration stop the RAID configuration and zero the superblock of all of the component disks and the remove the configuration from  /etc/mdadm.conf

mdadm -S /dev/md1

mdadm --zero-superblock /dev/sda1

In my previous post I set up Oracle 12c on both 8 * X25-E in RAID 0 and SSD 910 disks. In particular other tests  I then wanted to see how they would perform. In particular we had previously noted that some Oracle tests had suggested mixed results however maybe both the test and the configuration were not necessarily the best that could be chosen for an Oracle environment.    I installed HammerDB on a client system as we know it gives consistent results and created an OLTP database ready to test. I ran a large number of users to drive up the transaction rate on the 2 socket E5 system and captured an AWR report. I ran the test multiple  times to ensure consistency  and this is what I saw for the redo and transactional throughput.

 

8 * X25-E

Per Second

Per Transaction

Redo size (bytes):

93,789,841.7

5,327.5

Transactions:

17,605.0

 

 

SSD 910

Per Second

Per Transaction

Redo size (bytes):

93,240,125.3

5,318.7

Transactions:

17,530.7

 

 

So the 8 * X25-E in RAID 0 with all cache enabled did 1056300 TPM (transactions per minute) and the SSD 910 did 1051842 TPM both within 0.5% of each other. I know the system can deliver a higher transaction rate however I was interested in capturing all of the statistics and comparing the disk redo performance with the same workload and for both disk configurations redo was generated at 5.5GB / minute. (Remember that all of the data is on the SSD 910 as well).

 

Looking at the log writer performance from the AWR the Reqs per sec equivalent to the IOPs are 10559 for the SSD 910 and 13097 for the X25-E RAID configuration so we know there is much more potential for write performance with the SSD 910 rated up to 75,000 write IOPs.

 

8 * X25-E

Function Name

Writes: Data

Reqs per sec

Data per sec

Waits: Count

Avg Tm(ms)

LGWR

28.1G

13097.48

95.512M

2.5M

0.00

 

SSD 910

Function Name

Writes: Data

Reqs per sec

Data per sec

Waits: Count

Avg Tm(ms)

LGWR

32.9G

10559.18

111.736M

0

0.00

 

The top wait events were also similar however the log file sync time was 2ms which appeared a little high.

 

8 * X25-E

Event

Waits

Total Wait Time (sec)

Wait Avg(ms)

% DB time

Wait Class

DB CPU

9032.9

55.3

 

log file sync

2,980,901

4620.3

2

28.3

Commit

 

SSD 910

Event

Waits

Total Wait Time (sec)

Wait Avg(ms)

% DB time

Wait Class

DB CPU

8978.2

52.4

 

log file sync

3,621,973

5664

2

33.0

Commit

 

However the AWR report reported busy CPU at over 98% and checking the log file parallel write time can help determine the time actually spent writing to disks. In this case for both configurations the disk was not a sgnificant component of the log file sync, instead with the CPU so busy the more time is utilsed on CPU scheduling for log file sync.

 

8 * X25-E

Event

Waits

%Time -outs

Total Wait Time (s)

Avg wait (ms)

Waits /txn

% bg time

log file parallel write

2,449,867

0

174

0

0.46

 

61527766.78

 

SSD 910

Event

Waits

%Time -outs

Total Wait Time (s)

Avg wait (ms)

Waits /txn

% bg time

log file parallel write

2,055,279

0

208

0

0.39

80190288.85

 

Taking a brief snapshot of the iostat data also confirms that for both configurations we could drive redo throughput much higher and we are maximising CPU before disk.

 

8 * X25-E

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util

sdf               0.00     0.00    0.00 15137.67     0.00    96.30    13.03     0.64    0.04   0.02  37.8

 

SSD 910

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          86.23    0.00    8.44    0.63    0.00    4.70

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00   25.67 2558.67     0.40    27.67    22.24     0.19    0.07   0.05  13.77
sdb               0.00     0.00   17.33 2578.67     0.27    27.79    22.14     0.18    0.07   0.05  13.47
sda               0.00     0.00   18.67 2574.67     0.28    27.67    22.08     0.18    0.07   0.05  13.30
sdd               0.00     0.00   21.00 2551.33     0.33    27.92    22.49     0.19    0.07   0.05  13.87

 

One other area of note in the AWR was that redo wastage was higher for the SSD 910. Oracle aligns redo with the disk sector chosen in this case 512 byte for X25-E and 4KB for SSD 910, the log writer will never re-read a block that is not completely filled and always begins at a block boundary. With a larger sector and redo block size we see more wastage however this is not of great concern as we have already seen that the transaction throughput is almost the same.

Finally the new performance view V$LGWRIO_OUTLIER in 12c allowed me to check if any of the I/O requests were taking more than 500ms to complete, suprisingly there were a few entires in here for X25-E configuration but none for the SSD 910.

 

SQL> /

   IO_SIZE WAIT_EVENT                                                          
---------- ----------------------------------------------------------------    
FILE_NAME                                                                      
--------------------------------------------------------------------------------
IO_LATENCY                                                                     
----------                                                                     
      1024 log file parallel write                                             
+X25E/SBDB1/ONLINELOG/group_3.256.821098037                                    
620735830                                                                     
                                                                               
        10 log file parallel write                                             
+X25E/SBDB1/ONLINELOG/group_3.256.821098037                                    
622271590    

...                                                            
                       

Is it worth investigating the X25-E outliers? No, as we have noted before this configuration was brought out of retirement for comparsion and is an EOL product.

 

So in summary should you continue to put your Oracle redo (and data) on SSDs? Yes absolutely, current products at the time of writing such as SSD 910 raise the bar sufficiently for Oracle redo whatever your demands for throughput and yes if you check the product brief you can also see that this product and current data center products such as the DC S3700 come with the feature Enhanced Power Loss Data Protection.

Recently my initial post on Redo and Transaction Logs on SSDs using the X25-E received a comment almost a couple of years after the original post and as I replied the X25-E was launched in 2008 and is end of life. SSD technology has moved on considerably as also has the server platform, operating system and database so now seems like a good time for an update to see whether putting redo on the latest SSD products is still a good idea.


I’m currently working with the Intel SSD 910 Series for Database environments and for the control I will use the older  Intel® RAID Controller RS2BL080 with 8 x Intel X25-E Extreme SATA SSD in a RAID 0 configuration. Unfortunately the hard disk storage array used in the previous test has long since expired and not been replaced – however the X25-E configuration provided the highest performance in the previous comparison and being an SLC drive with both onboard disk cache and write back cache enabled on the RAID controller still provides a high performance database write configuration for our control.


Before starting it was interesting to note that some published reviews of Oracle performance seemed to indicate that performance would be lower, there was one for example of the SSD 910 over at AnandTech that reports Oracle throughput  at 1/3rd the rate of a single X25-E!. It is therefore beneficial to cover the details of the configuration beforehand to ensure we have it correct. I will cover configuration in this part and then tests in the next.


For my test system I used a  2 socket Intel E5 server that I had to hand (so the performance data is not going to be directly comparable with the 4 socket from the previous tests) and I installed the SSD910 in an x8 PCIe slot as per the specification. I also installed the X25-E RAID configuration. I then installed Oracle Linux 6.3. noting that Red Hat 6 and compatible releases such as Oracle Linux have support for disks with 4KB sectors.  As we move ahead you will see why this is important – I used Oracle 12c however any version of Oracle from 11.2.0.1 has support for 4KB Sector Disk Drives . To configure the OS I ran the Oracle pre-install RPM.


[root@sandep1 ~]# yum list | grep rdbms
oracle-rdbms-server-11gR2-preinstall.x86_64



Since I used Oracle 12c another change since the previous blog post is that raw devices are no longer supported, they still work exactly as before but there is little gain from testing unsupported configurations so I focused entirely on ASM.  Using the Intel(R) SSD Data Center Tool I could see that the SSD 910 presented 4 200GB drives to the OS and ASM will configure all 4 in a single disk group.


  | 0     | INTEL(R) SSD 910 200GB | 
  | 1     | INTEL(R) SSD 910 200GB |
  | 2     | INTEL(R) SSD 910 200GB |
  | 3     | INTEL(R) SSD 910 200GB | 



I had previously used the disks for numerous workloads and did not perform a write erase beforehand to ensure that the disks were operating in a steady state type configuration. I had previously partitioned all 4 disks with fdisk –c –u to disable DOS compatibility. The partition should start at sector 2048. Note that the physical sector size is reported as 4KB.. The logical sector size is reported as 512 bytes for backward compatibility.



root@sandep1 ~]# fdisk -c -u -l /dev/sda
Disk /dev/sda: 200.0 GB, 200049647616 bytes
14 heads, 28 sectors/track, 996739 cylinders, total 390721968 sectors
Units = sectors of 1 * 512 = 512 bytesSector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0xbaeb87f9
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048   390721967   195359960   83  Linux



Looking at the X25-E RAID configuration that is reported as 512 byte sectors as noted in the first blog post.



Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes



As I am using ASM I next installed ASMLIB. As I used the UEK kernel the kmod-oracleasm RPM is not required however for Red Hat it is required as an additional install. For all systems you also need both oracleasm-support and oracleasmlib of which only oracleasm-support is available on the install media however you can get both from the Oracle public Yum server.



[root@sandep1 ~]# yum list | grep oracleasm
oracleasm-support.x86_64              2.1.8-1.el6                      @/oracleasm-support-2.1.8-1.el6.x86_64
oracleasmlib.x86_64                   2.0.4-1.el6                      @/oracleasmlib-2.0.4-1.el6.x86_64



After installing and configuring ASMLIB I then created ASM disks on the SSD 910 disks and the single X25-E presented device.


[root@sandep1 disks]# /etc/init.d/oracleasm createdisk SSD910A /dev/sda1
Marking disk "SSD910A" as an ASM disk:                     [  OK  ]
[root@sandep1 disks]# /etc/init.d/oracleasm createdisk SSD910B /dev/sdb1
Marking disk "SSD910B" as an ASM disk:                     [  OK  ]
[root@sandep1 disks]# /etc/init.d/oracleasm createdisk SSD910C /dev/sdc1
Marking disk "SSD910C" as an ASM disk:                     [  OK  ]
[root@sandep1 disks]# /etc/init.d/oracleasm createdisk SSD910D /dev/sdd1
Marking disk "SSD910D" as an ASM disk:                     [  OK  ]



And checked that the disks could be discovered


[root@sandep1 media]# /usr/sbin/oracleasm-discover 
Using ASMLib from /opt/oracle/extapi/64/asm/orcl/1/libasm.so
[ASM Library - Generic Linux, version 2.0.4 (KABI_V2)]
Discovered disk: ORCL:SSD910A [48839990 blocks (200048599040 bytes), maxio 64]
Discovered disk: ORCL:SSD910B [48839990 blocks (200048599040 bytes), maxio 64]
Discovered disk: ORCL:SSD910C [48839990 blocks (200048599040 bytes), maxio 64]
Discovered disk: ORCL:SSD910D [48839990 blocks (200048599040 bytes), maxio 64]
Discovered disk: ORCL:X25E [991950848 blocks (507878834176 bytes), maxio 512]



Note that in ASMLIB there is an option to configure disks using the logical or physical sector size. I left this at the default which is to use the physical sector size. Firing up the asmca utility I could see my disks using the default diskstring of ORCL:*. If you cannot see you disks do not change the disk string to /dev/oracleasm/disks*, this bypasses ASMLIB and one of the impacts is that it will not detect the correct disk sector size. I created 2 diskgroups as shown.

 

 

ASMCA.png


Checking in ASM I could see the diskgroups had been configured correctly. Note that the correct sector size has been detected automatically, no underscore parameters in  particular _disk_sector_size_override have been used. – this is not required.



[oracle@sandep1 ~]$ sqlplus / as sysasm
SQL*Plus: Release 12.1.0.1.0 Production on Thu Jul 18 09:48:41 2013
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Automatic Storage Management option
SQL> select name, sector_size from v$asm_diskgroup;
NAME                        SECTOR_SIZE
------------------------------ -----------
X25E                              512
SSD910                           4096



Next I fired up the dbca utility to create a database, for fairness I used the SSD910 diskgroup for all data storage as well as redo and will only put redo on the X25-E

 

SSD910create.png


This ran through to completion however on starting the new database it reported the following error.



SQL> startup
ORA-01078: failure in processing system parameters
ORA-15081: failed to submit an I/O operation to a disk
ORA-27091: unable to queue I/O
ORA-17507: I/O request size 512 is not a multiple of logical block size
ORA-06512: at line 4


My Oracle Support  shows an Oracle bug  “Bug 14626924  Not able to read spfile from ASM diskgroup and disk with sector size of 4096” with a workaround of “Use a pfile rather than an ASM resident spfile”.

I used ASMCMD to copy the spfile, and the strings command to extract the parameters and create a pfile (using SQL*Plus gives the error ORA-17512: Block Verification Failed)


ASMCMD> cp spfileSBDB1.ora /home/oracle/sbdb1newpfile;
copying +SSD910/SBDB1/spfileSBDB1.ora -> /home/oracle/sbdb1newpfile;


I then created an spfile from the pfile, this can be either external to ASM or in another ASM diskgroup without a secotr size of 4KB until the Oracle bug is fixed.


SQL> create spfile='/u01/app/oracle/product/12.1.0/dbhome_1/dbs/spfileSBDB1.ora' from pfile='/home/oracle/ sbdb1newpfile';
File created.
SQL> 



And an init.ora file to point to the spfile


[oracle@sandep1 dbs]$ more initSBDB1.ora
SPFILE='/u01/app/oracle/product/12.1.0/dbhome_1/dbs/spfileSBDB1.ora'



The Oracle instance then started as expected:


SQL> startup
ORACLE instance started.
Total System Global Area 9.8615E+10 bytes
Fixed Size               4661112 bytes
Variable Size        7784628360 bytes
Database Buffers     9.0194E+10 bytes
Redo Buffers           631664640 bytes
Database mounted.
Database opened.
SQL> 



Checking the disk group also showed the correct sector size, this time from the database instance.



SQL> select name, SECTOR_SIZE from v$asm_diskgroup;
NAME                        SECTOR_SIZE 
------------------------------ ----------- 
X25E                              512
SSD910                           4096



So since we are interested in redo I took a look at the default blocksize used for the redo logs, by default these were 4KB.



SQL> select group#, blocksize from v$log;
    GROUP#  BLOCKSIZE
---------- ----------
        1     4096
        2     4096
        3     4096



I could also see what happens when I create new logfiles without specifying a block size.


SQL> alter database add logfile '+X25E' size 512m;
Database altered.
SQL> alter database add logfile '+SSD910' size 512m;
Database altered.



Oracle has correctly identified the underlying sector size and automatically created the logfile with the correct blocksize.



SQL> select group#, blocksize from v$log;
    GROUP#  BLOCKSIZE
---------- ----------
        1     4096
        2     4096
        3     4096
        4      512
        5     4096



I then deleted the existing redo log files and created four larger redo log files (so as not to checkpoint during a test) with 2 on X25-E and 2 on SSD 910.


In summary I have configured an Oracle database with both data and redo on the SSD 910. I ensured that my disks were correctly partitioned to be aligned, I selected an OS and database that was 4KB sector aware and ensured that ASM was configured correctly in order that the disk were configured correctly and automatically in Oracle without  any additional underscore parameters or block size settings required. I was then ready to run a few tests to compare performance.

In previous posts in this series I explored the steps in writing an existing SQL query to instead run in Hadoop as a Map Reduce Job. These blog posts very much complement the slides from my session at Oracle Openworld 2012.


This series would not be complete however without mentioning other alternatives for achieving the same aim, notably Hive and Pig. here I will focus on Hive mainly as there is already detailed information on TPC-H queries including query 1 to run on Pig here. Pig provides a script based approach to writing Map Reduce jobs whereas Hive uses a SQL like interface and therefore for this particular example translating a SQL Query Hive is the most appropriate tool here without necessarily assuming that it would be better for all circumstances.


Hive automatically translates a language called HiveQL a subset of SQL into Map Reduce jobs enabling you to take advantage of your existing SQL skills, there are limitations however and Hive for example does not support all of the SQL functionality available in a relational databases such as correlated subqueries.


To use Hive you need to download and install the additional Hive package. When you run Hive you are presented with a Hive prompt with the first stage being to define the table to query. This example uses the data loaded in Pt1 of this blog series whilst noting that Hive does not support all of the data types available in an Oracle database.

 

 

CREATE EXTERNAL TABLE lineitem (
L_ORDERKEY INT
,L_PARTKEY INT
,L_SUPPKEY INT
,L_LINENUMBER INT
,L_QUANTITY INT
,L_EXTENDEDPRICE DOUBLE
,L_DISCOUNT DOUBLE
,L_TAX DOUBLE
,L_RETURNFLAG STRING
,L_LINESTATUS STRING
,L_SHIPDATE STRING
,L_COMMITDATE STRING
,L_RECEIPTDATE STRING
,L_SHIPINSTRUCT STRING
,L_SHIPMODE STRING
,L_COMMENT STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE
LOCATION '/user/input/tpch/256';

 

With the table defined you can then run your SQL at the Hive prompt. The main modification is the simplification of the l_shipdate field. Also knowing the characteristics of the Job we set the number of reducers to 1 before the query is run with the following example output.

 

 

hive> set hive.exec.reducers.max=1;
hive> select 
    >  l_returnflag, 
    >  l_linestatus, 
    >  sum(l_quantity) as sum_qty, 
    >  sum(l_extendedprice) as sum_base_price, 
    >  sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, 
    >  sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, 
    >  avg(l_quantity) as avg_qty, 
    >  avg(l_extendedprice) as avg_price, 
    >  avg(l_discount) as avg_disc, 
    >  count(*) as count_order 
    > from lineitem 
    > where l_shipdate <= '1998-08-11'
    > group by l_returnflag, l_linestatus 
    > order by l_returnflag, l_linestatus;
Total MapReduce jobs = 2
Launching Job 1 out of 2
...
Hadoop job information for Stage-1: number of mappers: 297; number of reducers: 1
2012-09-06 16:36:20,579 Stage-1 map = 0%,  reduce = 0%
2012-09-06 16:36:49,465 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 311.34 sec
...
2012-09-06 16:40:10,495 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5565.64 sec
...
Launching Job 2 out of 2
...
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2012-09-06 16:40:14,217 Stage-2 map = 0%,  reduce = 0%
...
2012-09-06 16:40:20,425 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 1.77 sec
...
MapReduce Jobs Launched: 
Job 0: Map: 297  Reduce: 1   Accumulative CPU: 5565.64 sec   HDFS Read: 0 HDFS Write: 0 SUCESS
Job 1: Map: 1  Reduce: 1   Accumulative CPU: 1.77 sec   HDFS Read: 0 HDFS Write: 0 SUCESS
...
OK
A F 3775127758 5.660776097194458E12 5.377736398183936E12 
5.59284742951593E12 25.499370423275426 38236.116984304950.050002243530958 148047881
N F 98553062 1.4777109838598007E11 1.4038496596503482E11 
1.4599979303277582E11 25.501556956882876 38237.1993888045250.049985284338051675 3864590
N O 7318871052 1.09746348095171E13 1.0425912729393264E13 
1.084295716582229E13 25.499840883624667 38236.9684354666460.04999754512223738 287016342
R F 3775724970 5.661603032745331E12 5.378513563915406E12 
5.593662252666915E12 25.50006628406532 38236.697258452910.050001304339683 148067261


It is interesting to observe that the job ran as 2 map reduce jobs instead of the one for the Job written by hand. Nevertheless the result was the same and the completion time as similar. However Hive significantly saved on the time it took to write the initial MapReduce Job to do the same thing. Consequently anyone looking to Hadoop  should certainly consider tools such as Hive and Pig for writing Map Reduce jobs especially when experienced with existing database environments.

Big Data is a key enterprise topic at present so this year at Oracle Openworld I'll be taking a technical look at Big Data in an Oracle context including Hadoop, Hive and the Oracle Database. Check out the Intel IT Center for an introduction to Big Data and other resources.

 

What I will be doing in this Openworld session on Big Data Performance rather than work with a standard example such as 'word count' is to take a hands-on example of a straightforward database query, show how that translates into a Hadoop MapReduce job and look at the performance and tuning characteristics. I will also post and work through the Hadoop example here on the Data Stack so you can run it in your own test environment. I will also look at the same query running under Hive and the Oracle Database to give the all important context of how Big Data fits into your existing Data Centre environment.

 

Look out our for the example workload in upcoming blog posts and if you are at Oracle Openworld 2012 I hope to see you at my session!

Session ID: CON4024

Optimal Oracle Systems for Big Data Performance and Scalability

Moscone South Room: 236

Date: Wednesday 3rd October Time: 10:15 - 11:15

 

Are you going to be in San Francisco, but don't have a ticket to the big event? Here's an opportunity to get a FREE Discover Pass to the event!

Head over to The Server Room Facebook Page and answer this simple question:

 

What’s the next leap for big data?

Cloud; Small Business; Mobile, or Nowhere: Big Data Will Stay with the Big Enterprise – Or something else? Share with us what you think!



You’ll need to be a US citizen, over 18 years old, and are responsible for your own travel/lodging costs should you be picked as one of the lucky respondents.

Filter Blog

By author:
By date:
By tag: