The Data Stack

12 Posts authored by: Steve Shaw

In previous posts in this series I explored the steps in writing an existing SQL query to instead run in Hadoop as a Map Reduce Job. These blog posts very much complement the slides from my session at Oracle Openworld 2012.


This series would not be complete however without mentioning other alternatives for achieving the same aim, notably Hive and Pig. here I will focus on Hive mainly as there is already detailed information on TPC-H queries including query 1 to run on Pig here. Pig provides a script based approach to writing Map Reduce jobs whereas Hive uses a SQL like interface and therefore for this particular example translating a SQL Query Hive is the most appropriate tool here without necessarily assuming that it would be better for all circumstances.


Hive automatically translates a language called HiveQL a subset of SQL into Map Reduce jobs enabling you to take advantage of your existing SQL skills, there are limitations however and Hive for example does not support all of the SQL functionality available in a relational databases such as correlated subqueries.


To use Hive you need to download and install the additional Hive package. When you run Hive you are presented with a Hive prompt with the first stage being to define the table to query. This example uses the data loaded in Pt1 of this blog series whilst noting that Hive does not support all of the data types available in an Oracle database.

 

 

CREATE EXTERNAL TABLE lineitem (
L_ORDERKEY INT
,L_PARTKEY INT
,L_SUPPKEY INT
,L_LINENUMBER INT
,L_QUANTITY INT
,L_EXTENDEDPRICE DOUBLE
,L_DISCOUNT DOUBLE
,L_TAX DOUBLE
,L_RETURNFLAG STRING
,L_LINESTATUS STRING
,L_SHIPDATE STRING
,L_COMMITDATE STRING
,L_RECEIPTDATE STRING
,L_SHIPINSTRUCT STRING
,L_SHIPMODE STRING
,L_COMMENT STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE
LOCATION '/user/input/tpch/256';

 

With the table defined you can then run your SQL at the Hive prompt. The main modification is the simplification of the l_shipdate field. Also knowing the characteristics of the Job we set the number of reducers to 1 before the query is run with the following example output.

 

 

hive> set hive.exec.reducers.max=1;
hive> select 
    >  l_returnflag, 
    >  l_linestatus, 
    >  sum(l_quantity) as sum_qty, 
    >  sum(l_extendedprice) as sum_base_price, 
    >  sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, 
    >  sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, 
    >  avg(l_quantity) as avg_qty, 
    >  avg(l_extendedprice) as avg_price, 
    >  avg(l_discount) as avg_disc, 
    >  count(*) as count_order 
    > from lineitem 
    > where l_shipdate <= '1998-08-11'
    > group by l_returnflag, l_linestatus 
    > order by l_returnflag, l_linestatus;
Total MapReduce jobs = 2
Launching Job 1 out of 2
...
Hadoop job information for Stage-1: number of mappers: 297; number of reducers: 1
2012-09-06 16:36:20,579 Stage-1 map = 0%,  reduce = 0%
2012-09-06 16:36:49,465 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 311.34 sec
...
2012-09-06 16:40:10,495 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5565.64 sec
...
Launching Job 2 out of 2
...
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2012-09-06 16:40:14,217 Stage-2 map = 0%,  reduce = 0%
...
2012-09-06 16:40:20,425 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 1.77 sec
...
MapReduce Jobs Launched: 
Job 0: Map: 297  Reduce: 1   Accumulative CPU: 5565.64 sec   HDFS Read: 0 HDFS Write: 0 SUCESS
Job 1: Map: 1  Reduce: 1   Accumulative CPU: 1.77 sec   HDFS Read: 0 HDFS Write: 0 SUCESS
...
OK
A F 3775127758 5.660776097194458E12 5.377736398183936E12 
5.59284742951593E12 25.499370423275426 38236.116984304950.050002243530958 148047881
N F 98553062 1.4777109838598007E11 1.4038496596503482E11 
1.4599979303277582E11 25.501556956882876 38237.1993888045250.049985284338051675 3864590
N O 7318871052 1.09746348095171E13 1.0425912729393264E13 
1.084295716582229E13 25.499840883624667 38236.9684354666460.04999754512223738 287016342
R F 3775724970 5.661603032745331E12 5.378513563915406E12 
5.593662252666915E12 25.50006628406532 38236.697258452910.050001304339683 148067261


It is interesting to observe that the job ran as 2 map reduce jobs instead of the one for the Job written by hand. Nevertheless the result was the same and the completion time as similar. However Hive significantly saved on the time it took to write the initial MapReduce Job to do the same thing. Consequently anyone looking to Hadoop  should certainly consider tools such as Hive and Pig for writing Map Reduce jobs especially when experienced with existing database environments.

Big Data is a key enterprise topic at present so this year at Oracle Openworld I'll be taking a technical look at Big Data in an Oracle context including Hadoop, Hive and the Oracle Database. Check out the Intel IT Center for an introduction to Big Data and other resources.

 

What I will be doing in this Openworld session on Big Data Performance rather than work with a standard example such as 'word count' is to take a hands-on example of a straightforward database query, show how that translates into a Hadoop MapReduce job and look at the performance and tuning characteristics. I will also post and work through the Hadoop example here on the Data Stack so you can run it in your own test environment. I will also look at the same query running under Hive and the Oracle Database to give the all important context of how Big Data fits into your existing Data Centre environment.

 

Look out our for the example workload in upcoming blog posts and if you are at Oracle Openworld 2012 I hope to see you at my session!

Session ID: CON4024

Optimal Oracle Systems for Big Data Performance and Scalability

Moscone South Room: 236

Date: Wednesday 3rd October Time: 10:15 - 11:15

 

Are you going to be in San Francisco, but don't have a ticket to the big event? Here's an opportunity to get a FREE Discover Pass to the event!

Head over to The Server Room Facebook Page and answer this simple question:

 

What’s the next leap for big data?

Cloud; Small Business; Mobile, or Nowhere: Big Data Will Stay with the Big Enterprise – Or something else? Share with us what you think!



You’ll need to be a US citizen, over 18 years old, and are responsible for your own travel/lodging costs should you be picked as one of the lucky respondents.

In Pts 2 to 3 of this blog post I wrote the Map, Reduce and Combiner functions to process TPCH Query 1 in Hadoop. The final stage is to write the job control section and run the job to see how it performs in a test environment. An example job control section looks as follows, note that on earlier versions of Hadoop you need to replace Job.getInstance() with Job(new Configuration() to ensure that the Job will compile.

 

 

public static void main(String[] args) throws Exception {  
      Job job = Job.getInstance();  
      /* Job job = new Job(new Configuration()); */
      job.setNumReduceTasks(1);
      job.setJarByClass(TpchQ1.class);
      job.setJobName("tpchq1");  
      FileInputFormat.addInputPath(job, new Path(args[0]));  
      FileOutputFormat.setOutputPath(job, new Path(args[1]));  
      job.setMapOutputKeyClass(Text.class);  
      job.setMapOutputValueClass(DoubleArrayWritable.class);  
  
      job.setOutputKeyClass(Text.class);  
      job.setOutputValueClass(Text.class);  
  
      job.setMapperClass(Map.class);  
      job.setCombinerClass(Combine.class);
      job.setReducerClass(Reduce.class);  
  
System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

 

 

Now we have the final Job and the full code is attached to the end of this blog post ready to compile and run. Copy this job and compile and package it as follows:

 


javac -classpath `./hadoop-2.0.0-cdh4.0.1/bin/hadoop classpath` -d tpchq1_classes/ TpchQ1.java
jar cvf ./tpchq1.jar -C tpchq1_classes/ .

 

The only remaining step is to run the job. Note that in the Job control section we specified an input and output path so these arguments are shown below with the input path being the path to the lineitem.tbl file loaded into HDFS in Pt1. You can track the job with your web browser by accessing the URL given in the job information as it runs.

 

 

./hadoop-2.0.0-cdh4.0.1/bin/hadoop jar tpchq1.jar org.myorg.TpchQ1 /user/input/tpch/lineitem256.tbl /user/output/tpch/TPCHQ1
...
INFO mapreduce.Job: The url to track the job: http://sandep1.example.com:8088/proxy/application_1344269167473_0011/
...
INFO mapreduce.Job: Job job_1344269167473_0011 running in uber mode : false
INFO mapreduce.Job:  map 0% reduce 0%
INFO mapreduce.Job:  map 1% reduce 0%
INFO mapreduce.Job:  map 2% reduce 0%
INFO mapreduce.Job:  map 3% reduce 0%
INFO mapreduce.Job:  map 4% reduce 0%
INFO mapreduce.Job:  map 5% reduce 0%
...
INFO mapreduce.Job:  map 96% reduce 29%
INFO mapreduce.Job:  map 97% reduce 30%
INFO mapreduce.Job:  map 99% reduce 30%
INFO mapreduce.Job:  map 100% reduce 30%
INFO mapreduce.Job:  map 100% reduce 100%
INFO mapreduce.Job: Job job_1344269167473_0011 completed successfully

 

When complete you can use either the command line tools or the web interface at default port 50070 to view the output file. For example (http://sandep1.example.com:50070). The output should resemble the following with which we have answered TPC-H Query 1.

 

 

A F  3.775127758E9 5.660776097194444E12 5.377736398183933E12 5.592847429515926E12 25.499370423275426 38236.116984304856 0.05000224353100122 1.48047881E8
N F  9.8553062E7 1.4777109838597998E11 1.403849659650348E11 1.4599979303277582E11 25.501556956882876 38237.199388804496 0.04998528433805351 3864590.0
N O  7.318871052E9 1.0974634809517127E13 1.0425912729393295E13 1.0842957165822328E13 25.499840883624667 38236.96843546674 0.04999754512239288 2.87016342E8
R F  3.77572497E9 5.661603032745334E12 5.378513563915411E12 5.593662252666924E12 25.50006628406532 38236.69725845293 0.05000130433972626 1.48067261E8

 

You might have observed that nowhere did we specify the ordering of the output  that is in the original SQL statement "order by l_returnflag, l_linestatus", this is because the ordering is the default behaviour of Hadoop so the output was sorted without being specified.

 

 

If you are interested in Job performance and tuning then come learn more at the Oracle Openworld 2012 Session "Optimal Oracle Systems for Big Data Performance and Scalability"

Session ID: CON4024
Session Title: Optimal Oracle Systems for Big Data Performance and Scalability
Moscone South Room: 236
Date: Wednesday 3rd October
Time: 10:15 - 11:15

 

 


/*
HADOOP MR Job to process TPCH Query 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Query 1 (in Oracle Database Syntax)
select 
l_returnflag, 
l_linestatus, 
sum(l_quantity) as sum_qty, 
sum(l_extendedprice) as sum_base_price, 
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, 
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, 
avg(l_quantity) as avg_qty, 
avg(l_extendedprice) as avg_price, 
avg(l_discount) as avg_disc, 
count(*) as count_order from lineitem where l_shipdate <= date '1998-12-01' - interval '112' day (3) 
group by l_returnflag, l_linestatus 
order by l_returnflag, l_linestatus
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Input data is file called linieitem.tbl generated by dbgen available from tpc.org
Example Row of expected input data is as follows:
1|15518935|768951|1|17|33203.72|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
In this example l_shipdate in the where clause is <= 11-AUG-98 i.e.
SQL> select date '1998-12-01' - interval '112' day (3) from dual;
DATE'1998
---------
11-AUG-98
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Example Output on SF100 Schema 
A F  3.775127758E9 5.660776097194444E12 5.377736398183933E12 5.592847429515926E12 25.499370423275426 38236.116984304856 0.05000224353100122 1.48047881E8
N F  9.8553062E7 1.4777109838597998E11 1.403849659650348E11 1.4599979303277582E11 25.501556956882876 38237.199388804496 0.04998528433805351 3864590.0
N O  7.318871052E9 1.0974634809517127E13 1.0425912729393295E13 1.0842957165822328E13 25.499840883624667 38236.96843546674 0.04999754512239288 2.87016342E8
R F  3.77572497E9 5.661603032745334E12 5.378513563915411E12 5.593662252666924E12 25.50006628406532 38236.69725845293 0.05000130433972626 1.48067261E8
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Example compile and run using Cloudera Hadoop
javac -classpath `./hadoop-2.0.0-cdh4.0.1/bin/hadoop classpath` -d tpchq1_classes/ TpchQ1.java
jar cvf ./tpchq1.jar -C tpchq1_classes/ .
./hadoop-2.0.0-cdh4.0.1/bin/hadoop jar tpchq1.jar org.myorg.TpchQ1 /user/input/tpch/lineitem.tbl /user/output/tpch/TPCHQ1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*/
package org.myorg;  
import java.io.IOException;  
import java.util.*;  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.*;  
import org.apache.hadoop.mapreduce.Mapper;  
import org.apache.hadoop.mapreduce.Reducer;  
import org.apache.hadoop.mapreduce.Job;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.util.*;  
  
public class TpchQ1 {  
  public static class Map extends Mapper <LongWritable, Text, Text, DoubleArrayWritable> {  
@Override 
      public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {  
     Text tpchq1key = new Text();
     DoubleWritable[] outArray = new DoubleWritable[6];
     /* Set and clear date to use in the where clause as 11th August 1998 */
     Calendar wc = Calendar.getInstance();
     wc.clear();
     wc.set(1998,8,11,0,0,0);
     wc.set(Calendar.MILLISECOND, 0);
     /* Set and clear date to use for input */
     Calendar win = Calendar.getInstance();
     win.clear();
     win.set(1970,1,1,0,0,0);
     win.set(Calendar.MILLISECOND, 0);
/* Parse each line of lineitem.tbl and extract the 6 fields of interest for the SQL Query starting from field 5 */
     String line = value.toString();  
     StringTokenizer tokenizer = new StringTokenizer(line,"|");  
int i = 1;
double l_extendedprice = 0.0;
double l_discount = 0.0;
double l_tax = 0.0;
        while (tokenizer.hasMoreTokens()) {  
String token = tokenizer.nextToken();
if(i == 5) {
/* Fields 5 to 8 are the values */
/* L_QUANTITY */
outArray[0] = new DoubleWritable(Double.parseDouble(token)); 
}
if(i == 6) {
/* L_EXTENDEDPRICE and SUM_BASE_PRICE */
outArray[1] = new DoubleWritable(Double.parseDouble(token)); 
/* */
}
else if (i == 7) {
/* L_DISCOUNT and SUM_DISCOUNT */
outArray[2] = new DoubleWritable(Double.parseDouble(token)); 
}
else if ( i == 8 ) {
/* L_TAX */
l_tax = (Double.parseDouble(token)); 
}
/* Fields 9 and 10 are the GROUP BY expression in the SQL query and are therefore concatendated as the Key */
else if (i == 9) {
/* L_RETURNFLAG */
tpchq1key.set(tpchq1key.toString()+" "+token);
}
else if (i == 10) {
/* L_LINESTATUS */
tpchq1key.set(tpchq1key.toString()+" "+token);
}
else if (i == 11) {
/* L_SHIPDATE */
StringTokenizer wheredate = new StringTokenizer(token,"-");
win.set(Integer.parseInt(wheredate.nextToken()),Integer.parseInt(wheredate.nextToken()),Integer.parseInt(wheredate.nextToken()));
}
i++;
    }  
/* Implement the WHERE clause, only collect the output if L_SHIPDATE <= (1998,8,11) */
if(win.getTimeInMillis() <= wc.getTimeInMillis()) {
l_extendedprice = outArray[1].get();
l_discount = outArray[2].get();
/* Calculate Sum disc price and Sum Charge */
double sum_disc_price = (l_extendedprice * (1 - l_discount));
double sum_charge = (l_extendedprice * (1 - l_discount) * (1 + l_tax));
outArray[3] = new DoubleWritable(1.0);
outArray[4] = new DoubleWritable(sum_disc_price);
outArray[5] = new DoubleWritable(sum_charge);
DoubleArrayWritable da = new DoubleArrayWritable();
da.set(outArray);
context.write(tpchq1key, da);
  } 
   }  
}
public static class Combine extends Reducer <Text, DoubleArrayWritable, Text, DoubleArrayWritable> {   
@Override
public void reduce(Text key, Iterable<DoubleArrayWritable> values, Context context) throws IOException, InterruptedException {  
int i = 1;
DoubleWritable[] outArray = new DoubleWritable[6];
DoubleWritable[] x;
double sum_qty = 0.0;
double sum_base_price = 0.0;
double sum_discount = 0.0;
double count_star = 0.0;
double sum_disc_price = 0.0;
double sum_charge = 0.0;
/* In Combiner calculate interim sum values */
for (DoubleArrayWritable val : values) { 
x = (DoubleWritable[]) val.toArray();
sum_qty += x[0].get();
sum_base_price += x[1].get();
sum_discount += x[2].get();
count_star += x[3].get();
sum_disc_price += x[4].get();
sum_charge += x[5].get();
  }
outArray[0] = new DoubleWritable(sum_qty); 
outArray[1] = new DoubleWritable(sum_base_price); 
outArray[2] = new DoubleWritable(sum_discount); 
outArray[3] = new DoubleWritable(count_star); 
outArray[4] = new DoubleWritable(sum_disc_price);
outArray[5] = new DoubleWritable(sum_charge);
DoubleArrayWritable da = new DoubleArrayWritable();
da.set(outArray);
context.write(key, da);
     }  
}
public static class Reduce extends Reducer <Text, DoubleArrayWritable, Text, Text> {   
@Override
public void reduce(Text key, Iterable<DoubleArrayWritable> values, Context context) throws IOException, InterruptedException {  
String line;
int i = 1;
DoubleWritable[] x;
double sum_qty = 0.0;
double sum_base_price = 0.0;
double sum_discount = 0.0;
double sum_disc_price = 0.0;
double sum_charge = 0.0;
double avg_qty = 0.0;
double avg_price = 0.0;
double avg_disc = 0.0;
double count_star = 0.0;
/* In reducer calculate sum values and averages */
for (DoubleArrayWritable val : values) {
x = (DoubleWritable[]) val.toArray();
sum_qty += x[0].get();
sum_base_price += x[1].get();
sum_discount += x[2].get();
count_star += x[3].get();
sum_disc_price += x[4].get();
sum_charge += x[5].get();
        }
avg_qty = sum_qty/count_star;
avg_price = sum_base_price/count_star;
avg_disc = sum_discount/count_star;
/* Format and collect the output */
Text tpchq1redval = new Text(" "+sum_qty+" "+sum_base_price+" "+sum_disc_price+" "+sum_charge+" "+avg_qty+" "+avg_price+" "+avg_disc+" "+count_star);
       context.write(key, tpchq1redval);  
     }  
}
public static class DoubleArrayWritable extends ArrayWritable { 
public DoubleArrayWritable() { 
    super(DoubleWritable.class); 
} 
public DoubleArrayWritable(DoubleWritable[] values) { 
    super(DoubleWritable.class, values); 
} 
public DoubleWritable get(int idx) { 
    return(DoubleWritable) get()[idx]; 
} 
} 
  
    public static void main(String[] args) throws Exception {  
      Job job = Job.getInstance();  
      /* Job job = new Job(new Configuration()); */
      job.setNumReduceTasks(1);
      job.setJarByClass(TpchQ1.class);
      job.setJobName("tpchq1");  
      FileInputFormat.addInputPath(job, new Path(args[0]));  
      FileOutputFormat.setOutputPath(job, new Path(args[1]));  
      job.setMapOutputKeyClass(Text.class);  
      job.setMapOutputValueClass(DoubleArrayWritable.class);  
  
      job.setOutputKeyClass(Text.class);  
      job.setOutputValueClass(Text.class);  
  
      job.setMapperClass(Map.class);  
      job.setCombinerClass(Combine.class);
      job.setReducerClass(Reduce.class);  
  
System.exit(job.waitForCompletion(true) ? 0 : 1);
    }  
}

In my previous post you can find details of my session at Oracle Openworld 2012 where I will be talking about Big Data and Hadoop. Here I want to take a closer look at a specific existing query that runs against a relational database and instead write it as a MapReduce job to run under Hadoop as i cover in this session. I want to do this to work with an example more familiar to people who already work with databases to get a feel for contrasting technologies. Before doing so you should note that Hive can do this translation from a SQL like language for us so I will look a look at this later instead starting from the ground up.

 

I will not be covering an introduction to Big Data, you can find that here (add link), or covering the installation of Hadoop, you can find a lot of installation guides online. However it is useful to know that for testing I installed the Cloudera CDH4 distribution of Hadoop in pseudo-distributed mode on Oracle Linux 5.6 on a 2 socket Intel Xeon E5-2680 server with an Intel SSD 910 for storage. For a clustered environment I also ran the same workload on an Oracle Big Data Appliance which I cover in my OOW session.

 

Firstly to run the type of test that I want to do you need a query to port and a dataset. Fortunately there is the TPC-H workload that gives us everything we need so I will take a look at TPC-H Query 1. TPCH- Query 1 is the pricing summary report query giving the summary pricing report for lineitems shipped as of a given date. I haven't chosen query by random however, it has the advatange of being very simple to work with for purposes of an example. In particular the query accesses the data from just one table, the lineitem table and requires only one full table scan or pass through the data to satisfy the query so there are no joins or multiple MapReduce jobs to replicate the same workload in Hadoop.

 

In SQL Query 1 (in Oracle Database format) looks as follows:

 

 

select 
l_returnflag, 
l_linestatus, 
sum(l_quantity) as sum_qty, 
sum(l_extendedprice) as sum_base_price, 
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, 
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, 
avg(l_quantity) as avg_qty, 
avg(l_extendedprice) as avg_price, 
avg(l_discount) as avg_disc, 
count(*) as count_order from lineitem where l_shipdate <= date '1998-12-01' - interval '112' day (3) 
group by l_returnflag, l_linestatus 
order by l_returnflag, l_linestatus

 

Note that the l_shipdate in the where clause can change but in this example l_shipdate in the where clause is <= 11-AUG-98. The easisest way to check this is from within the Oracle database, for example:

 

 

SQL> select date '1998-12-01' - interval '112' day (3) from dual;
DATE'1998
---------
11-AUG-98

 

The statement is a number of aggregate functions (namely SUM(), AVG(), COUNT()) grouped and ordered by l_returnflag, l_linestatus so these will need to be implemented in Hadoop.

 

As there is only one table queried (from lineitem) only the data for this table is required.  To create the data for this table download and compile the dbgen utility from the TPC and generate the lineitem.tbl data file as follows:

 

 

[oracle@sandep1 ~]$ ./dbgen -s 100 -T L

 

I have chosen a Scale Factor 100 schema resulting in a lineitem file of almost 80GB containing approximately 600 million rows.

 

 

[oracle@sandep1 input]$ ls -l lineitem.tbl 
-rw-r--r-- 1 hadoop hadoop 79579694556 Jul 13 11:26 lineitem.tbl

 

 

The first five lines of this file look as follows:

 

 

1|15518935|768951|1|17|33203.72|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
1|6730908|730909|2|36|69788.52|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
1|6369978|369979|3|8|16381.28|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
1|213150|463151|4|28|29767.92|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
1|2402664|152671|5|24|37596.96|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|

 

and if you were to load this data into an Oracle database the fields in the lineitem.tbl file would correspond as follows to the lineitem table (You need to know this in order to know what fields to extract when writing a job for Hadoop).

 

 

CREATE TABLE lineitem (L_ORDERKEY  NUMBER(10),  
                       L_PARTKEY NUMBER(10),  
                       L_SUPPKEY NUMBER(10),  
                       L_LINENUMBER  NUMBER(38),  
                       L_QUANTITY NUMBER,  
                       L_EXTENDEDPRICE   NUMBER,  
                       L_DISCOUNT NUMBER,  
                       L_TAX  NUMBER,  
                       L_RETURNFLAG  CHAR(1),  
                       L_LINESTATUS CHAR(1),  
                       L_SHIPDATE  VARCHAR2(10),  
                       L_COMMITDATE VARCHAR2(10),  
                       L_RECEIPTDATE  VARCHAR2(10),  
                       L_SHIPINSTRUCT  VARCHAR2(25),  
                       L_SHIPMODE VARCHAR2(10),  
                       L_COMMENT VARCHAR2(44)) 

 

 

 

However as the data is unstructured you can go ahead and load it straight into the Hadoop distributed file system. The following example specifies a 256MB block size.

 

 

/hadoop/hadoop-2.0.0-cdh4.0.1/bin/hadoop dfs -Ddfs.block.size=268435456 -put /hadoop2/input/lineitem.tbl /user/input/tpch/lineitem256.tbl

 

 

Now we have both the sample data loaded into Hadoop and an example query in the next post I will look at rewriting this same query as a MapReduce job.

In Pt1 of this blog post I looked at a SQL Query and data set to run in Hadoop and in Pt2 wrote the Map function to extract the relevant fields from the data set to satisfy the query. At this point however we still have not implemented any of the aggregate functions and still have a large key and value intermediate data set. The only data eliminated so far has been the lines examined where the date was not less than or equal to 11-AUG-98. On the test data set out of the initial 600037902 lines of data we now have 586996074 lines remaining, to complete the query we now need to write the reduce phase.
The Reduce method will extend the Reducer class. This needs to accept the intermediate key value pairs output by the mapper and therefore will receive as input the key which is fields 9 and 10  concatenated and the DoubleArrayWritable containing the values. For every key we need to iterate through the values and calcuate the totals required for the SUM(), AVG() and COUNT() functions. Once these have been calculated we can format the output as text to be written to a file that will give us exactly the same result as if the query had been processed by a relational database. This reduce phase will look something as follows by simply adding all of the values in the array for the SUM() functions and then dividing by the COUNT() value to calculate the result of the AVG() functions.

 

 

for (DoubleArrayWritable val : values) {
x = (DoubleWritable[]) val.toArray();
sum_qty += x[0].get();
sum_base_price += x[1].get();
sum_discount += x[2].get();
count_star += x[3].get();
sum_disc_price += x[4].get();
sum_charge += x[5].get();
        }
avg_qty = sum_qty/count_star;
avg_price = sum_base_price/count_star;
avg_disc = sum_discount/count_star;
/* Format and collect the output */
Text tpchq1redval = new Text(" "+sum_qty+" "+sum_base_price+" "+sum_disc_price+" "+sum_charge+" "+avg_qty+" "+avg_price+" "+avg_disc+" "+count_star);
       context.write(key, tpchq1redval);  
     }  
}

 

Coupled with the Map phase and a Job Control section (this will be covered in the next post on running the job) this Job is now ready to run. However as we have noted previously just for our 100GB data set the map phase will output over 58 million lines of data which will involve a lot of network traffic and disk writes. We can make this more efficient by writing a Combiner.


The Combiner also extends the Reducer and in simple cases but not all (as we will cover in a moment) can be exactly the same as the Reducer. The aim of the combiner is to perform a Reducer type operation on the subset of data produced by each Mapper which will then minimise the amount of data that needs to be transferred throughout the cluster from Map to Reduce. The single most important thing about the Combiner is that there is no certainty that it will run. It is available as an optimization but for a particular Map output it might not run at all and there is no way to force it to run. From a development perspective this has important consequences, you should be able to comment out the line in the Job Control section that calls the Combiner and the result produced by the MapReduce Job stays exactly the same. Additionally the input fields for the Combiner must be exactly the same as expected by the Reducer to operate on the Map output and the Combiner output must also correspond to the input expected by the Reducer.  If you Combiner does not adhere to these restrictions your job may compile and run and you will not receive an error, however if not implemented correctly your results may change on each run from additional factors such as changing the imput block size. Finally the Combiner operation must be both commutative and associative. In other words the Combiner operation must ensure that both changing the order of the operands as well as the grouping of the operations you perform does not change the result. In our example the SUM() function is both commutative and associative, the numbers can be summed in any order and we can perform the sum operation on different groups and the result will always remain the same. AVG() on the other hand is commutative but not associative. We can calculate the average with the input data in any order, however we cannot take an average of smaller groups of values and then take the average of this intermediate data and expect the result to be the same. For this reason the Combiner can perform the SUM() operation but not the AVG() and can look as follows producing the intermediate sum values only for the Reducer.

 

 

for (DoubleArrayWritable val : values) { 
x = (DoubleWritable[]) val.toArray();
sum_qty += x[0].get();
sum_base_price += x[1].get();
sum_discount += x[2].get();
count_star += x[3].get();
sum_disc_price += x[4].get();
sum_charge += x[5].get();
  }
outArray[0] = new DoubleWritable(sum_qty); 
outArray[1] = new DoubleWritable(sum_base_price); 
outArray[2] = new DoubleWritable(sum_discount); 
outArray[3] = new DoubleWritable(count_star); 
outArray[4] = new DoubleWritable(sum_disc_price);
outArray[5] = new DoubleWritable(sum_charge);
DoubleArrayWritable da = new DoubleArrayWritable();
da.set(outArray);
context.write(key, da);
     }  

 

At this stage we have written the Mapper, Reducer and Combiner and in Pt4 will look at adding the Job Control section to produce the completed MapReduce job. We will then consider compiling and running the job and tuning for performance.

In my previous post I wrote a simple query to convert to a MapReduce job to run under Hadoop and also described to generate and load the data to test against so if you have not done that already then start with Pt1. Now we need to write the job. To do this we need to write a map function and a reduce function both accepting key-value pairs for both input and output with the map function producing an intermediate output accepted by the reducer for input. There are options to use Hadoop Straming or Hadoop Pipes to use languages other than Java however in this case we will use Java to look at the default choice. if you don't have  strong Java skills then for the sake of example, this is not a problem, most administrators who already work with SQL and relational databases will not have Java as a core skill and therefore adapting to different languages is part of the evaluation process in adopting new technologies.

 

The aim for this example is more on functionality than achieving the highest level of optimisation possible. I should note that there is an existing example of TPC-H Query 1 written for Hadoop of which I am aware here, although it does not model the whole query so I will do it slighlty differently.

 

The Map method will extend the Mapper class. As we have noted this needs to accept key value pairs as input and write key value pairs as an intermediate output. From Pt1 we have already seen the format of the unstructured data, the first line of which is repeated here:

 

 

1|15518935|768951|1|17|33203.72|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|

 

 

We need to define the key and values and extract these from each line of the data. Identifiying the key is straightforward, the SQL statement groups by l_returnflag, l_linestatus and therefore using the create table definition in Pt1 for reference we can identify these as fields 9 and 10 which in the line above are "N|O". These can be concatenated in text format as the key. Next we need to identify the values or the data that will be manipulated to form the values, working from the SQL Statement these are l_quantity, l_extendedprice, l_discount and l_tax which in the line above are fields 5 to 8 or "17|33203.72|0.04|0.02". We also want to restrict the data from the map phase to "where l_shipdate <= date '1998-12-01' - interval '112' day (3)". As we saw in Pt2 this equates to a date of less than or equal to 11-AUG-98 and therefore we also need to examine l_shipdate which in the line above is field 11 or "1996-03-13" to only output from the map stage data that qualifies. After the map stage this date value is no longer required, however if the date matches the where clause we can also calculate the discount price as (l_extendedprice * (1 - l_discount)) and the charge as (l_extendedprice * (1 - l_discount) * (1 + l_tax)) for that particular line to include in the map output. We could of course do these calculations earlier for every single line but doing it only if the where clause matches saves a small amount of unecessary work. Finally we need to maintain a count of lines which as we run the mapper on each line means a count of 1 is recorded as an output value. We can store all of the output values as an array that corresponds to the text key.

 

The following extract implements this approach in Java. We are parsing each line and creating a key by concatenating fields 9 and 10, the values are stored in a DoubleArrayWritable. Fields 5,6 and 7 are stored directly and we also examine the date in field 11 and if this meets the correct criteria we calculate the discount_price and charge and store it along with the row count along with the fields previously stored.

 

 

String line = value.toString();  
StringTokenizer tokenizer = new StringTokenizer(line,"|");  
int i = 1;
double l_extendedprice = 0.0;
double l_discount = 0.0;
double l_tax = 0.0;
        while (tokenizer.hasMoreTokens()) {  
String token = tokenizer.nextToken();
if(i == 5) {
outArray[0] = new DoubleWritable(Double.parseDouble(token)); 
}
if(i == 6) {
outArray[1] = new DoubleWritable(Double.parseDouble(token)); 
}
else if (i == 7) {
outArray[2] = new DoubleWritable(Double.parseDouble(token)); 
}
else if ( i == 8 ) {
l_tax = (Double.parseDouble(token)); 
}
else if (i == 9) {
tpchq1key.set(tpchq1key.toString()+" "+token);
}
else if (i == 10) {
tpchq1key.set(tpchq1key.toString()+" "+token);
 }
else if (i == 11) {
StringTokenizer wheredate = new StringTokenizer(token,"-");
win.set(Integer.parseInt(wheredate.nextToken()),Integer.parseInt(wheredate.nextToken()),Integer.parseInt(wheredate.nextToken()));
 }
 i++;
    }  
if(win.getTimeInMillis() <= wc.getTimeInMillis()) {
l_extendedprice = outArray[1].get();
l_discount = outArray[2].get();
double sum_disc_price = (l_extendedprice * (1 - l_discount));
double sum_charge = (l_extendedprice * (1 - l_discount) * (1 + l_tax));
outArray[3] = new DoubleWritable(1.0);
outArray[4] = new DoubleWritable(sum_disc_price);
outArray[5] = new DoubleWritable(sum_charge);
DoubleArrayWritable da = new DoubleArrayWritable();
da.set(outArray);
context.write(tpchq1key, da);

 

Now we have written the map phase and have output the required intermediate key, value pair however so far we have worked on each line of input individually so have not as yet implemented any of the aggregate functions. We will look at this in the reduce phase in Pt 3.

In Part I of this blog post on information security I investigated the performance of Oracle TDE with hardware acceleration using Intel AES-NI and saw significant performance gains. However, I subtly sidestepped the question of what happens when the data is cached? As we found during part I, Oracle does not cache the data at all, instead choosing to read from disk and decrypt every time the same query is run. Why does Oracle not cache the data after the first time the query is run?

 

Running a trace on the query and looking in the trace file helps answer the question by showing that the operation used to read the data was identified as a direct path read. This will be familiar to Oracle DBAs with Oracle parallel query experience. It means that the data is read directly into the user session's PGA (bypassing the SGA), instead of using a more familiar db file scattered read, where the data could potentially be cached but placed at the end of the LRU list and aged out more quickly if space in the buffer cache is required. Why does Oracle use a direct path read for a non-parallel query? The answer lies in Note: 793845.1 from My Oracle Support that says:

 

"There have been changes in 11g in the heuristics to choose between direct path reads or reads through buffer cache for serial table scans. In 10g, serial table scans for “large” tables used to go through cache (by default) which is not the case anymore. In 11g, this decision to read via direct path or through cache is based on the size of the table, buffer cache size and various other stats."

 

This makes a great deal of sense. Given the massive gains in platform bandwidth and latency, direct path read can be as fast as a db file scattered read and also improve scalability: There is no need to acquire a cache buffers chains latch to scan data buffered in memory to prevent that data being changed while it is in the process of being scanned. It is also beneficial in a RAC clustered environment where other nodes may be interested in the contents of the local buffer cache. As the My Oracle Support note mentions, the decision whether or not to cache the data is dependent on a number of factors. One of the most important is whether the table exceeds a size value based on the value determined by the hidden parameter _small_table_threshold. Instead of using this parameter, however, I granted the user a higher privilege and then used the event 10949 "Disable autotune direct path read for serial full table scan" to modify the default behaviour to observe its impact on clear text and encrypted data as follows:

 

SQL> alter session set events '10949 trace name context forever, level 1';

 

After doing this, re-running the same query on the clear text data does physical reads but takes slightly longer to cache.

 

Elapsed: 00:00:42.53

 

Subsequent runs shows that we have returned almost to previous performance, although not outperforming the direct path read.

 

Elapsed: 00:00:29.55

 

Now the data is cached in the buffer cache in the SGA and not read from disk.

 

Statistics
----------------------------------------------------------
...
0 physical reads
...

Tracing also showed that autotune direct path read was disabled and db file scattered read is being used.

 

I then tried setting the same event for the unencrypted data with hardware acceleration disabled. Running the query on the first occasion took slightly longer than before as the data was read from disk, decrypted, and cached in the buffer cache.

 

Elapsed: 00:02:25.97

 

However, on subsequent runs the difference was dramatic. The data was cached in clear text in memory and therefore ran considerably quicker.

 

Elapsed: 00:00:29.30

 

Similarly, with AES-NI enabled the initial read from disk and decryption took a similar length of time as before.

 

Elapsed: 00:00:45.30

 

Once the data was cached no decryption was required.

 

Elapsed: 00:00:29.49

 

To recap, the following are results when the query is cached in the SGA:

 

Clear Text Query cached = 00:00:29.55
Software only encryption cached as clear text = 00:00:29.30
AES-NI accelerated cached as clear text = 00:00:29.49 

 

In other words, the result is exactly the same and entirely expected from the TDE FAQ with cached data. Once the data is in the buffer cache it is in clear text and should therefore take a similar time to read irrespective of whether the tablespace is encrypted.

 

For Oracle encryption performance when running queries that use full table scans once the data is cached in memory (for tablespace as opposed to column encryption) it is in clear text and therefore hardware acceleration will not be used after the first read from disk. However, as we saw in Part I it would be wrong to assume that just because we size the buffer cache adequately that a table would necessarily be cached. Additionally, at this release of Oracle (11.2.0.2), whether the data is encrypted does not impact upon how full table scans are implemented. You have the manual intervention alternative and using unsupported underscore parameters to modify Oracle's behaviour. In these simple tests I have tested for only a single user, without considering the implications of scalability or clustering. If you do modify Oracle's behaviour you will need to retest that your assumptions are correct for each and every Oracle release.

 

What we want with Oracle database encryption as the name TDE implies is for it to be transparent without needing to modify practices at all just because we want the data to be encrypted. What these simple tests show is the best way to do this is by using Intel Xeon processors with AES-NI for Oracle database encryption acceleration.

One interesting aspect of working with real world implementations of database technology is that they often raise interesting questions about how that technology is used. One such recent question concerned the real performance benefits of the Intel Advanced Encryption Standard New Instructions (Intel AES-NI) set on Intel Xeon processors for improving Oracle database encryption performance with Oracle Transparent Data Encryption (TDE).

 

Oracle TDE provides security where data is automatically encrypted and decrypted when written to and read from the physical media. If we look in the Oracle TDE FAQ it clarifies what happens to the data once it has already been read and is held in memory. With TDE, column encryption data always remains encrypted in the Oracle SGA so the benefits of encryption acceleration are clear. However, with TDE tablespace encryption, data is already decrypted in the SGA so once it is cached it is in clear text. This makes the question more precise:

 

If I am using TDE tablespace encryption for query-based data and have a large SGA to cache most of the data in decrypted form, what gains will AES-NI really bring me?

 

The best way to answer this question is to put it to the test, so I tested a system equipped with Intel Xeon processor E5-2680 running Red Hat Enterprise Linux* Server release 5.6. I know these processors have AES-NI, but in the flags section of "/proc/cpuinfo" the flag "aes" confirms it for me. I installed Oracle 11.2.0.2 and applied the patch 10080579 to enable TDE to use AES-NI by default.

 

To set up TDE I then created an encryption wallet directory in my admin directory as follows:

 

/u01/app/oracle/admin/SANDEPDB1/wallet

 

 

I ensured that the permissions were set correctly to keep the directory secure.  I then created a sqlnet.ora file in my network admin directory,

 

/u01/app/oracle/product/11.2.0/dbhome_1/network/admin

 

 

and added the following line to this file (keeping the entry all on the same line):

 

ENCRYPTION_WALLET_LOCATION=(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=/u01/app/oracle/admin/SANDEPDB1/wallet/)))

 

Then I created (you can see the file created in your wallet directory) and opened the wallet as follows:

 

[oracle@sandep1 ~]$ sqlplus / as sysdba
...
SQL> alter system set encryption key authenticated by "oraclepassword";
System altered.
SQL>
WALLET created

 

Next, I created an unencrypted tablespace as normal and an encrypted tablespace,

 

SQL> create bigfile tablespace TPCH_ENCRYPT datafile '+DATA' size 50g encryption using 'AES256' default storage(encrypt);
Tablespace created.

 

and checked the tablespace was indeed encrypted.

 

SQL> select tablespace_name, encrypted from dba_tablespaces;
TABLESPACE_NAME ENC
------------------------------ ---
SYSTEM NO
SYSAUX NO
UNDOTBS1 NO
TEMP NO
USERS NO
TPCH_CLEAR NO
TPCH_ENCRYPT YES
7 rows selected.

 

I then used Hammerora to create identical Scale Factor 10, 10GB schemas based on the TPC-H specification in both clear-text and encrypted forms, ensuring that with my 40GB buffer cache in the Oracle SGA there would be plenty of memory to cache the data.  I also used Hammerora to run and capture an example query to use against this data and keep the predicates the same so the query run would be identical each time. I used autotrace and timing to test my query performance. First I took a look at the clear text schema with the following query (which is TPC-H Query 1).

 

SQL> connect tpch/tpch
Connected.
SQL> set autotrace on;
SQL> set timing on;
SQL> select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate <= date '1998-12-01' - interval '119' day (3) group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus;

 

The query returned 4 rows and the timing value showed it took almost 29 seconds.

 

Elapsed: 00:00:28.80

 

Our Execution plan shows "TABLE ACCESS FULL" on "LINEITEM" which is a full table scan on our biggest table. Our statistics show that these were physical reads which means that the data was not cached in memory but instead read from disk.


Statistics
----------------------------------------------------------
...
1038269 physical reads
...

 

I then ran the same query again with the result showing the following timing.

 

 

 

Elapsed: 00:00:28.74

 

And the same execution plan value for physical reads.

 

Statistics
----------------------------------------------------------
...
1038269 physical reads
...

 

The first time, Oracle decided not to cache the LINEITEM table in the buffer cache SGA. When I ran the query a second time, it fetched the data from disk again and continued to do this for every time the query was run with consistent timing each time.

 

What about the same query on encrypted data? Oracle TDE is controlled by hidden parameters, so we need a query to see the parameters:

 

SQL> li
1* select a .ksppinm "Parameter", b.ksppstvl "Session Value", c.ksppstvl "Instance Value" from x$ksppi a, x$ksppcv b, x$ksppsv c where a.indx = b.indx AND a.indx = c.indx AND ksppinm like '%encryption%'
SQL> /
Parameter
--------------------------------------------------------------------------------
Session Value
--------------------------------------------------------------------------------
Instance Value
--------------------------------------------------------------------------------
_use_platform_encryption_lib
TRUE
TRUE
_use_hybrid_encryption_mode
TRUE
TRUE
_db_disable_temp_encryption
FALSE
FALSE

 

To use AES-NI  the parameter _use_platform_encryption_lib needs to be set to TRUE. To use AES-NI for both encryption and decryption _use_hybrid_encryption_mode needs to be set to FALSE. We can test the impact of hardware encryption acceleration performance by turning these parameters on and off, for example:

 

SQL> alter system set "_use_platform_encryption_lib"=FALSE scope=both;
System altered.

 

I restarted the database and opened the wallet before running queries against the encrypted data without AES-NI for hardware encryption acceleration. In this case Oracle is using software only to do the decryption and is not using the AES-NI at all.

 

SQL> alter system set wallet open identified by "oraclepassword";
System altered.

 

 

The query again returned 4 rows and the timing value is 136 seconds, including the time to do the software only decryption.

 

Elapsed: 00:02:16.41

 

Our Execution plan shows "TABLE ACCESS FULL" on "LINEITEM" and again our statistics show that the full table scan was based on physical reads. Each time that the same statement was re-executed, the data was read from disk. This means that each time the data was decrypted with software acceleration it took approximately 4.7x longer than the same query on clear text.

 

Next, I restarted the database but this time enabled hardware accelerated encryption to use AES-NI.

 

SQL> alter system set "_use_platform_encryption_lib"=TRUE scope=both;
System altered.

 

I re-ran the same query with the following timing value:

 

Elapsed: 00:00:45.30

 

As expected, our Execution plan again shows "TABLE ACESS FULL" on "LINEITEM" with the full table scan based on physical reads. The same query was run each each time (the data was read from disk and decrypted). In this case the difference was making use of AES-NI for acceleration.

 

 

Timing the same query consistently gave us the following results:

 

  • Query on Clear Text = 00:00:28.80
  • Software Only Encryption = 00:02:16.41
  • Accelerated with AES-NI = 00:00:45.30

 

 

This means that with AES-NI the query completed 3x faster than that using software only encryption and 1.55x slower compared to using no encryption at all. Clearly if you are using Oracle TDE for encryption then you are going to see significant performance gains from using AES-NI for acceleration.

 

If you are curious why Oracle decided not to cache the data at all and instead chose to read from disk and decrypt exactly the same query each time, we'll look into that in Part II of this post.

Many of you who are interested in the subject of optimizing platforms for database performance will be familiar with Kevin Closson's blog on platforms, databases and storage. Kevin knows his subject incredibly well, so I was interested in the subject of a recent post discussing the topic of whether you should put your Oracle database redo logs on solid state disks (SSDs).  Kevin also refers to a post by Guy Harrison, where Guy's concludes, "I’m yet to see any evidence that putting redo logs on SSD is a good idea".  This got me really interested because for the last couple of years, I have put my Oracle redo logs or SQL Server transaction log on nothing but SSDs! Maybe I've been missing something here, and with the increase of customers testing their environments on SSDs, I decided to put it to the test.

 

In our labs we have a 4 socket Xeon E7 based server, which I hooked up to both SSD and regular Hard Disk Drive (HDD) storage. To be precise for the redo SSD storage, I configured an Intel® RAID Controller RS2BL080  with 8 x Intel X25-E Extreme SATA SSD (model SSDSA2SH064G1GC) in a RAID 0 configuration. For the HDD redo, I used a well-known brand of storage array configured with 15 x 15KRPM Seagate Cheetah FC 4Gb/s hard drives, the host was connected by an Emulex Zephyr Lighpulse Fibre Channel Host Adapter to the HDD storage array. Again, all 15 drives were configured as RAID 0.

 

Note that this is not a like-for-like comparison and there are some differences between them, obviously there were 15 HDDs compared to 8 SSDs but also for example the RS2BL080 has 512MB of cache compared to the 4GB on the storage array, however I was not aiming to definitively do a 'which is better' test, the aim was to see whether putting redo on SSD (as I have been doing to date) is a good idea or not using the HDD confugration as a control. Also the aim was not to prepare any special configuration, all the options for configuring the storage were taken at the default for every step of the way.

 

On to the test I used Oracle Linux 5.6 and Oracle 11.2.0.2, and created a database with the data storage area on ASM, the data storage area was also on duplicate SSD storage with the same hardware configuration as the SSD redo area. Then, I configured redo logs on both the SSD and HDD redo area, first with raw devices and then with ASM. I used Hammerora for the workload and configured the schema for the in-built OLTP test. After that, I created multiple sessions without keying and thinking time and ran 4 timed tests on SSD/RAW, HDD/RAW, SSD/ASM, HDD/ASM collecting the AWR reports as I went. As expected, the tests generated a significant amount of redo up to 50GB over a 5 minute period. The Oracle transaction rate was in the multiple tens of thousands per second (so in the millions per minute). The redo log files were configured large enough so there was no log file switches encountered during each test to interrupt the redo.

 

Looking at the AWR report for "IOStat by Filetype" section showed the following results:

 

 

TestData/Sec (MB)Reqs/Sec
SSD/RAW17024582
HDD/RAW1303252
SSD/ASM16820506
HDD/ASM1303252

 

 

Looking at the actual Iostat information recorded during the ASM tests showed the following (which corresponds to the ASM data):

 

SSD

 

Devicew/swMB/savgrq-szavqqu-szavwaitsvctm%util
sdb121357.33167.8616.101.510.070.0244.53

 

HDD

Devicew/swMB/savgrq-szavqqu-szavwaitsvctm%util
sdd13343.00130.6880.063.250.970.2583.97

 

So, the redo on the 8 x SSD drives is writing 1.28X more data per second and doing 6.4X the writes/second although the avgrq-sz shows that the HDD configuration is writing more data for each operation. However, the avwait, svctm and %util show the the HDD configuration is busier and responding slower. The log file sync times in the Top 5 Timed Events looked as follows:

 

SSD

 

 

EventWaitsTime(s)Avg wait (ms)% DB timeWait Class
DB CPU19,83278.42
log file sync6,700,2424,059116.05Commit

 

 

HDD

 

 

EventWaitsTime(s)Avg wait (ms)% DB timeWait Class
DB CPU14,25552.53
log file sync5,366,37612,709246.83Commit

 


It has to be said that both are pretty good but not surprisingly from this redo related data when using the redo on the 8 x SSD drives the system is processing 1.25X the number of transactions as it does when the redo is on the 15 x HDD drives for exactly the same high throughput workload.  ASM does have more impact on the SSD configuration than the HDD configuration but not significantly in this particular case with 11.2.0.2.

 

What about changing the redo block size?  You can check your disk sector size in (for my case the SSD disk as sdb)  /sys/block/sdb/queue. Note as this is a LUN this is presented by the RAID card although should be related to the physical disk sector size which in the case of the Intel X25-E as shown in the datasheet is 512 bytes

 

 

[root@wesex1 queue]# pwd
/sys/block/sdb/queue
[root@wesex1 queue]# cat physical_block_size
512
[root@wesex1 queue]# cat logical_block_size
512

 

 


 

Unless this is different from the standard 512 bytes Oracle will not permit you to change it and with the SSD configuration used the default options are the correct ones.

 

 

SQL> alter database add logfile group 1 '+REDO' size 90g blocksize 4096;
alter database add logfile group 1 '+REDO' size 90g blocksize 4096
*
ERROR at line 1:
ORA-01378: The logical block size (4096) of file +REDO is not compatible with
the disk sector size (media sector size is 512 and host sector size is 512)
SQL> alter database add logfile group 1 '+REDO' size 90g blocksize 512;
Database altered. 

 

 

 

 

Not too surprisingly, with the same system and disks very similar results are found with SQL Server on Windows and the Transaction Log on NTFS on SSDs - a different operating system and database but still excellent performance from exactly the same storage configuration.


 

In conclusion, should you put your database redo and transaction logs on SSDs a good idea? According to my findings yes. I suggest using the Intel X25-E Extreme SATA Solid-State Drive.

 

 

In this case, why does there seem to be a difference in opinion? Maybe I should rephrase my conclusion in that what I found precisely was that putting database redo and transaction logs on a RAID configuration of a small number of Intel X-25E SSDs gave me excellent database throughput. However, not all SSDs are the same and so I can't draw any conclusions about any other type of flash storage being suitable for redo or transaction logs from these tests. My recommendation is always to test your own setup.

So the theme here at Oracle Openworld 2011 is "engineered for innovation" and its great to see Intel Xeon platforms dominate the line-up of engineered solutions. Its also great to know that taking an open approach you can engineer your own solution of choice and my session on performance and scalability with Eric Wan focused on some tips on how to go about achieving this. For example I have a had a number of follow up questions around turbo boost, so here is the link from Julian Dyke (my co-author on Pro Oracle 10g RAC on Linux) and here is Julian's PL/SQL.

 

SET SERVEROUTPUT ON
SET TIMING ON
DECLARE
  n NUMBER := 0;
BEGIN
  FOR f IN 1..10000000
  LOOP
    n := MOD (n,999999) + SQRT (f);
  END LOOP;
  DBMS_OUTPUT.PUT_LINE ('Res = '||TO_CHAR (n,'999999.99'));
END;
/

 

If you were in the session you will know the difference between performance and scalability and this routine is single threaded so the focus is most definitely on the single threaded performance side but gives a great insight into the Intel turbo boost feature. Of course calcuating mathematical routines to measure performance are not new but I really like Julian's idea of a test for Oracle DBA's that can be run on any Oracle instance on any system including an ASM instance.

So what is Turbo Boost?  Turbo Boost allows an increase to the frequency of your cores when operating conditions allow, From an Oracle perspective the benefits are mostly seen when running single-threaded queries or PL/SQL.

Your first port of call should be ark.intel.com to see if your CPU supports turbo boost so for example checking here on the site I can see that the E7-8870 has a clock speed of 2.4GHz and  a max turbo frequency of 2.8GHz. If your CPU does support turbo boost you should then check your BIOS settings to see if it is enabled on your system. (You have checked your BIOS settings for maximum Oracle performance haven't you?)

If you have turbo boost and its enabled if you are running Oracle on Linux you might think you can look in /proc/cpuinfo to view the turbo boost frequency, after all it does change dynamically but you won't actually see it here, you need to get the turbostat utility that is included in the pmtools package to view the turbo boost frequency.

So now using the PL/SQL test in the session we showed an example of the Intel Xeon Processor 5680 with a clock speed of 3.33GHz and a Max turbo frequency of 3.6GHz. This is a frequency boost in this case of 1.08X and the PL/SQL test ran in 8.89 secs with turbo boost off (at the BIOS) and 8.21 with it enabled (again at the BIOS level). An Oracle performance gain of 1.08X that directly corresponds to the CPU feature.

Remember this gain is on one core so when you are running Oracle workloads these incremental gains add up to process your workload faster and now you know how to test for whether you have turbo boosted your Oracle workload.

It has been a while since my first blog post on Oracle database performance, so what better time to follow up again than my next appearance at Oracle Openworld 2011!

 

Where was I in 2010? I was presenting at Linuxcon in Boston. This year, I'm back at Openworld. Although the title is similar, so much has changed.

 

To start off, here are the logistics:

 

Oracle Database Performance and Scalability on Intel Xeon Platforms

Presenters:  Steve Shaw and Eric Wan
ID#: 31701
Track: Database
Date:  3 October 2011
Time: 11:00 - 12:00
Venue: Moscone South
Room: Room 300

 

We'll talk about defining both performance and scalability, and what is desirable. We'll also look at getting the basics right at the system and OS settings. Then, we'll take a walk through the platform, as we look at the CPU and see measurements of the impact of features such as turbo boost, hyper-threading and 4 and 8 socket scalability tested against an Oracle database.

 

We will also cover memory performance, including options such as NUMA settings. At the I/O level, we'll look at SSDs. Finally, we will show an approach to holistic performance testing and engineering the system as a whole to work together for optimal performance and scalability.

 

If you don't get a chance to stop by the presentation, I'll be at the Intel booth with Eric at 4.30pm to go in-depth on some of the topics. Even if you can't make it to Openworld 2011, I plan to follow up on Openworld topics and some of the areas we will speak about on this blog.

If you are attending Oracle Openworld 2009 and are interested in learning more about Oracle Database performance on Linux on Intel platforms then I will be talking on this topic in my session on Monday 12th Oct as follows:

 

 

Presenter:  Steve Shaw

ID#: S312645
Title: Oracle Database Performance on Linux: Tips, Tools, and Tuning for Intel Platforms
Track: Database
Date: 12-OCT-09
Time: 11:30 - 12:30
Venue: Moscone South
Room: Room 303

 

I will talking about the latest platform features, how you can get the best out of them on Linux  and also showing worked and customer examples of how these can benefit you with 'real world' Oracle installations.

Steve

Filter Blog

By author:
By date:
By tag: