In previous posts in this series I explored the steps in writing an existing SQL query to instead run in Hadoop as a Map Reduce Job. These blog posts very much complement the slides from my session at Oracle Openworld 2012.
This series would not be complete however without mentioning other alternatives for achieving the same aim, notably Hive and Pig. here I will focus on Hive mainly as there is already detailed information on TPC-H queries including query 1 to run on Pig here. Pig provides a script based approach to writing Map Reduce jobs whereas Hive uses a SQL like interface and therefore for this particular example translating a SQL Query Hive is the most appropriate tool here without necessarily assuming that it would be better for all circumstances.
Hive automatically translates a language called HiveQL a subset of SQL into Map Reduce jobs enabling you to take advantage of your existing SQL skills, there are limitations however and Hive for example does not support all of the SQL functionality available in a relational databases such as correlated subqueries.
To use Hive you need to download and install the additional Hive package. When you run Hive you are presented with a Hive prompt with the first stage being to define the table to query. This example uses the data loaded in Pt1 of this blog series whilst noting that Hive does not support all of the data types available in an Oracle database.
CREATE EXTERNAL TABLE lineitem ( L_ORDERKEY INT ,L_PARTKEY INT ,L_SUPPKEY INT ,L_LINENUMBER INT ,L_QUANTITY INT ,L_EXTENDEDPRICE DOUBLE ,L_DISCOUNT DOUBLE ,L_TAX DOUBLE ,L_RETURNFLAG STRING ,L_LINESTATUS STRING ,L_SHIPDATE STRING ,L_COMMITDATE STRING ,L_RECEIPTDATE STRING ,L_SHIPINSTRUCT STRING ,L_SHIPMODE STRING ,L_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/input/tpch/256';
With the table defined you can then run your SQL at the Hive prompt. The main modification is the simplification of the l_shipdate field. Also knowing the characteristics of the Job we set the number of reducers to 1 before the query is run with the following example output.
hive> set hive.exec.reducers.max=1; hive> select > l_returnflag, > l_linestatus, > sum(l_quantity) as sum_qty, > sum(l_extendedprice) as sum_base_price, > sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, > sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, > avg(l_quantity) as avg_qty, > avg(l_extendedprice) as avg_price, > avg(l_discount) as avg_disc, > count(*) as count_order > from lineitem > where l_shipdate <= '1998-08-11' > group by l_returnflag, l_linestatus > order by l_returnflag, l_linestatus; Total MapReduce jobs = 2 Launching Job 1 out of 2 ... Hadoop job information for Stage-1: number of mappers: 297; number of reducers: 1 2012-09-06 16:36:20,579 Stage-1 map = 0%, reduce = 0% 2012-09-06 16:36:49,465 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 311.34 sec ... 2012-09-06 16:40:10,495 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5565.64 sec ... Launching Job 2 out of 2 ... Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1 2012-09-06 16:40:14,217 Stage-2 map = 0%, reduce = 0% ... 2012-09-06 16:40:20,425 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 1.77 sec ... MapReduce Jobs Launched: Job 0: Map: 297 Reduce: 1 Accumulative CPU: 5565.64 sec HDFS Read: 0 HDFS Write: 0 SUCESS Job 1: Map: 1 Reduce: 1 Accumulative CPU: 1.77 sec HDFS Read: 0 HDFS Write: 0 SUCESS ... OK A F 3775127758 5.660776097194458E12 5.377736398183936E12 5.59284742951593E12 25.499370423275426 38236.116984304950.050002243530958 148047881 N F 98553062 1.4777109838598007E11 1.4038496596503482E11 1.4599979303277582E11 25.501556956882876 38237.1993888045250.049985284338051675 3864590 N O 7318871052 1.09746348095171E13 1.0425912729393264E13 1.084295716582229E13 25.499840883624667 38236.9684354666460.04999754512223738 287016342 R F 3775724970 5.661603032745331E12 5.378513563915406E12 5.593662252666915E12 25.50006628406532 38236.697258452910.050001304339683 148067261
It is interesting to observe that the job ran as 2 map reduce jobs instead of the one for the Job written by hand. Nevertheless the result was the same and the completion time as similar. However Hive significantly saved on the time it took to write the initial MapReduce Job to do the same thing. Consequently anyone looking to Hadoop should certainly consider tools such as Hive and Pig for writing Map Reduce jobs especially when experienced with existing database environments.