
About Chandeep Singh: I am a Software Engineer at Intel, working in Professional Services for the Intel Distribution for Apache Hadoop. I have been working in the Hadoop world for a while, have gotten my hands dirty with MapReduce, Hive, Pig, and Sqoop, and have enjoyed every bit of it.


Here is another interesting use case that came up while I was working with one of our clients in the insurance industry. The client had an enormous amount of claim data residing in multiple SQL Server databases that needed to be consolidated into one. Some of the queries on this data took days, so we were looking for an alternative that could process the data in a distributed fashion and save us some time. Since the company was already using Hadoop, we started looking into a Hadoop-based solution.

 

We had a few options on the table, such as Hive, Pig, and HBase, and after some brainstorming decided to go with HBase for the following reasons:

  1. It is an open source distributed database that offers high performance while remaining cost effective.
  2. We do not have to worry about distributing the data for faster processing, since Hadoop takes care of that.
  3. It handles batch processing well without requiring real indexes.
  4. Data integrity: HBase confirms a write only after its write-ahead log entry has reached all three in-memory HDFS replicas.
  5. It is easily scalable, fault tolerant, and highly available.

 

The next step was to move the data from SQL Server to HDFS, for which we used Sqoop. Sqoop imports all the data, stores it as comma-separated text by default, and can be used as follows:

 

sqoop import --connect 'jdbc:sqlserver://<ServerName>;username=<UserName>;password=<Password>;database=<Database Name>' --table <Table Name> --target-dir <Destination Path>
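
For example, here is a rough sketch with made-up connection details (the server, credentials, database, table name, and target directory are all placeholders). If the source table has no primary key, Sqoop will also want either -m 1 or a --split-by column:

sqoop import --connect 'jdbc:sqlserver://dbserver01;username=claims_user;password=secret;database=ClaimsDB' --table Claims --target-dir /user/hadoop/claims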

 

The next step was to create an HBase table and load all the data from the CSV files into it, which can be done in either of the following ways:

 

hadoop jar <Path To HBase Jar> importtsv -Dimporttsv.columns=<Column Names> '-Dimporttsv.separator=,' <Table To Import Into> <Input Directory>
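
As a rough sketch against the sample table used later in this post (the jar path and input directory are placeholders), note that one of the names in -Dimporttsv.columns must be HBASE_ROW_KEY so that importtsv knows which CSV field to use as the row key:

hadoop jar /usr/lib/hbase/hbase-0.94.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,TEST:mydata1 '-Dimporttsv.separator=,' MY_TABLE /user/hadoop/claims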


Or use the complete bulk loader:


hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] <Input Directory> <Table To Import Into>
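
Note that completebulkload does not read the CSV itself; it expects a directory of HFiles, which you would normally produce by re-running the importtsv command above with an extra -Dimporttsv.bulk.output flag. A sketch of the two-step flow, with hypothetical jar path and directories:

hadoop jar /usr/lib/hbase/hbase-0.94.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,TEST:mydata1 '-Dimporttsv.separator=,' -Dimporttsv.bulk.output=/tmp/my_table_hfiles MY_TABLE /user/hadoop/claims
hadoop jar /usr/lib/hbase/hbase-0.94.jar completebulkload /tmp/my_table_hfiles MY_TABLE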

 

We gained everything we had hoped for by moving to HBase, but something was still missing. Querying an HBase database was not everyone's cup of tea, so the process needed to be simplified. Since the insurance folks were already familiar with SQL, the easiest way out was to build a Hive schema on top of the HBase table.

 

There are two cases to consider when creating a Hive table on top of HBase:

  1. We do not know the column names, or we need all the columns, in which case we can explode all the data into a map of key-value pairs.
  2. We need only specific columns, in which case we specify a mapping for each column.

 

Let’s look at an example:

 

The first step is to create a sample HBase table.

create 'MY_TABLE', {NAME=>'TEST', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}
describe 'MY_TABLE'
enable 'MY_TABLE'

 

The next step is to insert some sample data into MY_TABLE:

put 'MY_TABLE', '1', 'TEST:mydata1', 'value1'
put 'MY_TABLE', '2', 'TEST:mydata2', 'value2'
put 'MY_TABLE', '3', 'TEST:mydata3', 'value3'
put 'MY_TABLE', '4', 'TEST:mydata4', 'value4'
put 'MY_TABLE', '5', 'TEST:mydata5', 'value5'
put 'MY_TABLE', '5', 'TEST:mydata1', 'value1'
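
A quick scan from the HBase shell is a good sanity check at this point; given the puts above, it should show six cells spread across five row keys, since the last two puts share row key '5':

scan 'MY_TABLE'
count 'MY_TABLE'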

 

As mentioned above, we can create a Hive external table in either of two ways.


Use all columns:

CREATE EXTERNAL TABLE test_all (id string,colname map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,TEST:")
TBLPROPERTIES("hbase.table.name" = "MY_TABLE");

hive> SELECT * FROM test_all;
1 {"mydata1":"value1"}
2 {"mydata2":"value2"}
3 {"mydata3":"value3"}
4 {"mydata4":"value4"}
5 {"mydata1":"value1","mydata5":"value5"}

Or map every column by name:
CREATE EXTERNAL TABLE test_map(id string,colname1 string,colname2 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,TEST:mydata1,TEST:mydata5")
TBLPROPERTIES("hbase.table.name" = "MY_TABLE");

hive> SELECT * FROM test_map;
1       value1  NULL
5       value1  value5
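
And since the whole point of this exercise was to give SQL-savvy users a familiar interface, ordinary predicates work as you would expect; for example, given the sample data this should return only row 5:

hive> SELECT id FROM test_map WHERE colname2 = 'value5';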

 

Please let me know if you have any questions about this example by posting a comment below. Thanks for your time!

 

Chandeep
