IDH HBase & Lucene Integration

Posted by guest.writer in The Data Stack on Aug 29, 2013 4:37:40 PM

Ritu Kama is the Director of Product Management for Big Data at Intel. She has over 15 years of experience in building software solutions for enterprises. She has led Engineering, QA and Solution Delivery organizations within the Datacenter Software Division for Security and Identity products. Last year she led product and program management for Intel’s Distribution of Hadoop and Big Data solutions. Prior to joining Intel, she led technical and architecture teams at IBM and Ascom. She has an MBA from the University of Chicago and a Bachelor’s degree in Computer Science.


Introduction

HBase is a non-relational, column-oriented database that runs on top of the Hadoop Distributed File System (HDFS). HBase tables contain rows and columns. Each row is identified by a row key, which acts as the table's primary key and is used for all Get/Put/Scan/Delete operations on that table. This can be a shortcoming when you want to search by the contents of a given column rather than by key.
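To make the limitation concrete, here is a minimal sketch in plain Java. It is a toy in-memory stand-in, not the HBase client API, and the class and method names (RowKeyTable, findByValue) are hypothetical. It shows that a lookup by row key is a single sorted-map access, while a lookup by column value has to visit every row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory stand-in for a table keyed by row key.
// Assumption: this only mimics the access pattern; it is not HBase code.
final class RowKeyTable {
    // rowKey -> (column -> value), kept sorted by row key.
    private final TreeMap<String, Map<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Lookup by row key: one access into the sorted map.
    Map<String, String> get(String rowKey) {
        return rows.get(rowKey);
    }

    // Lookup by column value: every row has to be inspected.
    List<String> findByValue(String column, String value) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : rows.entrySet()) {
            if (value.equals(e.getValue().get(column))) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }
}
```

The full scan in findByValue is the cost that a secondary index, such as one built with Lucene, is meant to avoid.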

 

The IDH Integration with Lucene

The Intel® Distribution for Apache Hadoop* (IDH) solves this problem by incorporating native features that permit straightforward integration with Lucene. Lucene is a search library that operates on documents made up of data fields and their values. The IDH-to-Lucene integration builds on the HBase coprocessor concepts of Observers and Endpoints, which is what makes it possible to search HBase data with Lucene queries.

 

Observers can be likened to triggers in an RDBMS, while Endpoints are conceptually similar to stored procedures. The mapping between HBase records and Lucene documents is handled by a convenience class called IndexMetadata. The HBase Observer monitors data updates to the HBase table and builds the indexes synchronously. The indexes are stored in multiple shards, with each shard tied to a region. The HBase Endpoint dispatches search requests from the client to those regions.
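The division of labor described above can be sketched in plain Java. This is an illustration under stated assumptions, not the IDH implementation: RegionShard and ShardedSearchClient are hypothetical names, a simple term-to-row-key map stands in for a Lucene index, and row updates or deletes are not handled.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// One index shard per region (hypothetical stand-in for a Lucene shard).
final class RegionShard {
    // term -> row keys in this region whose text contains the term
    private final Map<String, Set<String>> index = new HashMap<>();

    // Observer-like hook: runs as part of the write, so the index
    // is updated synchronously with the data.
    void put(String rowKey, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(rowKey);
            }
        }
    }

    // Endpoint-like hook: answers a search from this region's shard only.
    Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}

// Client side: sends the query to every region and merges the answers.
final class ShardedSearchClient {
    private final List<RegionShard> regions;

    ShardedSearchClient(List<RegionShard> regions) {
        this.regions = regions;
    }

    Set<String> search(String term) {
        Set<String> merged = new TreeSet<>();
        for (RegionShard region : regions) {
            merged.addAll(region.search(term));
        }
        return merged;
    }
}
```

Because each shard only knows about its own region, a search has to reach all regions before the result is complete, which is the role the Endpoint plays in the description above.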

 

Before entering data into an HBase table, you need to create an HBase-Lucene mapping using the IndexMetadata class. During insertion, the text in the mapped columns is broken into terms and stored in the Lucene index. The IDH implementation builds the Lucene index automatically. Once the index exists, you can search on any keyword: the implementation looks up the word in the Lucene index and retrieves the row IDs of the rows that contain it. Using those row IDs as keys, you can then access the relevant rows in the database directly.
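The flow in this paragraph (map columns, index on insert, search for a keyword, then fetch rows by ID) can be sketched as follows. This is a simplified illustration in plain Java and does not use the IDH IndexMetadata class; IndexedTable, searchRowIds, and the column names in the usage note are hypothetical, and a set of column names stands in for the mapping.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class IndexedTable {
    // Hypothetical stand-in for the mapping: the columns to be indexed.
    private final Set<String> indexedColumns;
    private final Map<String, Map<String, String>> rows = new TreeMap<>();
    private final Map<String, Set<String>> termToRowIds = new HashMap<>();

    IndexedTable(Set<String> indexedColumns) {
        this.indexedColumns = indexedColumns;
    }

    // Every value is stored; only values in mapped columns are indexed.
    void put(String rowId, String column, String value) {
        rows.computeIfAbsent(rowId, k -> new TreeMap<>()).put(column, value);
        if (!indexedColumns.contains(column)) {
            return;
        }
        for (String term : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                termToRowIds.computeIfAbsent(term, k -> new TreeSet<>()).add(rowId);
            }
        }
    }

    // Step 1: keyword -> row IDs, answered from the index alone.
    Set<String> searchRowIds(String keyword) {
        return termToRowIds.getOrDefault(keyword.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    // Step 2: row ID -> row, a direct key lookup.
    Map<String, String> get(String rowId) {
        return rows.get(rowId);
    }
}
```

For example, with "doc:body" as the only mapped column, a value written to "doc:author" is stored in the row but cannot be found by keyword search.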

 

IDH's HBase-Lucene integration extends HBase's capability and provides many advantages:

1. Search not only by row key but also by values.

2. Use multiple query types such as Starts, Ends, Contains, Range, etc.

3. Obtain ranking scores for the search results.
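As a rough illustration of what the Starts, Ends, Contains, and Range query types mean, the sketch below expresses each one as a predicate over column values in plain Java. The names (ValueQueries, matchingRowKeys) are hypothetical, and the predicates are evaluated by brute force over a map; this is not how Lucene executes queries, since Lucene answers them from its index. Ranking scores are not covered here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class ValueQueries {
    private ValueQueries() {
    }

    static Predicate<String> startsWith(String prefix) {
        return value -> value.startsWith(prefix);
    }

    static Predicate<String> endsWith(String suffix) {
        return value -> value.endsWith(suffix);
    }

    static Predicate<String> contains(String fragment) {
        return value -> value.contains(fragment);
    }

    // Inclusive lexicographic range: low <= value <= high.
    static Predicate<String> range(String low, String high) {
        return value -> value.compareTo(low) >= 0 && value.compareTo(high) <= 0;
    }

    // Returns the row keys whose value satisfies the query,
    // in the iteration order of the supplied map.
    static List<String> matchingRowKeys(Map<String, String> valuesByRowKey, Predicate<String> query) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> e : valuesByRowKey.entrySet()) {
            if (query.test(e.getValue())) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }
}
```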

Lucene-1.jpg

Sample Code and Configuration Procedures

The following steps illustrate the basics of implementing the HBase-Lucene integration, followed by examples of the search types that become available.

 

1. Modify hbase-site.xml using Intel Manager

Attribute | Operation | Value
hbase.coprocessor.region.classes | add | org.apache.hadoop.hbase.coprocessor.search.IndexSearcherEndpoint
hbase.regionserver.handler.count | add/modify | >100
hbase.regionserver.coprocessorhandler.count | add/modify | >10
hbase.coprocessor.master.classes | add | org.apache.hadoop.hbase.search.LuceneMasterCoprocessor
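For reference, the settings in step 1 might end up looking roughly like this in hbase-site.xml once Intel Manager has applied them. This is a sketch, not output captured from a cluster: the property names and class names are the ones listed in step 1, while the numeric values 128 and 16 are example choices that merely satisfy the "greater than 100" and "greater than 10" guidance. If a coprocessor property already lists other classes, the new class would be appended to the existing comma-separated list rather than replacing it.

```xml
<!-- Fragment of hbase-site.xml. The numeric values are illustrative
     assumptions, not values taken from the article. -->
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.coprocessor.search.IndexSearcherEndpoint</value>
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>128</value>
</property>
<property>
  <name>hbase.regionserver.coprocessorhandler.count</name>
  <value>16</value>
</property>
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.search.LuceneMasterCoprocessor</value>
</property>
```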

 

2. Create the index metadata

lucene-2.jpg

3. Create an HBase table and attach the index metadata

lucene-3.jpg

4. Search

lucene-4.jpg

The following table lists the queries that can be implemented with this integration and that deliver the benefits described above.

lucene-5.png

To learn more about the IDH HBase-to-Lucene integration and to get a copy of the source code with usage examples, please check out hadoop.intel.com/resources.

 

Bye for now,

 

Ritu
