HBase and Elasticsearch

HBase (http://hbase.apache.org/) is an open source, non-relational database that runs on top of HDFS. It is modeled after Google's Bigtable.

Consider the problem of storing a very large number of documents in their original formats (OpenOffice, PDF, MS Office and so on), searching through them, and being able to retrieve any one of them at random.

A key-value pair provides a convenient way to store the documents, where the key can be the file name and the value the contents of the document. You can easily use HBase as the data store for this purpose.
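As a rough illustration of this key-value layout, the sketch below builds the pair for a single file. The helper name make_key_value is hypothetical, and the 'content:data' column name is simply the one used later in this article; nothing here talks to HBase yet.

```python
import os
import tempfile


def make_key_value(path):
    """Return (key, value) for one document: the file name and its raw bytes."""
    with open(path, 'rb') as doc:
        # The value is a one-column mapping, as a column-oriented store expects
        return path, {'content:data': doc.read()}


if __name__ == '__main__':
    # Demonstrate with a throwaway file standing in for a real document
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
        tmp.write(b'%PDF-1.4 dummy')
    key, value = make_key_value(tmp.name)
    print(key, len(value['content:data']))
    os.remove(tmp.name)
```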

Elasticsearch has an attachment plugin (https://github.com/elasticsearch/elasticsearch-mapper-attachments), which uses Tika (http://tika.apache.org/) to analyze and index the attachments.

Hence, combining HBase with Elasticsearch and its attachment type gives you a flexible way to search through, and select from, a huge number and variety of documents.

Getting Started with HBase

Currently, HBase is not available from the Fedora repository, though it will be for Fedora 21. So, download the current version from an Apache mirror and follow the quick start instructions (http://hbase.apache.org/book/quickstart.html). You will want to set it up in fully distributed mode, using the three OpenStack virtual machines, h-mstr, h-slv1 and h-slv2, as we have been doing in this series.

HBase is a Java application but comes with two servers, REST and Thrift, to allow integration with other programming environments, in particular Python. HappyBase (http://happybase.readthedocs.org), which talks to the Thrift server, makes it easy to interface with Apache HBase from a remote machine; so install it on the desktop.

Make sure that the HDFS servers are running. Sign in as fedora@h-mstr, start the HBase servers and create the 'documents' table with one column family, 'content':

[fedora@h-mstr ~]$ cd hbase-0.96.2-hadoop2
[fedora@h-mstr hbase-0.96.2-hadoop2]$ bin/start-hbase.sh
[fedora@h-mstr hbase-0.96.2-hadoop2]$ bin/hbase-daemons.sh start thrift
[fedora@h-mstr hbase-0.96.2-hadoop2]$ bin/hbase shell
hbase(main):001:0> create 'documents','content'
hbase(main):002:0> exit
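The same table can probably be created from Python instead of the HBase shell. The sketch below assumes HappyBase's Connection.tables() and Connection.create_table() calls; ensure_table is a hypothetical helper name, and the connection is passed in so that the function can be tried with any object offering those two methods.

```python
def ensure_table(connection, name, families=('content',)):
    """Create table `name` with the given column families unless it exists.

    `connection` is expected to behave like happybase.Connection
    (an assumption: only tables() and create_table() are used).
    Returns True if the table was created, False if it was already there.
    """
    existing = set()
    for t in connection.tables():
        # Table names may come back as bytes, depending on the Python version
        existing.add(t.decode('utf-8') if isinstance(t, bytes) else t)
    if name in existing:
        return False
    # An empty dict per family means "use the default family options"
    connection.create_table(name, dict((fam, dict()) for fam in families))
    return True
```

With a live cluster this might be called as ensure_table(happybase.Connection('h-mstr'), 'documents').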

The following Python code, store_files_in_hbase.py, walks through all the files in a directory and selects the ones you want by extension. It then reads each file's contents into a variable and stores them in the 'documents' table, with the file name as the row key and the contents in the column 'content:data'.

#!/usr/bin/python
import os
import sys

import happybase

TABLE = 'documents'
FILETYPES = ['odt', 'doc', 'sxw', 'abw', 'pdf']

def process_file(table, p, f):
    # Row key is the file path; the raw bytes go into column 'content:data'
    filename = '/'.join([p, f])
    with open(filename, 'rb') as doc:
        filecontent = doc.read()
    print("Processing %s %s Len %d" % (p, f, len(filecontent)))
    table.put(filename, {'content:data': filecontent})

def get_documents(path):
    # Walk the directory tree and yield files with a wanted extension
    for curr_path, dirs, files in os.walk(path):
        for f in files:
            try:
                if f.rsplit('.', 1)[1].lower() in FILETYPES:
                    yield curr_path, f
            except IndexError:
                # File name has no extension
                pass

# Directory to scan: first argument, or the current directory
try:
    path = sys.argv[1]
except IndexError:
    path = '.'

connection = happybase.Connection('h-mstr')
table = connection.table(TABLE)
for p, f in get_documents(path):
    process_file(table, p, f)

It is best to index the documents at the same time as they are stored in HBase. (As usual, exception handling has been left out for simplicity.)
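Since exception handling was left out of the script, here is one hedged way it might be added: wrap the per-file work so that a single unreadable file or failed write does not stop the whole run. The name store_all and the idea of collecting failures in a list are illustrative assumptions, not part of the original script.

```python
def store_all(documents, store_one):
    """Call store_one(path, name) for each document; keep going on errors.

    documents: iterable of (path, name) pairs, e.g. from get_documents()
    store_one: callable doing the real work, e.g. a wrapper over process_file
    Returns (number stored, list of (path, name, error message)).
    """
    stored = 0
    failures = []
    for path, name in documents:
        try:
            store_one(path, name)
            stored += 1
        except Exception as err:  # broad on purpose: I/O and Thrift errors alike
            failures.append((path, name, str(err)))
    return stored, failures
```

In the script, the final loop could then become something like stored, failures = store_all(get_documents(path), lambda p, f: process_file(table, p, f)).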

Elasticsearch Attachments

Install the Elasticsearch attachment plugin on each of the three virtual machines and start the Elasticsearch server. For example, log in as fedora@h-mstr:

[fedora@h-mstr ~]$ cd /usr/share/elasticsearch/
[fedora@h-mstr elasticsearch]$ sudo bin/plugin -install \
    elasticsearch/elasticsearch-mapper-attachments/2.4.1
[fedora@h-mstr elasticsearch]$ sudo systemctl start elasticsearch

Repeat the steps on h-slv1 and h-slv2.

You will need to create an index whose mapping uses the attachment type before you can start indexing the documents. The following curl command will create the index, 'documents':

$ curl -XPOST h-mstr:9200/documents -d '{
  "mappings" : {
    "attachment" : {
      "_source" : { "excludes" : ["file"] },
      "properties" : {
        "file" : {
          "type" : "attachment",
          "fields" : {
            "title" : { "store" : "yes" },
            "file" : { "term_vector" : "with_positions_offsets", "store" : "no" }
          }
        }
      }
    }
  }
}'

Since you are storing and accessing the files from HBase, there is no advantage in storing the documents in the index as well. Hence the directive to exclude 'file' from the stored source.
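For readers who prefer to keep everything in Python, the same mapping can be written as a dictionary and serialised with the json module. This is only a sketch of the request body used to create the index; build_mapping is a hypothetical helper, and actually sending the body to the server is left out.

```python
import json


def build_mapping():
    """Return the index-creation body as a Python dictionary."""
    return {
        'mappings': {
            'attachment': {
                # Keep the bulky base64 'file' field out of the stored _source
                '_source': {'excludes': ['file']},
                'properties': {
                    'file': {
                        'type': 'attachment',
                        'fields': {
                            'title': {'store': 'yes'},
                            'file': {'term_vector': 'with_positions_offsets',
                                     'store': 'no'},
                        },
                    },
                },
            },
        },
    }


if __name__ == '__main__':
    # The printed JSON can be passed to curl with -d
    print(json.dumps(build_mapping(), indent=2))
```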

Extend 'store_files_in_hbase.py' to include indexing of the document as well using https://gist.github.com/stevehanson/7462063/ as a reference:

import base64
import json

TMP_FILE_NAME = 'tmp.json'

def create_encoded_temp_file(fname, filecontent):
    # The attachment type expects the file content to be base64-encoded
    file64 = base64.b64encode(filecontent).decode('ascii')
    data = {'file': file64, 'title': fname}
    with open(TMP_FILE_NAME, 'w') as f:
        json.dump(data, f)  # dump json to tmp file

def process_file(table, p, f):
    filename = '/'.join([p, f])
    with open(filename, 'rb') as doc:
        filecontent = doc.read()
    print("Processing %s %s Len %d" % (p, f, len(filecontent)))
    table.put(filename, {'content:data': filecontent})
    # Index the contents
    create_encoded_temp_file(filename, filecontent)
    # URL: http://Host/Index/Type
    os.system('curl -XPOST http://h-mstr:9200/documents/attachment -d @'
              + TMP_FILE_NAME)
    os.remove(TMP_FILE_NAME)

If all goes well, your files will be stored in the HBase table and indexed in Elasticsearch.
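The temporary file is not strictly required. As a sketch of an alternative, the JSON body can be built in memory; build_payload is a hypothetical name, and how the body is then sent to the server (an HTTP library, or the Elasticsearch Python client used in the next section) is left as an assumption.

```python
import base64
import json


def build_payload(fname, filecontent):
    """Return the JSON text to index: base64 file content plus the title."""
    file64 = base64.b64encode(filecontent).decode('ascii')
    return json.dumps({'file': file64, 'title': fname})


if __name__ == '__main__':
    # Made-up file name and bytes, for illustration only
    body = build_payload('docs/report.pdf', b'%PDF-1.4 example bytes')
    print(body)
```

Building the body in memory avoids writing and deleting tmp.json for every document, at the cost of holding the encoded copy in memory alongside the original.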

Search and Retrieve

You can search the index 'documents' for a keyword expression as follows:

#!/usr/bin/python
import sys

from elasticsearch import Elasticsearch

def search(es_index, query_string):
    es = Elasticsearch(['h-mstr'])
    # Full-text query; ask only for the 'title' field of each hit
    query_body = {'query':
                      {'query_string':
                           {'query': query_string}},
                  'fields': ['title']
                  }
    # response is a dictionary with nested dictionaries
    response = es.search(index=es_index, body=query_body)
    return [hit['fields']['title'] for hit in response['hits']['hits']]

# test run the module with search expression as a parameter
if __name__ == '__main__':
    print(search('documents', ' '.join(sys.argv[1:])))
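Depending on the Elasticsearch version, each value under 'fields' may come back as a list rather than a plain string, and a hit may have no 'fields' entry at all. The sketch below flattens such a response into a simple list of titles; extract_titles is a hypothetical helper and the sample response is made up for illustration.

```python
def extract_titles(response):
    """Return a flat list of titles from a search response dictionary."""
    titles = []
    for hit in response.get('hits', {}).get('hits', []):
        title = hit.get('fields', {}).get('title')
        if title is None:
            continue  # hit carries no title field
        if isinstance(title, (list, tuple)):
            titles.extend(title)  # value returned as a list
        else:
            titles.append(title)
    return titles


if __name__ == '__main__':
    # Made-up response, shaped like the nested dictionaries returned by a search
    sample = {'hits': {'hits': [
        {'fields': {'title': ['docs/a.pdf']}},
        {'fields': {'title': 'docs/b.odt'}},
        {'_id': 'no-fields'},
    ]}}
    print(extract_titles(sample))
```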

The following example lets you get a document from the HBase 'documents' table by supplying its key:

#!/usr/bin/python
import sys

import happybase

def fetch_attachment(table, full_filename):
    # Look up the row by its key and write the stored bytes to a local file
    row = table.row(full_filename)
    filename = full_filename.split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(row['content:data'])
    return filename

# test run the module with the key (full filename) as parameter
if __name__ == '__main__':
    connection = happybase.Connection('h-mstr')
    table = connection.table('documents')
    print("Document File: %s" % fetch_attachment(table, sys.argv[1]))

The search and fetch_attachment functions can be combined to create versatile and very useful web applications for any organisation with a very large number of documents.
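As a minimal sketch of how the two pieces might be combined, the function below runs a search and fetches every matching document. The search and fetch callables are passed in as parameters (an assumption made so that the sketch runs without a cluster); in practice they would wrap the search and fetch_attachment functions shown earlier.

```python
def search_and_fetch(query, search_fn, fetch_fn):
    """Search for `query`, then fetch each hit; return the local file names.

    search_fn(query) -> list of row keys (a key may be wrapped in a list)
    fetch_fn(key)    -> name of the local file written for that key
    """
    fetched = []
    for key in search_fn(query):
        if isinstance(key, (list, tuple)):
            if not key:
                continue  # nothing usable in this hit
            key = key[0]  # unwrap a single-element list of titles
        fetched.append(fetch_fn(key))
    return fetched
```

With the earlier modules this might be called as search_and_fetch('hadoop', lambda q: search('documents', q), lambda k: fetch_attachment(table, k)), where 'hadoop' is just an example query.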
