HBase (http://hbase.apache.org/) is an open source, non-relational database that runs on top of HDFS. It is modeled after Google's Bigtable.
Consider the problem of storing a very large number of documents (OpenOffice, PDF or MS Office files) in their original format, searching through them and being able to access any document at random.
A key-value pair provides a convenient way to store the documents, where the key can be the file name and the value the contents of the document. You can easily use HBase as the data store for this purpose.
Elasticsearch has an attachment plugin (https://github.com/elasticsearch/elasticsearch-mapper-attachments), which uses Tika (http://tika.apache.org/) to analyze and index the attachments.
Hence, a combination of HBase and Elasticsearch with the attachment type gives you a flexible way to search and select from a huge number and variety of documents.
Currently, HBase is not available from the Fedora repository, though it will be for Fedora 21. So, download the current version from an Apache mirror and follow the quick start instructions (http://hbase.apache.org/book/quickstart.html). You will want to set it up in fully distributed mode, using the three OpenStack virtual machines, h-mstr, h-slv1 and h-slv2, as we have been doing in this series.
HBase is a Java application but comes with two servers, REST and Thrift, to allow integration with other programming environments, in particular Python. HappyBase (http://happybase.readthedocs.org) makes it easy to interface with Apache HBase from a remote machine; so install it on the desktop.
Make sure that the HDFS servers are running. Sign into fedora@h-mstr, start the HBase servers and create the 'documents' table with one column, 'content':
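As an alternative to the HBase shell, the table can also be created from the desktop with HappyBase. The following is a minimal sketch, not the original listing. It assumes that the HBase Thrift server is already running on h-mstr on the default port 9090; the helper names connect and ensure_documents_table are made up for illustration.

```python
def connect(host="h-mstr", port=9090):
    """Open a HappyBase connection to the Thrift server (assumed host and port)."""
    # Imported lazily so the helper below can be tried without HappyBase installed.
    import happybase
    return happybase.Connection(host, port=port)


def ensure_documents_table(connection, name="documents", family="content"):
    """Create the table with a single column family unless it already exists.

    Returns True if the table was created, False if it was already there.
    """
    # HappyBase returns table names as bytes on Python 3.
    existing = {t.decode() if isinstance(t, bytes) else t
                for t in connection.tables()}
    if name in existing:
        return False
    connection.create_table(name, {family: dict()})
    return True
```

Calling ensure_documents_table(connect()) once from the desktop should be enough; running it a second time leaves the existing table untouched.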
The following Python code, store_files_in_hbase.py, scans all the files in a directory, selecting the ones you want. It then reads each file's contents into a variable and stores the file name as the row key and the contents in the 'content' column of the 'documents' table.
It is best to index the documents at the same time as they are stored in HBase. (As usual, for simplicity, exception handling has been ignored.)
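A minimal sketch of what store_files_in_hbase.py could look like is given below. The list of extensions, the column name b"content:" (the 'content' family with an empty qualifier) and the function names are assumptions made for illustration. The table argument is expected to be a HappyBase table, for example happybase.Connection('h-mstr').table('documents').

```python
import os

# Assumed selection rule: keep only these document types.
WANTED = (".pdf", ".odt", ".doc", ".docx")


def select_files(directory, extensions=WANTED):
    """Yield (file name, full path) for regular files with a wanted extension."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name.lower().endswith(extensions):
            yield name, path


def store_files(table, directory, column=b"content:"):
    """Store each selected file, using the file name as the row key."""
    stored = []
    for name, path in select_files(directory):
        with open(path, "rb") as f:
            data = f.read()  # the whole document, in its original format
        table.put(name.encode("utf-8"), {column: data})
        stored.append(name)
    return stored
```

Reading each file in binary mode keeps the document byte for byte as it was on disk, which matters for formats such as PDF.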
Install the Elasticsearch attachment plugin on each of the three virtual machines and start the Elasticsearch server. For example, log in as fedora@h-mstr:
Repeat the steps on h-slv1 and h-slv2.
You will need to create an index of type attachment before you can start indexing the documents. The following curl script will create the index, 'documents':
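The same request can be sent from Python using only the standard library. The sketch below is hedged: the type name attachment and the field name file follow the usual examples for the attachment plugin, the address http://h-mstr:9200 is an assumption, and the exact mapping syntax may differ between Elasticsearch versions.

```python
import json
from urllib.request import Request, urlopen


def documents_index_body():
    """Index definition: one 'file' field of type attachment, kept out of _source."""
    return {
        "mappings": {
            "attachment": {
                # Do not keep the base64 copy of the file in the stored source.
                "_source": {"excludes": ["file"]},
                "properties": {"file": {"type": "attachment"}},
            }
        }
    }


def create_index(base_url="http://h-mstr:9200", send=urlopen):
    """PUT the index definition to <base_url>/documents."""
    body = json.dumps(documents_index_body()).encode("utf-8")
    request = Request(base_url + "/documents", data=body, method="PUT",
                      headers={"Content-Type": "application/json"})
    return send(request)
```

The send parameter exists only so that the request can be inspected without a running server; by default it is urlopen.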
Since you are storing and accessing the files from HBase, there is no advantage in storing the documents in the index as well. Hence the directive to exclude 'file' from the stored source.
Extend 'store_files_in_hbase.py' to include indexing of the document as well using https://gist.github.com/stevehanson/7462063/ as a reference:
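One possible shape for the extension is sketched below. The attachment type expects the file contents base64-encoded in the mapped field, so each document is encoded and sent to Elasticsearch with the file name as its id. The address in ES_URL, the column name b"content:" and the function names are assumptions, and exception handling is again left out.

```python
import base64
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

ES_URL = "http://h-mstr:9200"  # assumed Elasticsearch address


def build_index_request(filename, data, base_url=ES_URL):
    """Build the PUT request that indexes one document under its file name."""
    body = json.dumps({"file": base64.b64encode(data).decode("ascii")})
    # Quote the file name so that spaces and slashes are safe in the URL.
    url = "%s/documents/attachment/%s" % (base_url, quote(filename, safe=""))
    return Request(url, data=body.encode("utf-8"), method="PUT",
                   headers={"Content-Type": "application/json"})


def store_and_index(table, filename, data, send=urlopen):
    """Store the raw bytes in HBase, then index the same bytes in Elasticsearch."""
    table.put(filename.encode("utf-8"), {b"content:": data})
    return send(build_index_request(filename, data))
```

Using the file name both as the HBase row key and as the Elasticsearch document id means that a search hit can be turned directly into an HBase lookup.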
If all goes well, your files will be stored in the HBase table and indexed in Elasticsearch.
You can search the index 'documents' for a keyword expression as follows:
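A hedged sketch of such a search in Python follows. It assumes the documents were indexed in a field named file, that Elasticsearch is reachable at http://h-mstr:9200 and that the document ids are the file names; it uses a query_string query and returns only the ids of the matching documents.

```python
import json
from urllib.request import Request, urlopen


def build_query(expression, field="file"):
    """Query body for a keyword expression against the attachment field."""
    return {"query": {"query_string": {"default_field": field,
                                       "query": expression}}}


def hit_ids(response):
    """Extract the document ids from a decoded search response."""
    return [hit["_id"] for hit in response.get("hits", {}).get("hits", [])]


def search(expression, base_url="http://h-mstr:9200", send=urlopen):
    """Return the ids (file names) of the documents matching the expression."""
    body = json.dumps(build_query(expression)).encode("utf-8")
    request = Request(base_url + "/documents/_search", data=body, method="POST",
                      headers={"Content-Type": "application/json"})
    reply = send(request)
    return hit_ids(json.loads(reply.read().decode("utf-8")))
```

For example, search("hbase AND python") would list the file names of the documents containing both words, since query_string accepts boolean operators.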
The following example lets you get the document from the HBase 'documents' table by supplying the key:
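A minimal sketch of fetch_attachment is shown here. It assumes the contents were stored in the column b"content:" with the file name as the row key, and that table is a HappyBase table such as happybase.Connection('h-mstr').table('documents').

```python
def fetch_attachment(table, key, column=b"content:"):
    """Return the stored bytes for the given file name, or None if absent."""
    # Table.row() returns a dict of column -> value, empty if the row is missing.
    row = table.row(key.encode("utf-8"), columns=[column])
    return row.get(column)
```

The returned bytes can be written straight to a file or sent as the body of an HTTP response, since the document was stored in its original format.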
The search and fetch_attachment methods can be combined to create versatile and very useful web applications for any organisation with a very large number of documents.