ElasticSearch on OpenStack

You have a huge number of documents. It would be nice if you could search them almost as well as Google does. Lucene (http://lucene.apache.org/) has been helping organisations search their data for years. Projects like Elasticsearch (http://www.elasticsearch.org/) build on top of Lucene to provide distributed, scalable solutions for searching huge volumes of data. A good example is the use of Elasticsearch at WordPress.com: http://gibrown.wordpress.com/2014/01/09/scaling-elasticsearch-part-1-overview/.

In this experiment, you start with three nodes on OpenStack – h-mstr, h-slv1 and h-slv2 – as in the previous article. Download the rpm package from the Elasticsearch site and install it on each of the three nodes.
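For example, assuming you have downloaded the noarch rpm to each node (the exact file name will vary with the release; the one below is illustrative), the installation could look like:

$ sudo yum localinstall elasticsearch-<version>.noarch.rpm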

The configuration file is /etc/elasticsearch/elasticsearch.yml. You will need to configure it on each of the three nodes. Consider the following settings on the h-mstr node:

cluster.name: es
node.master: true
node.data: true
index.number_of_shards: 10
index.number_of_replicas: 0

You have given the cluster the name es; the same value must be used on the h-slv1 and h-slv2 nodes so that they join the same cluster. The h-mstr node will act as a master and store data as well. A master node processes requests by distributing the search across the data nodes and consolidating the results. The next two parameters relate to the index. The number of shards is the number of sub-indices that are created and distributed among the data nodes; the default is five. The number of replicas is the number of additional copies of each index that are kept; the default is one, but here you have chosen to keep no replicas.

You may use the same values on the h-slv1 and h-slv2 nodes, or set node.master to false on them. Once you have loaded the data, you will find that the h-mstr node has 4 shards while h-slv1 and h-slv2 have 3 shards each. The indices will be in the directory /var/lib/elasticsearch/es/nodes/0/indices/ on each node.
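For instance, a minimal sketch of /etc/elasticsearch/elasticsearch.yml on h-slv1, assuming you want the slave nodes to store data but not act as masters, could be:

cluster.name: es
node.master: false
node.data: true
index.number_of_shards: 10
index.number_of_replicas: 0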

You start Elasticsearch on each node by executing:

$ sudo systemctl start elasticsearch
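Since Fedora uses systemd, you may also want to enable the service so that it comes up automatically after a reboot:

$ sudo systemctl enable elasticsearch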

You can check the status of the cluster by browsing http://h-mstr:9200/_cluster/health?pretty.
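The same check can be made from the command line with curl. On a healthy three-node cluster with the settings above, the response would look something like the following (the exact values depend on your cluster's state):

$ curl 'http://h-mstr:9200/_cluster/health?pretty'
{
  "cluster_name" : "es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 10,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}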

Loading the Data

You want to index the documents located on your desktop. Elasticsearch has a Python interface, which is available in the Fedora 20 repository, so install it on your desktop:

$ sudo yum install python-elasticsearch
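A quick way to confirm that the client can reach the cluster is to ask the server for its basic information from a Python prompt (info() is part of the standard client API):

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch(['h-mstr'])
>>> es.info()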

The following is a sample program to index LibreOffice documents. The comments embedded in the code should, hopefully, make it clear that this is not a complex task.

#!/usr/bin/python
import sys
import os
import subprocess
from elasticsearch import Elasticsearch

FILETYPES = ['odt', 'doc', 'sxw', 'abw']

# Convert a document file into a text file in /tmp and return the text file name
def convert_to_text(inpath, infile):
    subprocess.call(['soffice', '--headless', '--convert-to', 'txt:Text',
                     '--outdir', '/tmp', '/'.join([inpath, infile])])
    return '/tmp/' + infile.rsplit('.', 1)[0] + '.txt'

# Read the text file and return it as a single string
def process_file(p, f):
    textfile = convert_to_text(p, f)
    return ' '.join([line.strip() for line in open(textfile)])

# Walk all files under a root path and select the document files
def get_documents(path):
    for curr_path, dirs, files in os.walk(path):
        for f in files:
            try:
                if f.rsplit('.', 1)[1].lower() in FILETYPES:
                    yield curr_path, f
            except IndexError:
                # the file name has no extension
                pass

# Run this program with the root directory as a parameter.
# If none is given, the current directory is used.
def main(argv):
    try:
        path = argv[1]
    except IndexError:
        path = '.'
    es = Elasticsearch(hosts='h-mstr')
    id = 0
    # index each document with 3 attributes:
    # path, title (the file name) and text (the text content)
    for p, f in get_documents(path):
        text = process_file(p, f)
        doc = {'path': p, 'title': f, 'text': text}
        id += 1
        es.index(index='documents', doc_type='text', id=id, body=doc)

if __name__ == "__main__":
    main(sys.argv)
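Assuming you save the program as index_documents.py (the name is arbitrary), you can index, say, the documents under ~/Documents with:

$ python index_documents.py ~/Documents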

Once the index is created, you cannot increase the number of shards. However, you can change the replication value as follows:

$ curl -XPUT 'h-mstr:9200/documents/_settings' -d '
{
  "index" : {
    "number_of_replicas" : 1
  }
}'
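You can verify that the new value is in effect by querying the index settings:

$ curl 'h-mstr:9200/documents/_settings?pretty'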

Now, with 10 primary shards plus 10 replicas, the number of shards will be 7, 7 and 6 on the three nodes. As you would expect, if one of the nodes is down, you will still be able to search all the documents. If more than one node is down, the search will return partial results from the shards that are still available.

Searching the Data

The program search_documents.py below uses the query_string query of Elasticsearch to search for the string passed as a parameter in the content field, 'text'. It requests the fields 'path' and 'title' in the response, which are joined to print the full file names of the documents found.

#!/usr/bin/python
import sys
from elasticsearch import Elasticsearch

def main(query_string):
    es = Elasticsearch(['h-mstr'])
    query_body = {'query':
                      {'query_string':
                          {'default_field': 'text',
                           'query': query_string}},
                  'fields': ['path', 'title']
                 }
    # response is a dictionary with nested dictionaries
    response = es.search(index='documents', body=query_body)
    for hit in response['hits']['hits']:
        print '/'.join(hit['fields']['path'] + hit['fields']['title'])

# run the program with the search expression as a parameter
if __name__ == '__main__':
    main(' '.join(sys.argv[1:]))

You can now search using expressions like the following (by default, Lucene treats multiple terms as an OR search; AND requires both terms, while a + prefix marks a term as mandatory):

$ python search_documents.py smalltalk objects

$ python search_documents.py smalltalk AND objects

$ python search_documents.py +smalltalk objects

$ python search_documents.py +smalltalk +objects



More details can be found at the Lucene and ElasticSearch sites.

Open source options let you build a custom, scalable search engine, and you can include information from your databases, documents, emails, etc., very conveniently. Hence, it is a shame to come across sites which do not offer an easy way to search their content; one hopes that website managers will add that functionality using tools like Elasticsearch!
