Archiving and Retrieving Emails using HBase and Phoenix

One of the interesting problems for any organisation is the storage and retrieval of emails. There are some obvious fields that need to be searched, like the subject, the sender and the content. However, the header lines may be very important as well. For example, it is easy to fake the sender, so the server which actually sent the mail may be significant.

An email's core components are the header, the content and the attachments. Each line of the header is a key-value pair, where the key identifies a header field. Some fields are visible to the user, like the subject, the sender, the recipients, and the date and time. Other fields may need to be examined in case of problems, and on most graphical mail clients a user has to make an additional effort to look at them.
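
Python's standard mailbox and email modules expose the header as exactly such a key-value mapping. A minimal sketch (the file name here is just a placeholder):

#! /usr/bin/python
import mailbox

# Open an mbox file and print the header fields of the first message
mbox = mailbox.mbox('sample.mbox')
for msg in mbox:
    for key, value in msg.items():
        # each header line is a (key, value) pair
        print('%s: %s' % (key, value))
    break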

Planning for Storage

It is very reasonable to store each email as a record in a data store.

The list of possible header fields is very large. Mapping each one to a column would just not be convenient.

This is where the concept of a column family in HBase can provide an answer. For example, you may create an HBase table with the following column families:

  • Envelope (header fields like sender, recipients, subject, date)

  • Header (header fields not included in the envelope)

  • Content

  • Attachment

Create the HBase table from the HBase shell as follows:

[fedora@h-mstr hbase-0.96.2-hadoop2]$ bin/hbase shell

hbase(main):001:0> create 'emails','envelope','header','content','attachment'

0 row(s) in 5.0390 seconds

=> Hbase::Table - emails

hbase(main):002:0> exit
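
Alternatively, assuming HBase's Thrift server is running on 'h-mstr' (as in the later examples), the same table could be created programmatically with happybase. A minimal sketch:

#! /usr/bin/python
import happybase

# Create the 'emails' table with the four column families listed above
connection = happybase.Connection('h-mstr')
connection.create_table(
    'emails',
    {'envelope': {}, 'header': {}, 'content': {}, 'attachment': {}}
)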

Loading Emails

Python has a very versatile mailbox module for working with various mail formats on disk. As an illustration, consider messages stored in the mbox format. A message may be multi-part, and each part may be another message or an attachment. To keep the code as simple as possible, convert and store each part as a string. Ideally, the unique key for each email would not be a UUID, as has been used in this example; it would start with a value useful for limiting the search of the emails. For example, in a corporate setup, it may be prefixed with the department and date, as sketched below.
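
A hypothetical helper for such a key scheme might look as follows (a sketch only; the department code and the date format are assumptions, and the script below sticks to a plain UUID):

import time
import uuid
from email.utils import parsedate

def make_row_id(department, msg):
    # Prefix the key with department and date so that scans can be restricted;
    # fall back to 'nodate' if the Date header is missing or unparsable
    parsed = parsedate(msg.get('Date', ''))
    date = time.strftime('%Y%m%d', parsed) if parsed else 'nodate'
    return '%s-%s-%s' % (department, date, uuid.uuid1())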

As in the previous installment, use happybase to connect to the Thrift server of HBase. The following code, in 'load_mbox_file.py', illustrates the idea.

#! /usr/bin/python
import mailbox
import happybase
import uuid
import sys

ENVELOPE = ['To', 'From', 'Subject', 'Return-Path', 'Date']

# A simple approach to multipart messages:
# store each part as a string, ignoring the type of content
def multipart_payloads(columns, msg):
    num = 1
    for part in msg.get_payload():
        columns['content:' + str(num)] = part.as_string()
        num += 1

def store(table, msg):
    # Create a unique id for the row
    row_id = str(uuid.uuid1())
    # Process the headers
    columns = {}
    # Some header keywords may appear multiple times
    for key in set(msg.keys()):
        if key in ENVELOPE:
            cf = 'envelope:' + key
        else:
            cf = 'header:' + key
        # Get all the entries for a key as a list and store as a string.
        # eval can be used to get the list back.
        columns[cf] = str(msg.get_all(key))
    # Multipart?
    if msg.is_multipart():
        multipart_payloads(columns, msg)
    else:
        columns['content:text'] = msg.get_payload()
    table.put(row_id, columns)

connection = happybase.Connection('h-mstr')
table = connection.table('emails')
mbox = mailbox.mbox(sys.argv[1])
for message in mbox:
    store(table, message)


Run 'load_mbox_file.py' with the mbox filename as a parameter. The data should now be stored in the HBase table 'emails'.
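
To verify that the rows have landed, a quick happybase scan can print a few of the stored subjects (a sketch, using the same Thrift connection as above):

#! /usr/bin/python
import happybase

connection = happybase.Connection('h-mstr')
table = connection.table('emails')

# Print the row key and subject of the first few stored emails
for row_key, data in table.scan(columns=['envelope:Subject'], limit=5):
    print('%s %s' % (row_key, data.get('envelope:Subject')))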

In the above example, attachments have been treated like any other content. Hence, the column family 'attachment' has been ignored; one way to fill it is sketched below.
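
A simple approach (an assumption, not part of the script above) is to treat any part that carries a filename as an attachment and store it separately:

def store_attachments(columns, msg):
    # Walk through all parts; any part with a filename is stored
    # in the 'attachment' column family, keyed by a running number
    num = 1
    for part in msg.walk():
        filename = part.get_filename()
        if filename:
            columns['attachment:name' + str(num)] = filename
            columns['attachment:data' + str(num)] = part.get_payload(decode=True) or ''
            num += 1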

Searching the Emails

There are quite a few options for SQL queries on HBase tables, including Apache Phoenix, Impala, HBase with Hive and Apache Drill. Apache Drill is interesting, as it seems to be very versatile and flexible. However, at present, it only works with HBase 0.94, while the current version is 0.98. So, let us experiment with Apache Phoenix instead (http://phoenix.apache.org/), which is an SQL layer specifically meant for HBase.

Installation involves copying the phoenix-*.jar files into the HBase lib directory and restarting HBase. Phoenix comes with a Python utility, sqlline.py. You may create and populate a table in this utility using standard SQL syntax, and it will create and populate an HBase table behind the scenes.

Here, you would want to use the existing HBase table, 'emails', by creating a view over it. For example:

[fedora@h-mstr phoenix-4.2.2-bin]$ bin/sqlline.py localhost

0: jdbc:phoenix:localhost> CREATE VIEW eview (pk VARCHAR PRIMARY KEY, "envelope"."Subject" VARCHAR, "envelope"."To" VARCHAR, "envelope"."From" VARCHAR, "envelope"."Date" VARCHAR,"header"."Content-Type" VARCHAR) AS SELECT * FROM "emails";

The double quotes are needed because sqlline.py assumes that unquoted table and column names are in upper case. Each column name of the view maps to a column family and qualifier of the underlying HBase table.

Now, you can run some queries on it:

0: jdbc:phoenix:localhost> select * from eview where "Subject" is not null limit 5;

0: jdbc:phoenix:localhost> select * from eview where "Content-Type" ilike '%multipart%' limit 5;

0: jdbc:phoenix:localhost> select "From", "Subject" from eview where "Date" ilike '%jan%2011%' limit 5;

In case you need to examine some other column in a column family, create another view. You can find out more about the options available at http://phoenix.apache.org/language/; in particular, you can create secondary indices on a view.

It would be easy to extend the above system to store text messages other than emails as well.

You would probably still index the contents and attachments using Elasticsearch or a similar tool. Combining text search with the SQL query layer on HBase makes it possible to offer remarkable solutions to your users, analysing and searching massive volumes of data with minimal effort.
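
As an illustration of the glue code involved, the stored content could be pushed into Elasticsearch with the HBase row key as the document id. This is only a sketch, assuming the elasticsearch-py client of that era (where doc_type is still part of the index() call) and a node on localhost:

#! /usr/bin/python
import happybase
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes Elasticsearch on localhost:9200
connection = happybase.Connection('h-mstr')
table = connection.table('emails')

# Index the subject and plain-text content of each stored email,
# using the HBase row key as the Elasticsearch document id
for row_key, data in table.scan(columns=['envelope:Subject', 'content:text']):
    doc = {
        'subject': data.get('envelope:Subject', ''),
        'content': data.get('content:text', ''),
    }
    es.index(index='emails', doc_type='email', id=row_key, body=doc)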
