One of the interesting problems for any organisation is the storage and retrieval of emails. There are some obvious fields, like the subject, the sender, the content which need to be searched. However, the header lines may be very important as well. For example, it is easy to fake the sender so the server which sent the mail may be significant. An email's core components are the header, the content and the attachments. Each line of the haeader is a key-value pair, where each key represents a header field. There are some user visible fields, like the subject, the sender, the recipients, the date and time. Other fields may be examined in case of problems and a user usually needs to make an additional effort to look at them on most graphical mail clients. Planning for StorageIt is very reasonable to store each email as a record in a data store. The list of possible header fields is very large. Mapping each one to a column would just not be convenient. This is where the concept of the column family in HBase can provide an answer. For example, you may create an hbase table with the following column families:
Create the hbase table from hbase shell as follows:
Loading EmailsPython has a very versatile mailbox module for working with various mail formats on disk. As an illustration, consider messages stored in the mbox format. A message may be multi-part and each part may be another message or an attachment. In order to keep the code as simple as possible, convert and store each part as a string. Ideally, the unique key for each email would not be a UUID as has been used in this example. It would start with a value useful for limiting the search of the emails. For example in a corporate setup, it may be prefixed with the department and date. As in the previous installment, use happybase to connect to the thrift server of hbase. The following code in 'load_mbox_file.py' illustrates the idea.
In the above example, attachments have been treated like any other content. Hence, the column family 'attachment' has been ignored. Searching the EmailsThere are quite a few options for sql queries on hbase tables, including Apache Phoenix, Impala, Hbase with Hive, Apache Drill. Apache Drill is interesting as is seems to be very versatile and flexible. However, at present, it only works with hbase 0.94, while the current version is 0.98. So, let us experiment with Apache Phoenix instead (http://phoenix.apache.org/), which is a sql layer specifically meant for hbase. Installation involves copying the phoenix-*.jar files in the hbase lib directory and restarting hbase. Phoenix comes with a python utility, sqlline.py. You may create and populate a table in the utility using the standard sql syntax and it will create a hbase table and populate it. You would want to use the existing hbase table, emails, by creating a view of it. For example,
The double quotes are needed as sqlline.py assumes that the table and column names are in upper case. Column names of the view will be the qualifiers of the column families. Now, you can run some queries on it:
In case you need to examine some other column in a column family, create another view. You can find out more about the options available at http://phoenix.apache.org/language/, in particular, you can create secondary indices on a view. It would be easy to include the storage of text messages, other than emails, as well in the above system. You would probably still index the contents and attachments using elasticsearch or a similar tool. Combining the text search with the sql query layer on hbase makes it possible to offer remarkable solutions to your users, analysing and searching massive volumes of data with minimal effort. |