Fedora 20 makes it easy to install Hadoop. Version 2.2 is packaged and available in the standard repositories. The packages place the configuration files in /etc/hadoop, with reasonable defaults so that you can get started easily. As you may expect, management of the various Hadoop services is integrated with systemd.
So, start an instance named h-mstr in OpenStack using a Fedora Cloud image (http://fedoraproject.org/get-fedora#clouds). You may get an IP address like 192.168.32.2. You will need to choose at least the m1.small flavor, i.e. 2 GB RAM and a 20 GB disk. Add an entry in the desktop's /etc/hosts for convenience:
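The entry would look like the following (192.168.32.2 is the example address above; use the one your instance actually received):

```
192.168.32.2    h-mstr
```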
Now, install and test the Hadoop packages on the virtual machine, following the article at http://fedoraproject.org/wiki/Changes/Hadoop:
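The package names below are a sketch following that wiki page; verify them against your repository, as the sub-package split may differ:

```shell
# Install Hadoop 2.2 and the example jobs from the Fedora repositories.
sudo yum install hadoop-common hadoop-hdfs hadoop-yarn \
    hadoop-mapreduce hadoop-mapreduce-examples
```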
It will download over 200 MB of packages and take about 500 MB of disk space.
On the virtual machine, create an entry in the /etc/hosts file for h-mstr using the name in /etc/hostname, e.g.
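The entry might look like this, where the second name stands for whatever /etc/hostname contains (shown as a placeholder here):

```
192.168.32.2    h-mstr    <name-from-/etc/hostname>
```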
Now, you can test the installation. First, run a script to create the needed HDFS directories.
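On Fedora, the Hadoop packaging provides a helper script for this; the script name below is an assumption based on that packaging, so check /usr/sbin if it is not on your path:

```shell
# Create the standard directories inside HDFS.
sudo hdfs-create-dirs
```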
Now, start the Hadoop services using systemctl.
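The unit names below are those used by the Fedora packages (assumed here; `systemctl list-unit-files | grep hadoop` will show the exact set on your system):

```shell
# Start the HDFS and YARN daemons on the single node.
sudo systemctl start hadoop-namenode hadoop-datanode \
    hadoop-resourcemanager hadoop-nodemanager
```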
You can list the HDFS directories created, as follows. The command may look complex, but it just runs the “hadoop fs” command in a shell as hdfs, Hadoop's internal user.
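Following the description above, one way to run it is:

```shell
# Run 'hadoop fs -ls' in a shell as the internal user hdfs.
sudo runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -ls -R /"
```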
Create a directory with the right permissions so that the user fedora can run the test scripts.
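A sketch, again acting as the internal user hdfs; /user/fedora is the conventional HDFS home directory for the user fedora:

```shell
# Create /user/fedora in HDFS and hand it over to the user fedora.
sudo runuser hdfs -s /bin/bash /bin/bash -c \
    "hadoop fs -mkdir -p /user/fedora && hadoop fs -chown fedora /user/fedora"
```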
Disable the firewall (firewalld or the iptables service) and run a MapReduce example. You can monitor the progress at http://h-mstr:8088/. Figure 1 shows an example running on three nodes.
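On a stock Fedora 20 cloud image that amounts to something like the following (only whichever of the two services is actually present on your image will exist):

```shell
# Stop the firewall so the Hadoop daemons can reach each other.
sudo systemctl stop firewalld.service iptables.service
sudo systemctl disable firewalld.service iptables.service
```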
The first test calculates pi using 10 maps and 1,000,000 samples. It took about 90 seconds and estimated the value of pi to be 3.1415844.
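The command would be along these lines; the path to the examples jar is an assumption about the Fedora layout, so locate it first with `rpm -ql hadoop-mapreduce-examples | grep jar`:

```shell
# Estimate pi with 10 maps and 1,000,000 samples per map.
hadoop jar /usr/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar \
    pi 10 1000000
```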
In the next test, you create 10 million records of 100 bytes each, i.e. 1 GB of data (~1 min), then sort it (~8 min) and finally verify it (~1 min). You may want to clean up the directories created in the process.
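The three steps map to the teragen, terasort and teravalidate jobs in the examples jar; the jar path and the directory names below are illustrative:

```shell
EXAMPLES=/usr/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar

# 10,000,000 rows of 100 bytes each = ~1 GB of input.
hadoop jar "$EXAMPLES" teragen 10000000 tera-in
hadoop jar "$EXAMPLES" terasort tera-in tera-out
hadoop jar "$EXAMPLES" teravalidate tera-out tera-report

# Clean up the directories created in the process.
hadoop fs -rm -r tera-in tera-out tera-report
```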
Stop the Hadoop services and clean up the data directories before creating and working with multiple data nodes.
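For example — the HDFS data location below is a guess, so confirm it from dfs.datanode.data.dir and dfs.namenode.name.dir in /etc/hadoop/hdfs-site.xml before deleting anything:

```shell
# Stop the daemons...
sudo systemctl stop hadoop-namenode hadoop-datanode \
    hadoop-resourcemanager hadoop-nodemanager
# ...and wipe the single-node HDFS data.
sudo rm -rf /var/cache/hadoop-hdfs/*
```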
Testing with Multiple Nodes
The following steps simplify creation of multiple instances:
Now, modify the configuration files located in /etc/hadoop.
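At minimum, the references to localhost need to become h-mstr, and the slave nodes need to be listed. A sketch, assuming the stock file names under /etc/hadoop (back them up first):

```shell
cd /etc/hadoop
# Point HDFS, YARN and MapReduce at the master instead of localhost.
sudo sed -i 's/localhost/h-mstr/g' core-site.xml hdfs-site.xml \
    mapred-site.xml yarn-site.xml
# List the data nodes.
printf 'h-slv1\nh-slv2\n' | sudo tee slaves
```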
Now, create a snapshot, Hadoop-Base. Its creation will take time, and you may not get any indication of an error if it runs out of disk space!
Launch instances h-slv1 and h-slv2 serially, using Hadoop-Base as the instance boot source. Launching the first instance from a snapshot is pretty slow. If the IP addresses are not the same as your guesses in /etc/hosts, edit /etc/hosts on each of the three nodes to the correct values. For convenience, you may want to make entries for h-slv1 and h-slv2 in the desktop's /etc/hosts file as well.
The following commands should be run as the user fedora on h-mstr.
Reformat the namenode to make sure that leftover state from the single-node tests does not cause any unexpected issues.
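As in the earlier steps, the format is run as the internal user hdfs (note that this erases the existing HDFS contents):

```shell
# Reformat HDFS for the multi-node setup.
sudo runuser hdfs -s /bin/bash /bin/bash -c "hdfs namenode -format"
```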
Start the Hadoop services on h-mstr.
Start the datanode and YARN services on the slave nodes:
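For instance, from h-mstr — assuming ssh access as fedora to the slaves, and the same unit names as in the single-node steps:

```shell
for slv in h-slv1 h-slv2; do
    # Only the worker daemons run on the slaves.
    ssh fedora@"$slv" sudo systemctl start hadoop-datanode hadoop-nodemanager
done
```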
Create the HDFS directories and a directory for the user fedora, as on a single node:
You can run the same tests as above. Although you are using three nodes, the improvement in performance compared to the single node is not expected to be noticeable, as everything is running on a single desktop.
The pi example took about 1 minute on the three nodes, compared to 90 seconds earlier. Terasort took 7 minutes instead of 8.
Both OpenStack and MapReduce are collections of interrelated services working together. Diagnosing problems, especially in the beginning, is hard, as each service has its own log files. It takes a while to learn where to look.
However, once they are working, it is incredible how easy they make distributed processing!
Note: The timings are from an AMD Phenom II X4 965 with 16 GB RAM. All virtual machines and their data were on a single physical disk.