Fedora 20 makes it easy to install Hadoop. Version 2.2 is packaged and available in the standard repositories. The packages place the configuration files in /etc/hadoop, with reasonable defaults so that you can get started easily. As you may expect, management of the various Hadoop services is integrated with systemd.
So, start an instance named h-mstr in OpenStack using a Fedora Cloud image (http://fedoraproject.org/get-fedora#clouds). You may get an IP address like 192.168.32.2. You will need to choose at least the m1.small flavor, i.e. 2 GB RAM and a 20 GB disk. Add an entry in the desktop's /etc/hosts for convenience:
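The entry would look like the following (192.168.32.2 is the example address above; use the one your instance actually received):

```
192.168.32.2    h-mstr
```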
Now, install and test the Hadoop packages on the virtual machine, following the article at http://fedoraproject.org/wiki/Changes/Hadoop:
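The package names below are a sketch following that wiki page; verify them against your repository, as the sub-package split may differ:

```shell
# Install Hadoop 2.2 and the example jobs from the Fedora repositories.
sudo yum install hadoop-common hadoop-hdfs hadoop-yarn \
    hadoop-mapreduce hadoop-mapreduce-examples
```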
It will download over 200 MB of packages and take about 500 MB of disk space.
On the virtual machine, create an entry in the /etc/hosts file for h-mstr using the name in /etc/hostname, e.g.
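The entry might look like this, where the second name stands for whatever /etc/hostname contains (shown as a placeholder here):

```
192.168.32.2    h-mstr    <name-from-/etc/hostname>
```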
Now, you can test the installation. First, run a script to create the needed HDFS directories.
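On Fedora, the Hadoop packaging provides a helper script for this; the script name below is an assumption based on that packaging, so check /usr/sbin if it is not on your path:

```shell
# Create the standard directories inside HDFS.
sudo hdfs-create-dirs
```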
Now, start the Hadoop services using systemctl.
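The unit names below are those used by the Fedora packages (assumed here; `systemctl list-unit-files | grep hadoop` will show the exact set on your system):

```shell
# Start the HDFS and YARN daemons on the single node.
sudo systemctl start hadoop-namenode hadoop-datanode \
    hadoop-resourcemanager hadoop-nodemanager
```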
You can list the HDFS directories created, as follows. The command may look complex, but it just runs the “hadoop fs” command in a shell as hdfs, Hadoop's internal user.
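Following the description above, one way to run it is:

```shell
# Run 'hadoop fs -ls' in a shell as the internal user hdfs.
sudo runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -ls -R /"
```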
Create a directory with the right permissions so that the user fedora can run the test scripts.
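A sketch, again acting as the internal user hdfs; /user/fedora is the conventional HDFS home directory for the user fedora:

```shell
# Create /user/fedora in HDFS and hand it over to the user fedora.
sudo runuser hdfs -s /bin/bash /bin/bash -c \
    "hadoop fs -mkdir -p /user/fedora && hadoop fs -chown fedora /user/fedora"
```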
Disable the firewall (firewalld or the iptables service) and run a MapReduce example. You can monitor the progress at http://h-mstr:8088/. Figure 1 shows an example running on three nodes.
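On a stock Fedora 20 cloud image that amounts to something like the following (only whichever of the two services is actually present on your image will exist):

```shell
# Stop the firewall so the Hadoop daemons can reach each other.
sudo systemctl stop firewalld.service iptables.service
sudo systemctl disable firewalld.service iptables.service
```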
The first test calculates pi using 10 maps and 1,000,000 samples. It took about 90 seconds and estimated the value of pi to be 3.1415844.
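The command would be along these lines; the path to the examples jar is an assumption about the Fedora layout, so locate it first with `rpm -ql hadoop-mapreduce-examples | grep jar`:

```shell
# Estimate pi with 10 maps and 1,000,000 samples per map.
hadoop jar /usr/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar \
    pi 10 1000000
```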
In the next test, you create 10 million records of 100 bytes each, i.e. 1 GB of data (~1 min), then sort it (~8 min) and finally verify it (~1 min). You may want to clean up the directories created in the process.
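The three steps map to the teragen, terasort and teravalidate jobs in the examples jar; the jar path and the directory names below are illustrative:

```shell
EXAMPLES=/usr/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar

# 10,000,000 rows of 100 bytes each = ~1 GB of input.
hadoop jar "$EXAMPLES" teragen 10000000 tera-in
hadoop jar "$EXAMPLES" terasort tera-in tera-out
hadoop jar "$EXAMPLES" teravalidate tera-out tera-report

# Clean up the directories created in the process.
hadoop fs -rm -r tera-in tera-out tera-report
```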
Stop the Hadoop services and clean up the data directories before creating and working with multiple data nodes.
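For example — the HDFS data location below is a guess, so confirm it from dfs.datanode.data.dir and dfs.namenode.name.dir in /etc/hadoop/hdfs-site.xml before deleting anything:

```shell
# Stop the daemons...
sudo systemctl stop hadoop-namenode hadoop-datanode \
    hadoop-resourcemanager hadoop-nodemanager
# ...and wipe the single-node HDFS data.
sudo rm -rf /var/cache/hadoop-hdfs/*
```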
Testing with Multiple Nodes
The following steps simplify creation of multiple instances:
Now, modify the configuration files located in /etc/hadoop.
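At minimum, the references to localhost need to become h-mstr, and the slave nodes need to be listed. A sketch, assuming the stock file names under /etc/hadoop (back them up first):

```shell
cd /etc/hadoop
# Point HDFS, YARN and MapReduce at the master instead of localhost.
sudo sed -i 's/localhost/h-mstr/g' core-site.xml hdfs-site.xml \
    mapred-site.xml yarn-site.xml
# List the data nodes.
printf 'h-slv1\nh-slv2\n' | sudo tee slaves
```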
Now, create a snapshot, Hadoop-Base. Its creation will take time, and you may not get any indication of an error if it runs out of disk space!
Launch instances h-slv1 and h-slv2 serially, using Hadoop-Base as the instance boot source. Launching the first instance from a snapshot is pretty slow. If the IP addresses are not the same as your guesses in /etc/hosts, edit /etc/hosts on each of the three nodes to the correct values. For convenience, you may want to make entries for h-slv1 and h-slv2 in the desktop's /etc/hosts file as well.
The following commands should be run as the user fedora on h-mstr.
Reformat the namenode to make sure that leftover state from the single-node tests does not cause any unexpected issues.
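As in the earlier steps, the format is run as the internal user hdfs (note that this erases the existing HDFS contents):

```shell
# Reformat HDFS for the multi-node setup.
sudo runuser hdfs -s /bin/bash /bin/bash -c "hdfs namenode -format"
```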
Start the Hadoop services on h-mstr.
Start the datanode and YARN services on the slave nodes:
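For instance, from h-mstr — assuming ssh access as fedora to the slaves, and the same unit names as in the single-node steps:

```shell
for slv in h-slv1 h-slv2; do
    # Only the worker daemons run on the slaves.
    ssh fedora@"$slv" sudo systemctl start hadoop-datanode hadoop-nodemanager
done
```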
Create the HDFS directories and a directory for the user fedora, as on a single node:
You can run the same tests as above. Although you are using three nodes, the improvement in performance compared to the single node is not expected to be noticeable, as everything is running on a single desktop.
The pi example took about 1 minute on the three nodes, compared to 90 seconds earlier. Terasort took 7 minutes instead of 8.
Both OpenStack and MapReduce are collections of interrelated services working together. Diagnosing problems, especially in the beginning, is hard, as each service has its own log files. It takes a while to learn where to look.
However, once they are working, it is incredible how easy they make distributed processing!
Note: The timings are from an AMD Phenom II X4 965 with 16 GB RAM. All virtual machines and their data were on a single physical disk.