CSCI 3325
Distributed Systems

Bowdoin College
Spring 2022
Instructor: Sean Barker

Hadoop Cluster Setup

Before you can run any MapReduce jobs, you will need to configure your assigned machines to run as a Hadoop cluster. Your machines are essentially blank slates, with no software installed or configured beyond the default Linux installation. You have full administrative permissions (aka 'root'), which means you can configure your machines however you like using the sudo ("superuser do") command -- but be careful, since this also lets you modify (or break) arbitrary system files!

The setup procedure is a one-time operation, and only needs to be completed once per group (not once for every group member). Although every group member has their own account on the machines, it doesn't matter whose account is used to complete the cluster setup.

Many of these steps need to be repeated on each of your cluster machines. You may wish to open up multiple terminal windows to simultaneously configure all of your machines. Unless otherwise noted, perform each step on each of your cluster machines.

Before you begin, you will need the IP addresses of your four cluster machines, which I will provide over Slack once you confirm your group members. You will be able to SSH into your cluster machines using your existing SSH key, just as you do to access turing.
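For example, logging into one of your cluster machines looks something like the following (the username, key filename, and IP shown here are placeholders; substitute your own):
    $ ssh -i sbarker-keypair sbarker@1.2.3.4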

1. Cluster SSH Configuration

  1. Most of the commands you'll run on your cluster machines require administrative privileges. One way to get these is to preface every command with sudo, but an easier way (given that we need sudo for almost every command here anyway) is to just spawn a superuser shell:
    $ sudo bash
    #
    Within this shell, every command is implicitly run with sudo (i.e., as root). Of course, this also means you should exercise caution with what you run! By convention, we denote a superuser shell prompt by # (as opposed to a non-sudo shell prompt, denoted by $).
  2. Java is not part of the base system, so first you need to install a JDK. Most software on Linux is easily installed via a package manager; on these systems, the package manager is yum:
    # yum install java-1.8.0-openjdk-devel
    Remember to do this on every machine.
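    If you want to confirm the install worked, you can check the installed Java version (it should report a 1.8.0 OpenJDK build):
    # java -version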
  3. While we're at it, it's a good idea to make sure all software is up-to-date before you start customizing your machine. Use yum to update everything (this should complete quickly, and may not need to update anything):
    # yum update
  4. We need to configure your machines so that they can SSH among themselves (which Hadoop will use for communication and control). Currently you are able to SSH into any of your machines using your private key, but the machines aren't able to log into each other. You'll need to create a new SSH key for use by the Hadoop cluster itself (distinct from your existing key that gives you access to the cluster). Create the new key on turing (the first command creates the key; the second converts it to a more compatible format):
    $ ssh-keygen -t rsa -N '' -C 'hadoop' -f hadoop-keypair
    $ ssh-keygen -p -N '' -m pem -f hadoop-keypair
    These commands create hadoop-keypair (the private key) and hadoop-keypair.pub (the public key). Now copy both keyfiles from turing out to each cluster machine (substituting your own username and each machine's IP). Do this once per machine, but only create the key once; don't create a separate key for each cluster machine:
    $ scp -i sbarker-keypair hadoop-keypair* sbarker@1.2.3.4:~/
  5. Now, back on (all of) the cluster machines, we're going to modify the SSH settings to allow root logins using the newly-created SSH key. Be careful with these steps, as errors could corrupt your SSH configuration and lock you out of your machine entirely. Copy the hadoop-keypair private key into the SSH directory for the root user (renaming it to id_rsa, which is the default filename expected by SSH):
    # cp ~sbarker/hadoop-keypair /root/.ssh/id_rsa
    Now, copy the public keyfile to the root user's authorized_keys file, which controls which keys provide login access. Answer 'yes' when asked if you want to overwrite the existing file.
    # cp ~sbarker/hadoop-keypair.pub /root/.ssh/authorized_keys
  6. Normally, SSH will prompt you to verify new hosts that you connect to. This behavior can cause headaches for us, so we'll just disable it by editing the ssh client configuration file:
    # nano /etc/ssh/ssh_config
    Find the uncommented line reading Host * (there is another commented line; ignore that one). Just under it, add the following line:
    StrictHostKeyChecking no
    Save and quit the file.
  7. Run a quick test from one of the machines to make sure that your SSH configuration is working on all machines. As root, try SSHing to each of your machines (including the machine you're starting from -- i.e., you should be able to SSH into the local machine). For example, from a root shell on cluster machine 1, if cluster machine 1's IP is 1.2.3.4, just run ssh 1.2.3.4. If the login succeeds, run 'exit' to return to your original shell and then try the next machine. If you get a "Permission denied" error when connecting to any of your machines, you made a mistake during the SSH configuration, so go back and check your work or ask for help.
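    For example, a full round of checks from a root shell on one machine might look like this (with placeholder IPs; use your machines' actual IPs):
    # ssh 1.2.3.4
    # exit
    # ssh 2.3.4.5
    # exit
    (and so on for your remaining machines)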
  8. Now we need to create a directory for Hadoop to store most of its data. Most importantly, this directory is where Hadoop will store the local data blocks of HDFS (the Hadoop Distributed File System). Since we may be working with a lot of data, we want to put this directory on an external disk instead of the small system drive. Your machines each have 200 GB of external storage attached on a disk separate from the boot drive (which is only 16 GB). This secondary disk (/dev/sdb) is completely unformatted, so first we need to initialize it with a filesystem (we'll use the ext4 filesystem):
    # mkfs -t ext4 /dev/sdb
    Now let's create a directory that we'll use to "mount" (i.e., attach) the new filesystem.
    # mkdir /mnt/data
    Right now this is just a regular directory on the system filesystem - now let's mount the new filesystem and bind it to that directory:
    # mount /dev/sdb /mnt/data
    To check that this worked, run df -h to list all mounted filesystems. In the list, you should see (among other things) the main filesystem mounted on / and the newly mounted filesystem on /mnt/data with a size of about 200 GB (the device may be listed as /dev/xvdb rather than /dev/sdb; that's fine), i.e., a line something like this:
    /dev/xvdb       197G   61M  187G   1% /mnt/data
    Assuming it worked, now we can create a directory on the attached disk for Hadoop to use for HDFS storage:
    # mkdir /mnt/data/hadoop
  9. Each of your machines actually has two IP addresses: the public IP (which is what you use to SSH into each machine) and an internal (private) IP (which only works from within the cluster). To find the internal IPs, run ifconfig on each machine - the IP will appear in a line looking like the following (and should start with 172):
    inet 172.31.37.52
    For future reference (and to prevent later confusion), record all your public IPs and their corresponding internal IPs from ifconfig. You will need the internal IPs for configuring Hadoop in the next section.
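    If you find the full ifconfig output hard to scan, one optional shortcut is to filter it down to just the internal address lines, e.g.:
    # ifconfig | grep 'inet 172'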

2. Hadoop Configuration

  1. First we need to download the Hadoop software. Unfortunately, this isn't quite as easy as running a one-line yum command. On each of your machines, download the Hadoop files using wget, unpack the archive, and store them at /usr/local/hadoop:
    # wget https://archive.apache.org/dist/hadoop/core/hadoop-2.10.1/hadoop-2.10.1.tar.gz
    # tar xzvf hadoop-2.10.1.tar.gz
    # mv hadoop-2.10.1 /usr/local/hadoop
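    If you'd like to double-check that the files landed in the right place, you can ask Hadoop to print its version:
    # /usr/local/hadoop/bin/hadoop version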
    
  2. Pick one of your machines to configure as the Hadoop master. We're going to get all of the Hadoop configuration files set up on the master (next few steps), then just copy them to the rest of your cluster. Make a note of the IPs (public and private) of the master. For all of the following configuration files, use the internal (private) Amazon IPs only - things will break if you use the public IPs!
  3. On your chosen master, open up /usr/local/hadoop/etc/hadoop/slaves in your favorite editor. Change this file to contain the private IP addresses of all your non-master machines (one per line), and delete any other contents of the file. For example:
    172.31.0.11
    172.31.0.12
    172.31.0.13
  4. Similarly, open up /usr/local/hadoop/etc/hadoop/masters and put the private IP of your master machine in this file. This should be the only line in this file.
  5. Now (and still on the master only) we have to change some other configuration files in the /usr/local/hadoop/etc/hadoop directory. Insert the following XML into the four files below (leaving the XML header intact), within the existing <configuration> tags. Replace each MASTER-IP placeholder with your actual private master IP.

    In core-site.xml:
      <property>
        <name>fs.default.name</name>
        <value>hdfs://MASTER-IP:9000</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/mnt/data/hadoop/tmp</value>
      </property>
    In hdfs-site.xml:
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
    In yarn-site.xml:
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>MASTER-IP</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>1536</value>
      </property>
      <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>1536</value>
      </property>
      <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>128</value>
      </property>
      <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
      </property>
    In mapred-site.xml (first create this file by copying mapred-site.xml.template; the copy command is shown just after this step):
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>512</value>
      </property>
      <property>
        <name>mapreduce.map.memory.mb</name>
        <value>300</value>
      </property>
      <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>300</value>
      </property>
      <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx200m</value>
      </property>
      <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx200m</value>
      </property>
    In hadoop-env.sh, change the export JAVA_HOME=... line to the following:
    export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk/"
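    For reference, here is how you might create mapred-site.xml from its template on the master, plus an optional sanity check that the JAVA_HOME directory actually exists (it should, given the JDK package installed earlier):
    # cd /usr/local/hadoop/etc/hadoop
    # cp mapred-site.xml.template mapred-site.xml
    # ls /usr/lib/jvm/java-1.8.0-openjdk/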
  6. Now, copy your modified configuration files from the master out to the rest of your cluster. Run the following command on the master once for each worker machine, substituting each worker's IP:
    # scp -r /usr/local/hadoop/etc/hadoop/* 1.2.3.4:/usr/local/hadoop/etc/hadoop/
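    To spot-check that the copy worked, you can confirm on any worker that (for example) the masters file now contains your master's private IP:
    # cat /usr/local/hadoop/etc/hadoop/masters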

3. Launch the Cluster

  1. Hadoop stores all of its data in a distributed file system called HDFS (the Hadoop Distributed File System). The basic architecture of HDFS is a single "NameNode", which is a master server that manages the HDFS filesystem, and then any number of "DataNodes", which actually hold the distributed data on attached storage. We will run the NameNode server on your designated master and a DataNode on every other machine. Before we can run any of this, however, we need to format the new HDFS filesystem. Run this command (once) from the master:
    # /usr/local/hadoop/bin/hdfs namenode -format
    If the above command works, it will start the NameNode, run for a few seconds, dump a lot of output, and then exit (having formatted the distributed filesystem).
  2. Now, fire up the HDFS daemon programs (this will start the NameNode as well as all DataNodes on all machines):
    # /usr/local/hadoop/sbin/start-dfs.sh
    This script will start the NameNode locally, then connect to all of the worker machines and start DataNodes there.
  3. The best way to check that this worked is the jps command, which lists all running Java processes on the local machine. Run jps on the master and you should see a NameNode entry. Also check the other machines - running jps as root on any other machine should show a running DataNode. If you don't see a DataNode running on every worker machine, then something went wrong previously. Hadoop logfiles are written to /usr/local/hadoop/logs/, which may tell you something about what went wrong.
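    For reference, jps output on a healthy master at this point might look roughly like this (the process IDs will differ, and you may also see a SecondaryNameNode entry):
    # jps
    4817 NameNode
    5139 Jps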
  4. Now, we'll start YARN ("Yet Another Resource Negotiator"), which is the Hadoop framework responsible for resource management and job scheduling (i.e., cluster management). On the master:
    # /usr/local/hadoop/sbin/start-yarn.sh
    As before, use jps to make sure this succeeded. On the master, you should see a new ResourceManager (in addition to the previous NameNode). On each of the other nodes, you should see a NodeManager (but no ResourceManager). If everything seems to be running, proceed to the next section.

4. Test the Cluster

  1. First we'll test the distributed filesystem. Interaction with HDFS is done exclusively via the hdfs program. First let's copy our Hadoop configuration directory into HDFS (chosen arbitrarily as convenient test data; note that this may take a little while):
    # /usr/local/hadoop/bin/hdfs dfs -put /usr/local/hadoop/etc/hadoop /test
    This command copies the Hadoop configuration directory (from our local, non-HDFS filesystem) into HDFS (using the 'put' command) as a directory named test in the root HDFS directory. You might see a bunch of InterruptedException warnings like the following during the copy, which can be ignored:
    22/03/30 18:59:38 WARN hdfs.DataStreamer: Caught exception
    java.lang.InterruptedException
    	at java.lang.Object.wait(Native Method)
    	at java.lang.Thread.join(Thread.java:1252)
    	at java.lang.Thread.join(Thread.java:1326)
    	at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
    	at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
    	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
    To check that the copy into HDFS worked, we can use the 'ls' command to view the directory that we just created within HDFS:
    # /usr/local/hadoop/bin/hdfs dfs -ls /test
    This command should show a listing of the Hadoop configuration files (a copy of which now lives inside HDFS). Other DFS commands that work as you'd expect include 'cat' (view a file), 'cp' (copy a file within HDFS), and 'get' (copy a file from HDFS back into the local filesystem); a couple of examples are shown below. Also note that since this is a distributed filesystem, it doesn't matter which node you run these commands from - they're all accessing the same (distributed) filesystem.
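    For example, using one of the files just copied into /test (the filenames there mirror the local configuration directory):
    # /usr/local/hadoop/bin/hdfs dfs -cat /test/core-site.xml
    # /usr/local/hadoop/bin/hdfs dfs -get /test/core-site.xml /tmp/core-site-copy.xml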
  2. Now let's run an actual MapReduce job using one of the example jobs provided with Hadoop. Here's a distributed grep example that searches for a pattern in the test directory we just stored. In particular, this job will search all the files in the test directory we created in HDFS, looking for text that starts with the three letters dfs, and then stores the output results (in HDFS) in the test-output directory:
    # /usr/local/hadoop/bin/hadoop jar \
          /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar \
          grep "/test" "/test-output" "dfs[a-z.]+"
    This command will launch the job and print periodic status updates on its progress (note that the web monitor URL won't be accessible due to the firewall settings). You should not see any nasty error messages during the job run if everything is working correctly (apart from possibly the same warnings as before). Your tiny cluster is unlikely to surprise you with its blazing speed; expect the job to take a minute or two to execute.
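    Since the web monitor isn't reachable, one optional way to check on the job from another terminal is to ask YARN for its list of running applications:
    # /usr/local/hadoop/bin/yarn application -list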
  3. Once it's finished, to see the actual output files of the job, we can view them in HDFS like so:
    # /usr/local/hadoop/bin/hdfs dfs -cat /test-output/*
    The output of this example job just includes the matched pieces of text and how many times they appeared in the searched documents. If you'd like to compare to a non-distributed grep (which will also show the entire lines), you can run the following:
    # grep -P 'dfs[a-z.]+' /usr/local/hadoop/etc/hadoop/*
  4. To shut down the Hadoop cluster, all you need to do is stop YARN and HDFS, as follows:
    # /usr/local/hadoop/sbin/stop-yarn.sh
    # /usr/local/hadoop/sbin/stop-dfs.sh
    In general, it's fine to leave the cluster running when not running jobs, though if you think you might've broken something, it's a good idea to reboot the cluster by stopping and then restarting the daemons.
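    If you do restart the cluster, that just means re-running the two start scripts from the master, same as in Part 3:
    # /usr/local/hadoop/sbin/start-dfs.sh
    # /usr/local/hadoop/sbin/start-yarn.sh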
  5. If you inadvertently break your cluster or start encountering unexpected errors that you didn't see before, you may want to completely reset HDFS and start from scratch. To do so, first stop the YARN and HDFS daemons (as above), and then delete the entire HDFS data directory on each of your four cluster machines. ONLY DO THIS IF YOU WANT TO RESET HDFS - THIS WILL DESTROY EVERYTHING IN HDFS!:
    # rm -rf /mnt/data/hadoop/tmp/dfs
    Once you've done that on all your cluster machines, you can return to the beginning of Part 3, starting with formatting a new HDFS filesystem, and proceed from there.
  6. If everything above is working, congratulations, you've built a working Hadoop cluster! The next thing you should do is try running your own Hadoop job.