Before you can run any MapReduce jobs, you will need to configure your assigned machines to run as a Hadoop cluster. Your machines are essentially blank slates, with no software installed or configured other than the default Linux installation. You have full administration permissions (aka 'root'), which means you can arbitrarily configure your machines using the sudo command ("superuser do") -- but be careful, since this also allows you to arbitrarily modify system files!
The setup procedure is a one-time operation, and only needs to be completed once per group (not once for every group member). Although every group member has their own account on the machines, it doesn't matter whose account is used to complete the cluster setup.
Many of these steps need to be repeated on each of your cluster machines. You may wish to open up multiple terminal windows to simultaneously configure all of your machines. Unless otherwise noted, perform each step on each of your cluster machines.
Before you begin, you will need to have the IP addresses of your four cluster machines, which I will provide over Slack once you confirm your group members. You will be able to SSH into your cluster machines using your existing SSH key, the same as you do to access turing.
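Since most steps below must be repeated on every machine, a small shell loop can save typing. Here is a sketch using placeholder IPs (not your real ones), written as a dry run that only prints the commands it would execute; remove the `echo` to actually run them over SSH:

```shell
# Hypothetical placeholder IPs -- substitute the four addresses you were given.
MACHINES="1.2.3.4 2.3.4.5 3.4.5.6 4.5.6.7"

# Dry run: print the command that would be run on each machine.
# Remove the `echo` to actually execute it everywhere via SSH.
for ip in $MACHINES; do
    echo ssh "$ip" "sudo yum -y update"
done
```

This is purely a convenience; opening one terminal window per machine works just as well.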
Nearly every command in this setup must be run as root. You could prefix each command with sudo, but an easier way (given that we need sudo on almost every command here anyway) is to just spawn a superuser shell:

$ sudo bash

Within this shell, every command is implicitly run with sudo. Of course, this also means you should exercise caution with what you run! By convention, we denote a superuser shell prompt by # (as opposed to a non-sudo shell prompt, denoted by $).
First, install Java 8 (which Hadoop requires) using the yum package manager:

# yum install java-1.8.0-openjdk-devel

Remember to do this on every machine. It's also a good idea to use yum to update everything (this should complete quickly, and may not need to update anything):

# yum update
Next, set up SSH keys so that your machines can log into each other (Hadoop's startup scripts need this to launch daemons on the worker machines). Create a single new keypair on turing (the first command creates the key; the second command converts it to a more compatible format):

$ ssh-keygen -t rsa -N '' -C 'hadoop' -f hadoop-keypair
$ ssh-keygen -p -N '' -m pem -f hadoop-keypair

These commands will have created hadoop-keypair (the private key) and hadoop-keypair.pub (the public key). You will now copy the keyfiles from turing out to each cluster machine (use your appropriate username). Make sure to do this once per machine (but only create the key once; don't create a separate key for each cluster machine):

$ scp -i sbarker-keypair hadoop-keypair* sbarker@1.2.3.4:~/
On each cluster machine, copy the hadoop-keypair private key into the SSH directory for the root user (renaming it to id_rsa, which is the default filename expected by SSH):

# cp ~sbarker/hadoop-keypair /root/.ssh/id_rsa

Now, copy the public keyfile to the root user's authorized_keys file, which controls which keys provide login access. Answer 'yes' if asked whether you want to overwrite the existing file.

# cp ~sbarker/hadoop-keypair.pub /root/.ssh/authorized_keys
Next, disable strict host key checking (so that SSH connections between your machines won't stop at an interactive confirmation prompt) by editing the system-wide ssh client configuration file:

# nano /etc/ssh/ssh_config

Find the uncommented line reading Host * (there is another, commented-out line; ignore that one). Just under it, add the following line:

StrictHostKeyChecking no

Save and quit the file.
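After the edit, the relevant portion of /etc/ssh/ssh_config should look something like the following sketch (any other existing lines under Host * can be left alone):

```
Host *
        StrictHostKeyChecking no
```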
To test your SSH configuration, try logging into each of your machines from your superuser shell; to connect to the machine at 1.2.3.4, just run ssh 1.2.3.4. If each login succeeds, just run exit to get back to the master shell and then try the next machine. If you get a "Permission denied" error when connecting to any of your machines, you made a mistake during the SSH configuration, so go back and check your work or ask for help.
Next, set up the data disk that Hadoop will use for storage. Each machine's attached data disk (/dev/sdb) is completely unformatted, so first we need to initialize it with a filesystem (we'll use the ext4 filesystem):

# mkfs -t ext4 /dev/sdb

Now let's create a directory that we'll use to "mount" (i.e., attach) the new filesystem:

# mkdir /mnt/data

Right now this is just a regular directory on the system filesystem -- now let's mount the new filesystem and bind it to that directory:

# mount /dev/sdb /mnt/data

To check that this worked, run df -h to list all mounted filesystems. In the list, you should see (among other things) the main filesystem mounted on / and the newly mounted filesystem on /mnt/data with a size of about 200 GB, i.e., a line something like this:

/dev/xvdb 197G 61M 187G 1% /mnt/data

Assuming it worked, we can now create a directory on the attached disk for Hadoop to use for HDFS storage:

# mkdir /mnt/data/hadoop
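One optional aside: a mount made this way does not survive a reboot. If you want /mnt/data to be re-mounted automatically at boot, you could add an entry like the following to /etc/fstab (a sketch; the nofail option keeps the boot from hanging if the disk is ever missing):

```
/dev/sdb  /mnt/data  ext4  defaults,nofail  0  2
```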
Each of your machines has both a public IP address (the one you use to connect from outside) and an internal (private) IP address. To find a machine's internal IP, run ifconfig on each machine -- the IP will appear in a line looking like the following (and should start with 172):

inet 172.15.37.52

For future reference (and to prevent later confusion), record all your public IPs and their corresponding internal IPs from ifconfig. You will need the internal IPs for configuring Hadoop in the next section.
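If you'd rather not eyeball the ifconfig output, a pattern match can pull the internal address out for you. The snippet below runs against a canned sample line so you can see the idea; on a real machine, pipe ifconfig itself through the same grep (this assumes your internal addresses all start with 172, as noted above):

```shell
# Canned sample of an ifconfig output line (for illustration only).
sample='inet 172.15.37.52  netmask 255.255.240.0  broadcast 172.15.47.255'

# Extract the first 172.x.x.x address. On a real machine, replace the
# echo with:  ifconfig | grep -oE '172\.[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
echo "$sample" | grep -oE '172\.[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
```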
Hadoop itself is not available via the yum command. Instead, on each of your machines, download the Hadoop files using wget, unpack the archive, and store them at /usr/local/hadoop:

# wget https://archive.apache.org/dist/hadoop/core/hadoop-2.10.1/hadoop-2.10.1.tar.gz
# tar xzvf hadoop-2.10.1.tar.gz
# mv hadoop-2.10.1 /usr/local/hadoop
Now configure Hadoop. Open the file /usr/local/hadoop/etc/hadoop/slaves in your favorite editor. Change this file to contain the private IP addresses of all your non-master machines (one per line), and delete any other contents of the file. For example:

1.2.3.4
2.3.4.5
3.4.5.6

Similarly, edit /usr/local/hadoop/etc/hadoop/masters and put the private IP of your master machine in this file. This should be the only line in the file.

Next, edit the configuration files in the /usr/local/hadoop/etc/hadoop directory. For each file below, insert the given XML within the existing <configuration> tags (leaving the XML header intact). Replace the placeholder MASTER-IPs with your actual private master IP.
In core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://MASTER-IP:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/data/hadoop/tmp</value>
</property>

In hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

In yarn-site.xml:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>MASTER-IP</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1536</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>1536</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

In mapred-site.xml (first copy mapred-site.xml.template to create this file, then modify it as shown):

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>512</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>300</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>300</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx200m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx200m</value>
</property>

Finally, in hadoop-env.sh, change the export JAVA_HOME=... line to the following:

export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk/"
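As a sanity check on the memory numbers above: each NodeManager offers 1536 MB of container memory, and YARN's scheduler typically rounds each container request up to a multiple of the 128 MB minimum allocation (that rounding is standard default behavior, not something set explicitly here). So a 300 MB map task actually occupies a 384 MB container, and each node can run about four of them at once:

```shell
NODE_MB=1536   # yarn.nodemanager.resource.memory-mb
MIN_MB=128     # yarn.scheduler.minimum-allocation-mb
REQ_MB=300     # mapreduce.map.memory.mb (one map task's request)

# Round the request up to the next multiple of the minimum allocation.
ROUNDED=$(( (REQ_MB + MIN_MB - 1) / MIN_MB * MIN_MB ))
echo "granted container size: ${ROUNDED} MB"            # 384 MB
echo "map tasks per node:     $(( NODE_MB / ROUNDED ))" # 4
```

Note also that the -Xmx200m heap limit deliberately sits below the 300 MB container request, leaving headroom for JVM overhead.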
Rather than editing all of these configuration files separately on every machine, you can complete the configuration on the master and then copy the finished configuration directory out to each of the other machines (using each machine's private IP):

# scp -r /usr/local/hadoop/etc/hadoop/* 1.2.3.4:/usr/local/hadoop/etc/hadoop/
Now you're ready to start Hadoop. First, on the master machine, format a new HDFS filesystem:

# /usr/local/hadoop/bin/hdfs namenode -format

If the above command works, it will start the NameNode, run for a few seconds, dump a lot of output, and then exit (having formatted the distributed filesystem). Next, still on the master, start HDFS:

# /usr/local/hadoop/sbin/start-dfs.sh

This script will start the NameNode locally, then connect to all of the worker machines and start DataNodes there.
You can check which Hadoop daemons are running using the jps command, which will list all running Java processes on the local machine. Run jps on the master and you should see a NameNode entry. Also check the other machines: running jps as root on any other machine should now show you that a DataNode is running. If you don't see a DataNode running on every machine, then something went wrong previously. Hadoop logfiles are written to /usr/local/hadoop/logs/, which may tell you something about what went wrong.
Next, start YARN (again from the master):

# /usr/local/hadoop/sbin/start-yarn.sh

As before, use jps to make sure this succeeded. On the master, you should see a new ResourceManager (in addition to the previous NameNode). On each of the other nodes, you should see a NodeManager (but no ResourceManager). If everything seems to be running, proceed to the next section.
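For reference, jps output on the master at this point looks something like the following sketch (the process IDs will differ, and you may see additional entries such as a SecondaryNameNode, which start-dfs.sh also launches):

```
4688 NameNode
5023 ResourceManager
5310 Jps
```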
You can interact with the distributed filesystem using the hdfs program. First let's copy our Hadoop configuration directory into HDFS (arbitrarily chosen; note that this may take a little while):

# /usr/local/hadoop/bin/hdfs dfs -put /usr/local/hadoop/etc/hadoop /test

This command copies the Hadoop configuration directory (from our local, non-HDFS filesystem) into HDFS (using the 'put' command) as a directory named test in the root HDFS directory. You might get a bunch of InterruptedException warnings during this like the following, which can be ignored:

22/03/30 18:59:38 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        at java.lang.Thread.join(Thread.java:1326)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)

To check that the copy into HDFS worked, we can use the 'ls' command to view the directory that we just created within HDFS:

# /usr/local/hadoop/bin/hdfs dfs -ls /test

This command should show a listing of the Hadoop configuration files (a copy of which is now living inside HDFS). Other DFS commands that work as you'd expect include 'cat' (view a file), 'cp' (copy a file within HDFS), and 'get' (copy a file from HDFS back into the local filesystem). Also note that since this is a distributed filesystem, it doesn't matter which node you run these commands from -- they're all accessing the same (distributed) filesystem.
Now let's run a sample MapReduce job: a distributed grep that searches the test directory we created in HDFS, looking for text that starts with the three letters dfs, and then stores the output results (in HDFS) in the test-output directory:

# /usr/local/hadoop/bin/hadoop jar \
    /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar \
    grep "/test" "/test-output" "dfs[a-z.]+"

This command will launch the job and print periodic status updates on its progress (note that the web monitor URL won't be accessible due to the firewall settings). You should not see any nasty error messages during the job run if everything is working correctly (apart from possibly the same warnings as before). Your tiny cluster is unlikely to surprise you with its blazing speed; expect the job to take a minute or two to execute.
Once the job finishes, you can view its results, which are stored in HDFS:

# /usr/local/hadoop/bin/hdfs dfs -cat /test-output/*

The output of this example job just includes the matched pieces of text and how many times they appeared in the searched documents. If you'd like to compare to a non-distributed grep (which will also show the entire lines), you can run the following:

# grep -P 'dfs[a-z.]+' /usr/local/hadoop/etc/hadoop/*
To shut down the cluster, stop the daemons (from the master) in the reverse of the order you started them:

# /usr/local/hadoop/sbin/stop-yarn.sh
# /usr/local/hadoop/sbin/stop-dfs.sh

In general, it's fine to leave the cluster running when not running jobs, though if you think you might've broken something, it's a good idea to reboot the cluster by stopping and then restarting the daemons.

If you ever need to completely wipe HDFS and start over, first stop the cluster as above, then delete the HDFS data on every machine:

# rm -rf /mnt/data/hadoop/tmp/dfs

Once you've done that on all your cluster machines, you can return to the beginning of Part 3, starting with formatting a new HDFS filesystem, and proceed from there.