Hadoop intro
Before we begin, it’s important to roughly understand the three components of Hadoop:
- Hadoop Distributed File System (HDFS) is a distributed file system, based on the Google File System (GFS), that splits files into “blocks” and stores them redundantly on several relatively cheap machines, known as DataNodes. Access to these DataNodes is coordinated by a relatively high-quality machine, known as a NameNode.
- Hadoop MapReduce (based on Google MapReduce) is a paradigm for distributed computing that splits computations across several machines. In the Map task, each machine in the cluster applies a user-defined function to a subset of the input data. The output of the Map task is then shuffled around the cluster so that it can be grouped or aggregated in the Reduce task (see the single-machine analogy sketched just after this list).
- YARN (unofficially Yet Another Resource Negotiator) is a feature introduced in Hadoop 2.0 (released in 2013) that manages resources and job scheduling, much like an operating system. This is an important improvement, especially in production clusters with multiple applications and users, but we won’t focus on it for now.
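To build some intuition for the Map, shuffle, and Reduce phases, here is a rough single-machine analogy using standard Unix tools rather than Hadoop itself (words.txt is just a hypothetical local text file):
# Map: split each line of words.txt into one word per line
# Shuffle: sort brings identical words next to each other
# Reduce: uniq -c counts each group of identical words
localhost$ tr -s ' ' '\n' < words.txt | sort | uniq -c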
t2.micro instances should be sufficient to start playing with HDFS and MapReduce. The setup described here uses the Ubuntu Server 14.04 LTS (HVM), SSD Volume Type image. Be sure to terminate the instances when you are done so you don’t use up your free AWS credits for the month. For practice, try spinning up 4 nodes, treating one as the namenode (master) and the remaining three as datanodes.
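If you prefer the command line to the AWS console, something like the following AWS CLI call can launch the 4 instances; the AMI ID, key pair name, and security group ID below are placeholders you would replace with your own values:
# launch 4 t2.micro instances (ami-xxxxxxxx, my-keypair and sg-xxxxxxxx are placeholders)
localhost$ aws ec2 run-instances --image-id ami-xxxxxxxx --count 4 --instance-type t2.micro --key-name my-keypair --security-group-ids sg-xxxxxxxx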
The namenode of the Hadoop cluster must be able to communicate with the datanodes without requiring a password.
- Replace ~/.ssh/personal-aws.pem with your AWS account's .pem key
- Replace namenode-public-dns with your namenode's public DNS (e.g. ubuntu@ec2-51-141-153-211.us-west-2.compute.amazonaws.com)
localhost$ scp -o "StrictHostKeyChecking no" -i ~/.ssh/personal-aws.pem ~/.ssh/personal-aws.pem ubuntu@namenode-public-dns:~/.ssh
localhost$ ssh -i ~/.ssh/personal-aws.pem ubuntu@namenode-public-dns
namenode$ sudo apt-get update
namenode$ sudo apt-get install ssh rsync
namenode$ ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""
namenode$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Make sure you can successfully perform passwordless ssh by entering:
namenode$ ssh localhost
When prompted to add the host, type yes and then press enter:
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)?
- This must be done for each datanode, changing datanode-public-dns each time
namenode$ cat ~/.ssh/id_rsa.pub | ssh -o "StrictHostKeyChecking no" -i ~/.ssh/*.pem ubuntu@datanode-public-dns 'cat >> ~/.ssh/authorized_keys'
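You can verify that the key was copied correctly by ssh-ing from the namenode to a datanode; it should log in without prompting for a password (again, replace datanode-public-dns with the datanode's actual public DNS):
# should log in without asking for a password
namenode$ ssh ubuntu@datanode-public-dns
datanode$ exit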
Hadoop will be installed on the namenode and all the datanodes. The namenode and datanodes will be configured after the main installation.
Run the following on the namenode and all datanodes by SSH-ing into each node:
Install the Java Development Kit:
name-or-data-node$ sudo apt-get update
name-or-data-node$ sudo apt-get install openjdk-7-jdk
Install Hadoop:
name-or-data-node$ wget http://mirror.symnds.com/software/Apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz -P ~/Downloads
name-or-data-node$ sudo tar zxvf ~/Downloads/hadoop-*.tar.gz -C /usr/local
name-or-data-node$ sudo mv /usr/local/hadoop-* /usr/local/hadoop
Set the JAVA_HOME and HADOOP_HOME environment variables and add them to PATH in .profile:
name-or-data-node$ nano ~/.profile
Add the following to the end of the file:
export JAVA_HOME=/usr
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
name-or-data-node$ . ~/.profile
Set the Java path in hadoop-env.sh:
name-or-data-node$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Find the line with export JAVA_HOME=${JAVA_HOME} and replace ${JAVA_HOME} with /usr, e.g.
export JAVA_HOME=/usr
Configure core-site.xml:
name-or-data-node$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
Scroll down to the <configuration> tag and add the following between the <configuration> tags:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-public-dns:9000</value>
</property>
Be sure to change namenode-public-dns to your namenode's AWS public DNS.
Configure yarn-site.xml:
name-or-data-node$ sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Scroll down to the <configuration> tag and add the following between the <configuration> tags:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>namenode-public-dns:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>namenode-public-dns:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>namenode-public-dns:8050</value>
</property>
Be sure to change namenode-public-dns to your namenode's AWS public DNS.
Configure mapred-site.xml. First create mapred-site.xml by copying the mapred-site.xml.template file:
name-or-data-node$ sudo cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
name-or-data-node$ sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Scroll down to the <configuration> tag and add the following between the <configuration> tags:
<property>
  <name>mapreduce.jobtracker.address</name>
  <value>namenode-public-dns:54311</value>
</property>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Be sure to change namenode-public-dns to your namenode's AWS public DNS.
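As a quick sanity check after the installation (not a required step, just a convenience), you can confirm on each node that the environment variables are set and the Hadoop binary is on the PATH:
# verify the environment variables and the Hadoop installation
name-or-data-node$ echo $JAVA_HOME $HADOOP_HOME
name-or-data-node$ java -version
name-or-data-node$ hadoop version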
localhost$ ssh -i ~/.ssh/personal-aws.pem ubuntu@namenode-public-dns
namenode$ sudo nano /etc/hosts
Each node (the namenode and every datanode) will be added to this file in the format of its public DNS followed by its hostname (the first part of its private DNS). If you have 4 nodes in your Hadoop cluster, then 4 extra lines will be added to the file. Refer to the example:
e.g. if I have a node with
public-dns: ec2-52-35-169-183.us-west-2.compute.amazonaws.com
private-dns: ip-172-31-41-149.us-west-2.compute.internal
then the following should be added to the /etc/hosts file on the namenode:
ec2-52-35-169-183.us-west-2.compute.amazonaws.com ip-172-31-41-149
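For a 4-node cluster, the added block in /etc/hosts would look something like the following; the public DNS names here are made-up placeholders, and the hostnames match the example private DNS names used later in this guide:
# appended to /etc/hosts on the namenode (placeholder addresses)
ec2-52-35-169-183.us-west-2.compute.amazonaws.com ip-172-31-41-148
ec2-52-35-169-184.us-west-2.compute.amazonaws.com ip-172-31-41-149
ec2-52-35-169-185.us-west-2.compute.amazonaws.com ip-172-31-41-150
ec2-52-35-169-186.us-west-2.compute.amazonaws.com ip-172-31-41-151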
namenode$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Scroll down to the <configuration> tag and add the following between the <configuration> tags:
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
Create the directory that will hold the NameNode metadata:
namenode$ sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode
Change ownership of the Hadoop folder, including the newly created directory:
namenode$ sudo chown -R ubuntu $HADOOP_HOME
namenode$ sudo nano $HADOOP_HOME/etc/hadoop/masters
Add the first part of the namenode’s private DNS, e.g. for
private-dns: ip-172-31-41-148.us-west-2.compute.internal
add:
ip-172-31-41-148
namenode$ sudo nano $HADOOP_HOME/etc/hadoop/slaves
Delete localhost and replace it with the first part of each datanode's private DNS, e.g. if you have 3 datanodes with
private-dns-1: ip-172-31-41-149.us-west-2.compute.internal
private-dns-2: ip-172-31-41-150.us-west-2.compute.internal
private-dns-3: ip-172-31-41-151.us-west-2.compute.internal
you'll add the following:
ip-172-31-41-149
ip-172-31-41-150
ip-172-31-41-151
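If you want to double-check the result without reopening the editor (just a convenience, not a required step), print both files; with the example private DNS names above you would see:
namenode$ cat $HADOOP_HOME/etc/hadoop/masters
ip-172-31-41-148
namenode$ cat $HADOOP_HOME/etc/hadoop/slaves
ip-172-31-41-149
ip-172-31-41-150
ip-172-31-41-151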
This must be done on each datanode you have in your Hadoop cluster
localhost$ ssh -i ~/.ssh/personal-aws.pem ubuntu@datanode-public-dns
datanode$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Scroll down to the <configuration> tag and add the following between the <configuration> tags:
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
Create the directory that will hold the HDFS blocks:
datanode$ sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode
Change ownership of the Hadoop folder, including the newly created directory:
datanode$ sudo chown -R ubuntu $HADOOP_HOME
localhost$ ssh -i ~/.ssh/personal-aws.pem ubuntu@namenode-public-dns
namenode$ hdfs namenode -format
namenode$ $HADOOP_HOME/sbin/start-dfs.sh
When asked “The authenticity of host ‘Some Node’ can’t be established. Are you sure you want to continue connecting (yes/no)?”, type yes and press enter. You may need to do this several times, even if no new prompt appears - keep typing yes and pressing enter, since this is the first time Hadoop logs into each of the datanodes.
You can go to namenode-public-dns:50070 in your browser to check whether all datanodes are online. If the web UI does not display, make sure your EC2 instances have security group settings that allow All Traffic and not just SSH.
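You can also check the datanodes from the command line; hdfs dfsadmin -report prints the live datanodes along with their capacity and usage:
# list live datanodes and their storage usage
namenode$ hdfs dfsadmin -report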
namenode$ $HADOOP_HOME/sbin/start-yarn.sh
namenode$ $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
You can check if all the processes are running on the namenode by running
namenode$ jps
10595 NameNode
10231 JobHistoryServer
11193 Jps
10944 ResourceManager
10810 SecondaryNameNode
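Similarly, on each datanode you should see a DataNode and a NodeManager process (the process IDs shown here are just illustrative and will differ on your machines):
datanode$ jps
4189 DataNode
4314 NodeManager
4512 Jps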
# create local dummy file to place on HDFS
namenode$ echo 'Hello this will be my first distributed and fault-tolerant data set!' > my_file.txt
# list directories from top level of HDFS
namenode$ hdfs dfs -ls /
... this should display nothing but the /tmp directory
# create /user directory on HDFS
namenode$ hdfs dfs -mkdir /user
namenode$ hdfs dfs -ls /
Found 2 items
drwxrwx---   - ubuntu supergroup          0 2015-09-13 19:25 /tmp
drwxr-xr-x   - ubuntu supergroup          0 2015-09-13 19:28 /user
# copy local file a few times onto HDFS
namenode$ hdfs dfs -copyFromLocal ~/my_file.txt /user
namenode$ hdfs dfs -copyFromLocal ~/my_file.txt /user/my_file2.txt
namenode$ hdfs dfs -copyFromLocal ~/my_file.txt /user/my_file3.txt
# list files in /user directory
namenode$ hdfs dfs -ls /user
Found 3 items
-rw-r--r--   3 ubuntu supergroup         50 2015-05-06 22:43 /user/my_file.txt
-rw-r--r--   3 ubuntu supergroup         50 2015-05-06 22:43 /user/my_file2.txt
-rw-r--r--   3 ubuntu supergroup         50 2015-05-06 22:43 /user/my_file3.txt
# clear all data and folders on HDFS
namenode$ hdfs dfs -rm /user/my_file*
15/05/06 22:49:06 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/my_file.txt
15/05/06 22:49:06 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/my_file2.txt
15/05/06 22:49:06 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/my_file3.txt
namenode$ hdfs dfs -rmdir /user
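To confirm that YARN and MapReduce work end to end, you can run one of the example jobs that ships with the Hadoop distribution; the pi estimator below needs no input data (the jar path assumes the Hadoop 2.7.1 tarball installed above):
# run the bundled pi estimator: 10 map tasks, 100 samples each
namenode$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 10 100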
Find out more about the Insight Data Engineering Fellows Program in New York and Silicon Valley, apply today, or sign up for program updates.
You can also read our engineering blog here.