
Hadoop intro

David E Drummond edited this page Mar 4, 2018 · 9 revisions

Introduction

Before we begin, it’s important to roughly understand the three components of Hadoop:

  1. Hadoop Distributed File System (HDFS) is a distributed file system based on Google File System (GFS) that splits files into “blocks” and stores them redundantly on several relatively cheap machines, known as DataNodes. Access to these DataNodes is coordinated by a relatively high-quality machine, known as a NameNode.

  2. Hadoop MapReduce (based on Google MapReduce) is a paradigm for distributed computing that splits computations across several machines. In the Map task, each machine in the cluster applies a user-defined function to a subset of the input data. The output of the Map task is then shuffled around the cluster of machines to be grouped or aggregated in the Reduce task.

  3. YARN (unofficially Yet Another Resource Negotiator) is a feature introduced in Hadoop 2.0 (released in 2013) that manages resources and job scheduling, much like an operating system. This is an important improvement, especially in production environments with multiple applications and users, but we won’t focus on it for now.
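The Map, shuffle, and Reduce phases described in item 2 can be imitated on a single machine with standard Unix tools. The sketch below is only an analogy and does not use Hadoop; `wordcount` is a hypothetical helper name and the input sentence is made up for illustration.

```shell
#!/usr/bin/env bash
# Single-machine analogy of a MapReduce word count (not Hadoop itself).
# wordcount is a hypothetical helper name chosen for this sketch.
wordcount() {
  tr -s '[:space:]' '\n' |    # "map": emit one word (key) per line
    grep -v '^$' |            # drop empty lines left by leading whitespace
    sort |                    # "shuffle": bring identical keys together
    uniq -c |                 # "reduce": count the size of each group
    awk '{ print $2 "\t" $1 }'
}

# Made-up sample input; prints each word with its count, tab-separated.
printf 'the quick fox\nthe lazy dog\n' | wordcount
```

In a real cluster the "map" and "reduce" steps run on different machines and the sort happens across the network, but the data flow is roughly the same.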

Spin up AWS Instances

t2.micro instances should be sufficient to start playing with HDFS and MapReduce. The current setup uses the Ubuntu Server 14.04 LTS (HVM), SSD Volume Type image. Be sure to terminate the instances when you are done so that they do not keep consuming your free AWS credits for the month. For practice, try spinning up 4 nodes, treating one as the namenode (master) and the remaining three as datanodes.

Set up passwordless SSH

The namenode of the Hadoop cluster must be able to communicate with the datanodes without requiring a password.

Copy AWS pem-key from localhost to namenode

localhost$ scp -o "StrictHostKeyChecking no" -i ~/.ssh/personal-aws.pem ~/.ssh/personal-aws.pem ubuntu@namenode-public-dns:~/.ssh

SSH into the namenode

localhost$ ssh -i ~/.ssh/personal-aws.pem ubuntu@namenode-public-dns

Generate an authorization key for the namenode

namenode$ sudo apt-get update
namenode$ sudo apt-get install ssh rsync
namenode$ ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""
namenode$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Make sure you can successfully perform passwordless ssh by entering:
namenode$ ssh localhost

When prompted to add the host, type yes and press Enter. The prompt looks like this:

The authenticity of host 'localhost (127.0.0.1)' can't be established. ECDSA key fingerprint is ... Are you sure you want to continue connecting (yes/no)?

Copy the namenode’s id_rsa.pub to the datanodes’ authorized_keys

  • This must be done for each datanode by changing the datanode-public-dns
namenode$ cat ~/.ssh/id_rsa.pub | ssh -o "StrictHostKeyChecking no" -i ~/.ssh/*.pem ubuntu@datanode-public-dns 'cat >> ~/.ssh/authorized_keys'
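Because the copy step has to be repeated for every datanode, it can be wrapped in a loop. The following is a minimal sketch, assuming three datanodes; the host names in `DATANODES`, the `push_key` helper, and the `DRY_RUN` switch are all hypothetical and not part of the original instructions. By default it only prints what it would do.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: replace with your datanodes' public DNS names.
DATANODES="datanode1-public-dns datanode2-public-dns datanode3-public-dns"
PEM_KEY="$HOME/.ssh/personal-aws.pem"
DRY_RUN="${DRY_RUN:-1}"   # assumption: default to a harmless dry run

# push_key is a hypothetical helper that appends the namenode's public key
# to one datanode's authorized_keys, or just reports it in dry-run mode.
push_key() {
  local host="$1"
  if [ "$DRY_RUN" = "1" ]; then
    echo "would append ~/.ssh/id_rsa.pub to ubuntu@${host}:~/.ssh/authorized_keys"
  else
    ssh -o "StrictHostKeyChecking no" -i "$PEM_KEY" "ubuntu@${host}" \
      'cat >> ~/.ssh/authorized_keys' < "$HOME/.ssh/id_rsa.pub"
  fi
}

for host in $DATANODES; do
  push_key "$host"
done
```

Running it with `DRY_RUN=0` on the namenode would perform the actual copy for each listed host.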

Set up Hadoop cluster

Hadoop will be installed on the namenode and all the datanodes. The namenode and datanodes will be configured after the main installation.

Run the following on the namenode and all datanodes by SSH-ing into each node:

Install java-development-kit
name-or-data-node$ sudo apt-get update
name-or-data-node$ sudo apt-get install openjdk-7-jdk

Install Hadoop
name-or-data-node$ wget http://mirror.symnds.com/software/Apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz -P ~/Downloads
name-or-data-node$ sudo tar zxvf ~/Downloads/hadoop-*.tar.gz -C /usr/local
name-or-data-node$ sudo mv /usr/local/hadoop-* /usr/local/hadoop

Set the JAVA_HOME and HADOOP_HOME environment variables and add to PATH in .profile
name-or-data-node$ nano ~/.profile
Add the following to the end of the file

export JAVA_HOME=/usr
export PATH=$PATH:$JAVA_HOME/bin

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

name-or-data-node$ . ~/.profile
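If you prefer not to edit the file by hand on every node, the same four lines can be appended from a script. This is a sketch under the assumption that re-running it should not duplicate lines; `append_once` is a hypothetical helper, and the demo writes to a temporary file rather than your real `~/.profile`.

```shell
#!/usr/bin/env bash
# append_once is a hypothetical helper: add a line to a file only if that
# exact line is not already present, so re-running the setup stays harmless.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstrated on a throwaway file; pass "$HOME/.profile" on a real node.
profile="$(mktemp)"
for pass in 1 2; do   # run twice to show the second pass adds nothing
  # Single quotes keep $PATH and friends literal, as they must be in .profile.
  append_once 'export JAVA_HOME=/usr' "$profile"
  append_once 'export PATH=$PATH:$JAVA_HOME/bin' "$profile"
  append_once 'export HADOOP_HOME=/usr/local/hadoop' "$profile"
  append_once 'export PATH=$PATH:$HADOOP_HOME/bin' "$profile"
done
cat "$profile"
```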

Set the Java path for hadoop-env.sh
name-or-data-node$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Find the line with export JAVA_HOME=${JAVA_HOME} and replace the ${JAVA_HOME} with /usr

e.g.
export JAVA_HOME=/usr
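The same replacement can be scripted with sed instead of an editor. The sketch below assumes the file contains a line starting with `export JAVA_HOME=`; `set_java_home` is a hypothetical helper, and the demo operates on a temporary stand-in file, not the real hadoop-env.sh.

```shell
#!/usr/bin/env bash
# set_java_home is a hypothetical helper that rewrites the JAVA_HOME export
# in a hadoop-env.sh style file. It goes through a temp file instead of
# "sed -i" to avoid GNU/BSD differences, and writes back with cat so the
# original file keeps its permissions.
set_java_home() {
  local file="$1" java_home="$2" tmp
  tmp="$(mktemp)"
  sed "s|^export JAVA_HOME=.*|export JAVA_HOME=${java_home}|" "$file" > "$tmp" &&
    cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Demonstrated on a throwaway file containing the line the text describes.
env_file="$(mktemp)"
printf '%s\n' '# The java implementation to use.' 'export JAVA_HOME=${JAVA_HOME}' > "$env_file"
set_java_home "$env_file" /usr
cat "$env_file"
```

On a real node the target would be `$HADOOP_HOME/etc/hadoop/hadoop-env.sh`, which at this stage probably needs sudo to modify.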

Configure core-site.xml
name-or-data-node$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Scroll down to the <configuration> tag and add the following between the <configuration> tags

<property> 
   <name>fs.defaultFS</name>
   <value>hdfs://namenode-public-dns:9000</value>
</property> 

Be sure to also change the namenode-public-dns to your namenode's AWS public DNS
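As a rough illustration of what the finished file could look like, the sketch below prints a complete core-site.xml for a given namenode host. `render_core_site` is a hypothetical helper; the port 9000 and the property name come from the snippet above, and the host name is the example public DNS used elsewhere on this page.

```shell
#!/usr/bin/env bash
# render_core_site is a hypothetical helper that prints a complete
# core-site.xml for the given namenode host; port 9000 comes from the text.
render_core_site() {
  local namenode_host="$1"
  cat <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://${namenode_host}:9000</value>
  </property>
</configuration>
EOF
}

# Example host name only; substitute your own namenode's public DNS.
render_core_site ec2-52-35-169-183.us-west-2.compute.amazonaws.com
```

Note that redirecting this output over the shipped core-site.xml would replace the whole file, including its license header, so editing in place as described above is the more conservative route.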

Configure yarn-site.xml
name-or-data-node$ sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Scroll down to the <configuration> tag and add the following between the <configuration> tags

<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
   
<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
   <name>yarn.resourcemanager.resource-tracker.address</name>
   <value>namenode-public-dns:8025</value>
</property>

<property>
   <name>yarn.resourcemanager.scheduler.address</name>
   <value>namenode-public-dns:8030</value>
</property>

<property>
   <name>yarn.resourcemanager.address</name>
   <value>namenode-public-dns:8050</value>
</property>

Be sure to also change the namenode-public-dns to your namenode's AWS public DNS
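Since the placeholder appears several times, one way to avoid missing an occurrence is to substitute it with sed. This is a sketch, assuming you pasted the snippets with the literal text `namenode-public-dns` still in them; `fill_placeholder` is a hypothetical helper and the demo runs against a temporary directory rather than `$HADOOP_HOME/etc/hadoop`.

```shell
#!/usr/bin/env bash
# fill_placeholder is a hypothetical helper: in every *.xml file under a
# directory, replace the literal placeholder namenode-public-dns with a host.
fill_placeholder() {
  local conf_dir="$1" host="$2" file tmp
  for file in "$conf_dir"/*.xml; do
    [ -f "$file" ] || continue
    tmp="$(mktemp)"
    sed "s|namenode-public-dns|${host}|g" "$file" > "$tmp" &&
      cat "$tmp" > "$file"   # write back in place, keeping file permissions
    rm -f "$tmp"
  done
}

# Demonstrated on a throwaway directory with one value from yarn-site.xml.
conf_dir="$(mktemp -d)"
printf '%s\n' '<value>namenode-public-dns:8025</value>' > "$conf_dir/yarn-site.xml"
fill_placeholder "$conf_dir" ec2-52-35-169-183.us-west-2.compute.amazonaws.com
cat "$conf_dir/yarn-site.xml"
```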

Configure mapred-site.xml
First make a mapred-site.xml by copying the mapred-site.xml.template file
name-or-data-node$ sudo cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
name-or-data-node$ sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Scroll down to the <configuration> tag and add the following between the <configuration> tags

<property>
   <name>mapreduce.jobtracker.address</name>
   <value>namenode-public-dns:54311</value>
</property>

<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>

Be sure to also change the namenode-public-dns to your namenode's AWS public DNS

Configure Namenode

SSH into the namenode

localhost$ ssh -i ~/.ssh/personal-aws.pem ubuntu@namenode-public-dns

Add to hosts

namenode$ sudo nano /etc/hosts

Add the namenode and every datanode to this file, one line per node, in the format of the public DNS followed by the node name (the first part of the private DNS). If you have 4 nodes in your Hadoop cluster, you will add 4 lines.

For example, given a node with public-dns ec2-52-35-169-183.us-west-2.compute.amazonaws.com and private-dns ip-172-31-41-149.us-west-2.compute.internal, add the following line to the /etc/hosts file on the namenode:

ec2-52-35-169-183.us-west-2.compute.amazonaws.com ip-172-31-41-149
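The node name is simply the private DNS with everything after the first dot removed, which shell parameter expansion can do. The sketch below follows this page's line format; `hosts_line` is a hypothetical helper and the DNS names are the example values from above.

```shell
#!/usr/bin/env bash
# hosts_line is a hypothetical helper that builds one line in the format
# used on this page: the public DNS, then the short private node name.
hosts_line() {
  local public_dns="$1" private_dns="$2"
  # ${private_dns%%.*} strips the longest suffix starting at the first dot.
  printf '%s %s\n' "$public_dns" "${private_dns%%.*}"
}

hosts_line ec2-52-35-169-183.us-west-2.compute.amazonaws.com \
           ip-172-31-41-149.us-west-2.compute.internal
# prints: ec2-52-35-169-183.us-west-2.compute.amazonaws.com ip-172-31-41-149
```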

Configure hdfs-site.xml

namenode$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Scroll down to the <configuration> tag and add the following between the <configuration> tags

<property>
   <name>dfs.replication</name>
   <value>3</value>
</property>

<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>

namenode$ sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode

Change ownership of the newly created directory in the Hadoop folder
namenode$ sudo chown -R ubuntu $HADOOP_HOME

Add namenode and datanode names to masters and slaves files

namenode$ sudo nano $HADOOP_HOME/etc/hadoop/masters

Add the first part of the namenode’s private DNS

e.g. private-dns: ip-172-31-41-148.us-west-2.compute.internal

ip-172-31-41-148

namenode$ sudo nano $HADOOP_HOME/etc/hadoop/slaves
Delete the localhost and replace with the datanode private DNS names


e.g. If you have 3 datanodes with
private-dns-1: ip-172-31-41-149.us-west-2.compute.internal
private-dns-2: ip-172-31-41-150.us-west-2.compute.internal
private-dns-3: ip-172-31-41-151.us-west-2.compute.internal

you'll add the following:

ip-172-31-41-149
ip-172-31-41-150
ip-172-31-41-151
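The slaves file can also be generated from the full private DNS names instead of being typed by hand. This is a minimal sketch; `write_slaves` is a hypothetical helper, and the demo writes to a temporary file standing in for `$HADOOP_HOME/etc/hadoop/slaves`.

```shell
#!/usr/bin/env bash
# write_slaves is a hypothetical helper: overwrite the given file with the
# short node name (text before the first dot) of each private DNS name.
write_slaves() {
  local slaves_file="$1" dns
  shift
  : > "$slaves_file"   # truncate, which also drops the default "localhost"
  for dns in "$@"; do
    printf '%s\n' "${dns%%.*}" >> "$slaves_file"
  done
}

# Demonstrated on a throwaway file that starts out like the shipped default.
slaves_file="$(mktemp)"
echo localhost > "$slaves_file"
write_slaves "$slaves_file" \
  ip-172-31-41-149.us-west-2.compute.internal \
  ip-172-31-41-150.us-west-2.compute.internal \
  ip-172-31-41-151.us-west-2.compute.internal
cat "$slaves_file"
```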

Configure Datanodes

This must be done on each datanode you have in your Hadoop cluster

SSH into the datanodes

localhost$ ssh -i ~/.ssh/personal-aws.pem ubuntu@datanode-public-dns

Configure hdfs-site.xml

datanode$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Scroll down to the <configuration> tag and add the following between the <configuration> tags

<property>
   <name>dfs.replication</name>
   <value>3</value>
</property>

<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>

datanode$ sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode

Change ownership of the newly created directory in the Hadoop folder
datanode$ sudo chown -R ubuntu $HADOOP_HOME

Start Hadoop services

SSH into the namenode

localhost$ ssh -i ~/.ssh/personal-aws.pem ubuntu@namenode-public-dns

Start HDFS

namenode$ hdfs namenode -format
namenode$ $HADOOP_HOME/sbin/start-dfs.sh

When asked “The authenticity of host ‘Some Node’ can’t be established. Are you sure you want to continue connecting (yes/no)?”, type yes and press Enter. Even if no new prompt appears, you may need to do this several times: keep typing yes and pressing Enter, since this is the first time Hadoop logs into each of the datanodes.

You can go to namenode-public-dns:50070 in your browser to check if all datanodes are online. If the webUI does not display, check to make sure your EC2 instances have security group settings that include All Traffic and not just SSH.

Start YARN

namenode$ $HADOOP_HOME/sbin/start-yarn.sh

Start Job History Server

namenode$ $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

You can check if all the processes are running on the namenode by running

namenode$ jps
10595 NameNode
10231 JobHistoryServer
11193 Jps
10944 ResourceManager
10810 SecondaryNameNode
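Rather than reading the list by eye, you could check it with a small script. The sketch below assumes the four daemons shown above are the ones you expect on the namenode; `missing_daemons` is a hypothetical helper, and the demo feeds it a made-up, incomplete listing instead of calling jps.

```shell
#!/usr/bin/env bash
# missing_daemons is a hypothetical helper: it reads jps output on stdin and
# prints each expected daemon name that does not appear in it.
missing_daemons() {
  local jps_output name
  jps_output="$(cat)"
  for name in NameNode SecondaryNameNode ResourceManager JobHistoryServer; do
    # Compare against the process-name column exactly, ignoring letter case,
    # so "NameNode" is not satisfied by "SecondaryNameNode".
    printf '%s\n' "$jps_output" | awk '{ print tolower($2) }' |
      grep -qxF -- "$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')" ||
      echo "$name"
  done
}

# On a live namenode this would be: jps | missing_daemons
# Made-up listing with two daemons absent, to show the report:
printf '%s\n' '10595 NameNode' '11193 Jps' '10944 ResourceManager' | missing_daemons
```

An empty result means every expected daemon was found.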

Common HDFS commands

Example of working with HDFS

# create local dummy file to place on HDFS
namenode$ echo 'Hello this will be my first distributed and fault-tolerant data set!' >> ~/my_file.txt

# list directories from top level of HDFS
namenode$ hdfs dfs -ls /
… should display nothing but the /tmp directory

# create /user directory on HDFS
namenode$ hdfs dfs -mkdir /user
namenode$ hdfs dfs -ls /

Found 2 items
drwxrwx---   - ubuntu supergroup          0 2015-09-13 19:25 /tmp
drwxr-xr-x   - ubuntu supergroup          0 2015-09-13 19:28 /user

# copy local file a few times onto HDFS
namenode$ hdfs dfs -copyFromLocal ~/my_file.txt /user
namenode$ hdfs dfs -copyFromLocal ~/my_file.txt /user/my_file2.txt
namenode$ hdfs dfs -copyFromLocal ~/my_file.txt /user/my_file3.txt

# list files in /user directory
namenode$ hdfs dfs -ls /user

Found 3 items
-rw-r--r--   3 ubuntu supergroup       50 2015-05-06 22:43 /user/my_file.txt
-rw-r--r--   3 ubuntu supergroup       50 2015-05-06 22:43 /user/my_file2.txt
-rw-r--r--   3 ubuntu supergroup       50 2015-05-06 22:43 /user/my_file3.txt

# clear all data and folders on HDFS
namenode$ hdfs dfs -rm /user/my_file*

15/05/06 22:49:06 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/my_file.txt
15/05/06 22:49:06 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/my_file2.txt
15/05/06 22:49:06 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/my_file3.txt

namenode$ hdfs dfs -rmdir /user
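In the `hdfs dfs -ls /user` listing above, the second column (3) is each file's replication factor and the fifth column (50) is its size in bytes. A rough estimate of the raw disk space the three files occupy across the datanodes is size times replication, ignoring metadata and checksum overhead. `raw_bytes` below is a hypothetical helper for that arithmetic.

```shell
#!/usr/bin/env bash
# raw_bytes is a hypothetical helper: approximate raw storage used across
# the cluster as (number of files) x (file size) x (replication factor).
# It ignores block metadata and checksum overhead.
raw_bytes() {
  local file_count="$1" file_size="$2" replication="$3"
  echo $(( file_count * file_size * replication ))
}

# Values taken from the listing: 3 files of 50 bytes, replication factor 3.
raw_bytes 3 50 3   # prints 450
```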