
Installing Cloudera CDH3 on Ubuntu 12.04


This info may be old, since cm4 may have supported cdh3 more directly at some point. In any case, cdh3 is deprecated.


Can't seem to select packages any more with cm4. Can't keep it from trying to install hue-common, which depends on python2.6, which it doesn't want to install because of ubuntu-desktop? Trying this to install python2.6:

Run the command below to add the PPA to your repositories

sudo add-apt-repository ppa:deadsnakes

Then you can update your package lists and install Python 2.6

sudo apt-get update

sudo apt-get install python2.6

python2.6-dev is available also
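
You can sanity-check the install with something like:

python2.6 --version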


This is a bit of a Frankenstein since there isn't an official Cloudera release for Ubuntu 12.04 LTS. There is a "Maverick" release (three Ubuntu versions before the 12.04 "Precise" release) that seems to work.

You don't want to use the automated install through Cloudera Manager, for at least two reasons. It doesn't seem to give you the selectivity to limit the downloading; it tries to download all packages. It also installs j2sdk1.6-oracle, which Cloudera prefers, but that takes a long time. They say anything above 1.6 is okay, but they install oracle 1.6 regardless of what you have on the machine. I run with java-6-oracle and java-7-oracle as installed from oracle via an excellent ubuntu PPA (see elsewhere).

This assumes you know how to run cloudera manager through the browser after everything is installed and the cloudera manager server, plus cloudera manager agent on all nodes, is running.

There are many Cloudera web pages that address these issues. These notes are very brief, for redoing the install on 0xdata or home machines. Make sure any previous hadoop, cloudera, or mapr hadoop packages are fully uninstalled before doing this install; I don't know if multiple hadoop versions on a box work. The CDH4 Manager does offer CDH3 as an install choice, but it forces a full hadoop install, including hue. Hue requires python 2.6, so on machines with python 2.7 installed, it won't complete the install, which fails the whole CDH3 install.

Here's an important bit of knowledge: the editing of config.ini for the scm agents (below) is critical to the manager's list of "managed hosts". If you get that set up right, you can then use the manager upgrade wizard through the browser to validate/complete the install. It will tell you it assumes you have the right packages installed on each host, which you've done below (not including hue, for instance). Pretty much you can do everything mentioned below, then run the browser upgrade wizard, pick CDH3, and choose "install from packages". It will tell you that it assumes you installed everything.

This avoids the "full hadoop install with hue" problem. (Select "Custom Services" when you get to the hadoop install in the browser, i.e. :7180/cmf/express-wizard/services?clusterId=1)

Apache Hadoop 0.20.2 and Cloudera cdh3u6 are not the same

If you type "hadoop version" at the command line after installing cdh3u6, you'll see version numbers that say 0.20.2
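
For example, on a cdh3u6 node it reports something roughly like this (exact build details will differ):

hadoop version
    Hadoop 0.20.2-cdh3u6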

But Cloudera modified some things. They added security methods; notably, the ipc server version number got bumped. You'll see ipc server version mismatches in the cdh3u5 logs when h2o talks to it, if you try to use apache 0.20.2 jars in h2o to talk to a cdh3u5 cluster.

The right h2o hdfs_version is now called "cdh3" (it used to be cdh3u5)

Working on all nodes at once.

Most of these things want to be the same on all nodes (except only one gets the cloudera manager server)

Use a tabbed ssh tool like 'clusterssh' to do them all at the same time (including 'vi')
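
For example (the hostnames are just placeholders for your nodes):

sudo apt-get install clusterssh
cssh mr-0x1 mr-0x2 mr-0x3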

Resolving unmet dependencies

It's good to start out with a clean apt-get state, i.e. no unmet dependencies.

If you see "not upgraded" after apt-get install like this:

0 upgraded, 0 newly installed, 0 to remove and 8 not upgraded.

you can fix with a couple of the things mentioned here: http://askubuntu.com/questions/140246/how-do-i-resolve-unmet-dependencies

but it's probably faster to install 'aptitude' and use it

sudo apt-get install aptitude
sudo aptitude update && sudo aptitude full-upgrade

Java

CDH3 looks for java in /usr/lib/jvm. If you have something that matches this check, you're fine. Otherwise you have to add an env variable to point to your java, separately for the cloudera server and agent.

Per https://ccp.cloudera.com/display/ENT41DOC/Using+Custom+Java+Home+Locations: to see how Cloudera Manager chooses a default JDK, review the contents of /usr/lib64/cmf/service/common/cloudera-config.sh.

if [ -z "$JAVA_HOME" ]; then
  for candidate in \
    /usr/lib/jvm/java-6-sun \
    /usr/lib/jvm/java-1.6.0-sun-1.6.0.*/jre/ \
    /usr/lib/jvm/java-1.6.0-sun-1.6.0.* \
    /usr/lib/jvm/j2sdk1.6-oracle/jre \
    /usr/lib/jvm/j2sdk1.6-oracle \
    /usr/lib/j2sdk1.6-sun \
    /usr/java/jdk1.6* \
    /usr/java/jre1.6* \
    /Library/Java/Home \
    /usr/java/default \
    /usr/lib/jvm/default-java \
    /usr/lib/jvm/java-openjdk \
    /usr/lib/jvm/jre-openjdk \
    /usr/lib/jvm/java-1.6.0-openjdk-1.6.* \
    /usr/lib/jvm/jre-1.6.0-openjdk* ; do
    if [ -e $candidate/bin/java ]; then
      export JAVA_HOME=$candidate
      break
    fi
  done
fi

If you installed java with the excellent java-6-installer or java-7-installer PPA, your /usr/lib/jvm/ dir will not have a name that matches the candidates above. You can cp -r the dir to a name that does match (e.g. 'default-java') and it will work, but it's better to put the path names in the config files as described below.
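
For example, assuming the PPA's java-6-oracle dir (the copy target just has to match one of the candidate paths in the script above):

sudo cp -r /usr/lib/jvm/java-6-oracle /usr/lib/jvm/default-java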

If you don't have any java, you can install it as noted elsewhere using the PPA; it will handle updates for you then. Per http://www.webupd8.org/2012/01/install-oracle-java-jdk-7-in-ubuntu-via.html

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
or
sudo apt-get install oracle-java6-installer

Even if you set JAVA_HOME in /etc/profile (or in .bashrc for root) to point under /usr/lib/jvm/ after this, that won't be enough for the CDH manager server/agent. See the instructions below to fix that. I still update /etc/profile with

export JAVA_HOME=<jdk-install-dir>
export PATH=$JAVA_HOME/bin:$PATH

libzip1

This is needed by CDH3, but was removed from Ubuntu 12.04:

wget http://launchpadlibrarian.net/48191694/libzip1_0.9.3-1_amd64.deb
sudo dpkg -i libzip1_0.9.3-1_amd64.deb
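
You can confirm it installed with:

dpkg -l libzip1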

Getting the CDH3 packages

Do this on all nodes. Using a tabbed ssh tool like 'clusterssh' makes this easy

This should work, but you'd have to add the repo for the cloudera manager too (the latest)

sudo add-apt-repository "deb http://archive.cloudera.com/debian maverick-cdh3u6 contrib"

But I like to just add the two files to /etc/apt/sources.list.d per

https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation

cd /etc/apt/sources.list.d
sudo vi cloudera.list
   deb http://archive.cloudera.com/debian maverick-cdh3u6 contrib
   # don't need source
   # deb-src http://archive.cloudera.com/debian maverick-cdh3u6 contrib


sudo vi cloudera-cm4.list
   # Packages for Cloudera's Distribution for Hadoop, Version 4, on Ubuntu 12.04 x86_64
   deb [arch=amd64] http://archive.cloudera.com/cm4/ubuntu/precise/amd64/cm precise-cm4 contrib
   # don't need source
   # deb-src http://archive.cloudera.com/cm4/ubuntu/precise/amd64/cm precise-cm4 contrib

Update: just use cloudera.list and specify a particular cm4 release to make sure hue packages aren't installed. Note that arch=amd64 is required for the cm4 repo.

cd /etc/apt/sources.list.d
sudo vi cloudera.list
    deb http://archive.cloudera.com/debian maverick-cdh3u6 contrib
    deb [arch=amd64] http://archive.cloudera.com/cm4/ubuntu/precise/amd64/cm precise-cm4.0.1 contrib

Add the GPG key for confirming good downloads:

sudo apt-get install curl
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
sudo apt-get update
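
To double-check that both repos are visible before installing, something like this should show candidate versions from archive.cloudera.com:

apt-cache policy hadoop-0.20 cloudera-manager-agent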

Now just get these packages.

 sudo apt-get install hadoop-0.20 hadoop-0.20-native hadoop-0.20-datanode hadoop-0.20-namenode hadoop-0.20-secondarynamenode

WARNING: apparently CDH3 uses mapred to do some group permission authorization work

So you need to install these and get mapred up as a service on top of hdfs, even if you don't plan on using mapred (apparently you get weird stack traces when H2O does an access if you don't). You will also need the log4j jars and the guava jar, in addition to the core hadoop jar from cdh3u6, for h2o. H2O handles that for you.

 sudo apt-get install hadoop-0.20-jobtracker hadoop-0.20-tasktracker

On all nodes

apt-get install cloudera-manager-agent cloudera-manager-daemons

On the cloudera manager server you'll point your browser to :7180

You need to install the little postgresql db it uses

apt-get install cloudera-manager-server (on one machine only)
apt-get install cloudera-manager-server-db

sudo service cloudera-scm-server-db initdb

If you get this error:

/usr/share/cmf/bin/initialize_embedded_db.sh line 178:
cd: /var/run/cloudera-scm-server

just continue and run the server-db start below, then stop it, and redo the initdb and start. You need both of these service commands in all cases.

sudo service cloudera-scm-server-db start

Starting the Cloudera Manager Server

Make sure you point to java home

sudo vi /etc/default/cloudera-scm-server

Set the JAVA_HOME environment variable to the java home in your environment. For example, you might modify the file to include the following line:

export JAVA_HOME=/usr/lib/jvm/java-6-oracle

then

sudo service cloudera-scm-server start
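
If the server doesn't come up, check its log (this is the usual location):

sudo tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log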

To start the Cloudera Manager Agent

Modifying CMF_AGENT_JAVA_HOME

sudo vi /etc/default/cloudera-scm-agent

Set the CMF_AGENT_JAVA_HOME environment variable to the java home in your environment. For example, you might modify the file to include the following line:

export CMF_AGENT_JAVA_HOME=/usr/lib/jvm/java-6-oracle

Save and close the cloudera-scm-agent file.

On every Cloudera Manager Agent host machine, configure the Cloudera Manager Agent to point to the Cloudera Manager Server by setting the following properties in the /etc/cloudera-scm-agent/config.ini configuration file. server_host will probably initially say "localhost" and server_port correctly point to 7182:

server_host=<Name of host machine where the Server is running>
server_port=<Port on host machine where the Server is running (default: 7182)>
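
For example (the hostname here is just a placeholder for your manager server):

sudo vi /etc/cloudera-scm-agent/config.ini
    server_host=mr-0x1
    server_port=7182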

Now run this command on each Agent machine. You'll typically also have an agent on the cloudera-scm-server machine

sudo service cloudera-scm-agent start
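
If an agent doesn't show up as a managed host, check its log (this is the usual location):

sudo tail -f /var/log/cloudera-scm-agent/cloudera-scm-agent.log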

Delete the chkconfig entries that the apt-get installs created, to avoid host inspector conflict messages

The Host Inspector in the browser will say these chkconfig entries (startup scripts) exist and may conflict with the manager (yellow messages). Delete them with these commands on all hosts:

chkconfig -l | grep hadoop
chkconfig --del hadoop-0.20-jobtracker hadoop-0.20-tasktracker hadoop-0.20-datanode hadoop-0.20-namenode hadoop-0.20-secondarynamenode

Add hdfs and mapred services using :7180 and browser

When you first go to the manager's :7180 address with the browser, you'll go through an install. Let it complete and fail on the hue install. Don't cancel it, as it will roll back instead of saving state saying the install is "done". Just go to the :7180 port again by typing the URL in the browser. You should see the cloudera manager home page then.

admin/admin is the user/password. It's best to start with fresh dirs for all the dfs namenode and datanode directories; there can be odd behaviors with namespace ids getting out of sync between the namenode and datanodes when you format otherwise (they don't format the datanodes).

When I configure hdfs in the browser, I remove the two default choices for the data node and name node data directories, and point them to a special partition. With just one choice, you also have to change "DataNode Failed Volumes Tolerated" to 0 (from 1).

Remember to point the secondary name node directories also, if you change from the default /opt....

When it runs the host inspector, you may see complaints about the hadoop and mapred users not having the correct group membership. It's easiest to fix that up at the command line with

usermod -a -G <group> hadoop
usermod -a -G <group> mapred

for whatever groups it mentions for a username (do this on all the offending nodes). If there is a version mismatch mentioned in the host inspector, just run the host upgrade wizard. You'll see the nodes in the managed hosts section if they have not been through that process already. (oracle j2sdk will probably get installed, but not used?)
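
For example, if the inspector says the mapred user should also be in the hadoop group:

sudo usermod -a -G hadoop mapred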

User permissions

It's useful to have just one username (hduser) for everyone to use, so that file permissions inside hdfs work out (root can't override them). But the single- and multi-jvm python tests run h2o as the current user, so you may want a particular user to have the right group permissions.

Best to put the user running h2o in the mapred, hadoop, hdfs (and hduser) groups, like this, on all hosts

sudo usermod -a -G mapred,hadoop,hdfs,hduser <username>

Log out and in again and use 'groups' to verify

That user may need to be added to all hosts with:

adduser <username>

Review the section below called "Hadoop creates this in /tmp when you do the first h2o ImportHdfs". You will probably need to change the permissions or owner (or both) on this dir to get H2O to ImportHdfs without permission problems:

hadoop dfs -chmod 777 hdfs://192.168.0.37/tmp/mapred/system

If you create directories for datasets in the hdfs filesystem, you may need to do it as user 'hdfs' first, or change the write permissions on the top level "/". The default umask for hadoop when you set up hdfs is 022 (you can change that in the cloudera manager configuration)

hadoop dfs -chmod 777 hdfs://192.168.0.37/

If you don't deal with that issue, you'll see stuff like this:

hadoop dfs -mkdir /datasets
  mkdir: org.apache.hadoop.security.AccessControlException: 
  Permission denied: user=hduser,    access=WRITE, 
  inode="/":hdfs:supergroup:drwxr-xr-x

There is a configuration variable that purports to turn off permission checks. You can clear the permission check in the cloudera manager, but I wasn't successful with that. It's better to get the permission stuff above sorted out correctly anyhow, I think.

Check HDFS Permissions
dfs.permissions 

Background: Hadoop creates this in /tmp when you do the first h2o ImportHdfs

The /tmp is created when you add the hdfs service

drwxrwxrwt   - mapred hdfs          0 2013-02-13 21:43 /tmp/mapred
hadoop dfs -ls hdfs://192.168.0.37/tmp/mapred
drwx------   - mapred hdfs          0 2013-02-13 21:43 /tmp/mapred/system

Note that only the owning user, mapred, has access to this system dir (not the group or others)

hadoop dfs -ls hdfs://192.168.0.37/tmp/mapred/system
ls: could not get get listing for 'hdfs://192.168.0.37/tmp/mapred/system' : 
org.apache.hadoop.security.AccessControlException: 
  Permission denied: user=kevin, access=READ_EXECUTE, 
  inode="/tmp/mapred/system":mapred:hdfs:drwx------

Rather than requiring the user to be 'mapred', I'll open this up to everyone with a hadoop chmod. I guess I could have changed the owner to be 'hduser' since that's a common user we use.

hadoop dfs -chmod 777 hdfs://192.168.0.37/tmp/mapred/system

Adding to core-site.xml to be able to use hdfs-relative path names

CDH3 doesn't expose this configuration variable in the cloudera manager, and CDH4 uses a different name (so it's deprecated there), but it still works.

It's useful to change core-site.xml so you can use relative URIs. You'll have to change core-site.xml on every machine where you want to use hadoop command line commands. This is useful for moving things to and from hdfs, and for manipulating dirs and permissions.

vi /etc/alternatives/hadoop-0.20.2-conf/core-site.xml so it looks like this. Replace 192.168.1.176 with the name node IP. Cloudera says it should be a hostname rather than an IP, but this seems to work. Add :<port> if you don't use the default port (8020 is the default).

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.176</value>
  </property>
</configuration>
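
With that in place, you can drop the full hdfs://<namenode> prefix shown in other examples on this page when working on those machines, e.g.:

hadoop dfs -ls /tmp
hadoop dfs -mkdir /datasets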

Security issues when starting the mapred service

Apparently /mapred is needed in the hdfs filesystem

I had usergroup exceptions reported in the logs when trying to start the mapred service. Finally I made myself hduser and created /mapred, which ended up with these permissions:

hadoop dfs -mkdir /mapred
hadoop dfs -ls /
    drwxr-xr-x   - hduser supergroup          0 2013-02-18 13:11 /mapred

hduser had these group memberships on the client machine (my laptop), which had cdh3 installed but was not part of the cluster:

hduser@Kevin-Ubuntu4:~$ groups hduser
    hduser : hduser sudo kevin 0xdiag hadoop hdfs mapred mapr

I was able to restart the mapred service after creating this. Interestingly, when I did the mkdir, it was created with group supergroup. It wouldn't let me change the owner without being a "superuser" (is that local su, or a member of the hdfs supergroup?)

hduser@Kevin-Ubuntu4:~$ hadoop dfs -chown mapred /mapred
chown: changing ownership of 'hdfs://192.168.1.176     
    /mapred':org.apache.hadoop.security.AccessControlException: Non-super user cannot change owner.

At the top of the hdfs filesystem now:

hadoop dfs -ls /
    drwxrwxrwx   - hdfs   supergroup          0 2013-02-14 12:12 /datasets
    drwxr-xr-x   - hduser supergroup          0 2013-02-18 13:11 /mapred
    drwxr-xr-x   - mapred supergroup          0 2013-02-18 13:12 /tmp

hadoop dfs -ls /tmp
drwxr-xr-x   - mapred supergroup          0 2013-02-18 13:12 /tmp/mapred

Local file system permissions for the dirs that hold the mapred and hdfs files

I had changed the local file permissions on /home/cdh3, which I created as the dir for all mapred-related files (set in the configuration when setting up the mapred service in cdh3). I'm not sure what chown and chgrp it should have, but I used "mapred" and "hadoop". I did a chmod 777 on that dir too. I guess I set up /home/cdh3 for both hdfs and mapred; under it are dfs and mapred.

hduser@mr-0x6:~$ cd /home
hduser@mr-0x6:/home$ ls -ltr | grep cdh3
drwxrwxrwx  4 mapred  hadoop   4096 Feb 18 12:48 cdh3

hduser@mr-0x6:/home/cdh3$ ls -ltr
drwxr-xr-x 4 hdfs   hadoop 4096 Feb 14 11:33 dfs
drwxrwxrwx 3 mapred hadoop 4096 Feb 18 12:48 mapred

hduser@mr-0x6:/home/cdh3$ cd mapred
hduser@mr-0x6:/home/cdh3/mapred$ ls -ltr
drwxr-xr-x 7 mapred hadoop 4096 Feb 18 13:12 local

hduser@mr-0x6:/home/cdh3/mapred/local$ ls -ltr
drwxr-xr-x 2 mapred mapred 4096 Feb 18 12:48 userlogs
drwx------ 2 mapred mapred 4096 Feb 18 13:12 ttprivate
drwxr-xr-x 2 mapred mapred 4096 Feb 18 13:12 tt_log_tmp
drwxr-xr-x 2 mapred mapred 4096 Feb 18 13:12 toBeDeleted
drwxr-xr-x 2 mapred mapred 4096 Feb 18 13:12 taskTracker

Example hadoop command lines (changed to "hdfs dfs" in CDH4)

hadoop dfs -ls <URI>
hadoop dfs -mkdir <URI>
hadoop dfs -copyFromLocal <localsrc> <URI>
hadoop dfs -copyToLocal <URI> <localdst>
hadoop dfs -chgrp hdfs <URI>
hadoop dfs -chown hdfs <URI>
hadoop dfs -chmod 777 <URI>
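
A typical sequence for getting a dataset into hdfs might look like this (the file and dir names are just examples):

hadoop dfs -mkdir /datasets
hadoop dfs -copyFromLocal ./covtype.data /datasets/
hadoop dfs -ls /datasets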

Java heap sizes for Cloudera services

This is a deep subject and may show up as an issue when installing on systems with lots of dram. Placeholder for now. Here's a nice link: http://samdarwin.blogspot.com/2013/01/hadoop-heap-size.html
