This R package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster.
- A Hadoop cluster: CDH3 and higher, or Apache 1.0.2 and higher, but limited to mr1, not mr2. Compatibility with mr2 starts with Apache 2.2.0 or HDP2. For configuration suggestions, see Memory management in rmr2.
- R installed on each node of the cluster (developed and tested on R 2.14.1). Revolution R Community 4.3 or 5.0 can be used if you upgrade to RJSONIO 0.95 (which must be downloaded from CRAN, as it is not available in the REVO 2.12 repository) and create a symbolic link from /usr/bin/Revoscript to /usr/bin/Rscript. See Compatibility.
- Install the required R packages on each node. Check the `Depends:` line of the DESCRIPTION file for the most up-to-date list of dependencies. The suggested `quickcheck` package is needed only for testing, and a link to it can be found on its repo.
- rmr2 itself needs to be installed on each node. Download it from the Downloads page and then, at the shell prompt, enter `R CMD INSTALL rmr2_<specific version>.tar.gz`. rmr2 is not available on CRAN.
- Make sure that the packages are installed in a default location accessible to all users (R will run on the cluster as a different user from the one who has started the R interpreter where the mapreduce calls have been executed) on every node.
- Make sure that the environment variables `HADOOP_CMD` and `HADOOP_STREAMING` are properly set. The former should point to the main `hadoop` command, the latter to the streaming jar, a file called something like `hadoop-streaming*.jar` that is part of most Hadoop distributions. For some distributions, `HADOOP_HOME` is still sufficient for R to find everything that's needed, so if that works for you, you can keep it that way, but it is no longer recommended. Optionally, you can set `HDFS_CMD` if rmr can't find the `hdfs` executable, which only results in some deprecation warnings. Its value should be the path to the `hdfs` command.
Examples:
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
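
Before starting R, it can help to confirm that these variables point at real files. The following is a minimal sketch, not part of rmr2 or Hadoop: `check_rmr_env` and `find_streaming_jar` are hypothetical helper names, and the default search directory `/usr/lib/hadoop` is only an assumption taken from the example path above, so adjust it to your distribution.

```shell
# Sketch: sanity-check the environment variables that rmr2 relies on.
# Both function names are hypothetical helpers, not rmr2 or Hadoop commands.

check_rmr_env() {
    rc=0

    # HADOOP_CMD should point to the main hadoop executable.
    if [ -z "${HADOOP_CMD:-}" ]; then
        echo "HADOOP_CMD is not set" >&2
        rc=1
    elif [ ! -f "$HADOOP_CMD" ] || [ ! -x "$HADOOP_CMD" ]; then
        echo "HADOOP_CMD is not an executable file: $HADOOP_CMD" >&2
        rc=1
    fi

    # HADOOP_STREAMING should point to the streaming jar.
    if [ -z "${HADOOP_STREAMING:-}" ]; then
        echo "HADOOP_STREAMING is not set" >&2
        rc=1
    elif [ ! -f "$HADOOP_STREAMING" ]; then
        echo "HADOOP_STREAMING is not a regular file: $HADOOP_STREAMING" >&2
        rc=1
    fi

    return "$rc"
}

# List candidate streaming jars under a directory.
# The default directory is an assumption; pass your own as the first argument.
find_streaming_jar() {
    find "${1:-/usr/lib/hadoop}" -name 'hadoop-streaming*.jar' 2>/dev/null
}
```

After sourcing these functions, `check_rmr_env && echo "environment looks fine"` reports any problem on standard error, and `find_streaming_jar /path/to/hadoop` lists jars that could serve as the value of `HADOOP_STREAMING`.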
For people who use RPMs for their deployments, courtesy of jseidman, we have RPMs for rmr and its dependencies. These RPMs are available in this repo: https://github.com/jseidman/pkgs. Note that there are currently only CentOS 5.5 64-bit RPMs, but the source files to create the RPMs are in the same repo, so it should be easy to build for other RH-based distros. jseidman reports using RPMs along with Puppet to deploy all packages, applications, etc. to their (Orbitz) Hadoop clusters.
For people who use EC2 (not EMR), in the source package under the tools directory there is a whirr script to fire up an EC2 rmr cluster.
If you use Globus Provision, check out https://github.com/nbest937/gp-rhadoop (very alpha as of this edit), courtesy of nbest.
MapR provides specific instructions for their distribution of Hadoop.