FAQ
I think that documentation should be organized better than a FAQ, but if information hasn't found its final destination yet, it's better to have some place for it. My goal is to keep the FAQ a staging area and not let it grow into the documentation.
R and all the libraries needed, including rmr, its dependencies, and any other library that the developer's code needs (a sketch of this manual route follows below). This is far from perfect, and we are aware that some people simply can't install anything on their cluster because of policies, let alone technical difficulties. There seems to be a way around it, but it's not very easy to implement. In dev, under tools/whirr, there is a whirr script that will do this for you, tested on AWS EC2.
Read this too: https://github.com/RevolutionAnalytics/RHadoop/wiki/Installing-RHadoop-on-RHEL
As an unsupported demo, you can look under tools/whirr.
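For the manual route, a rough sketch of what has to run on every node follows; the dependency list and the tarball name are assumptions that should be checked against the DESCRIPTION file of the rmr version you are installing:

```r
# Run on every node, as a user allowed to write to the system-wide library.
# The dependency list below is an assumption; check the DESCRIPTION file of
# the rmr release you are installing.
install.packages(
  c("Rcpp", "RJSONIO", "digest", "functional", "reshape2", "stringr"),
  repos = "http://cran.r-project.org",
  lib   = "/usr/share/R/library")

# Then install the rmr tarball itself (example file name).
install.packages("rmr2_2.0.1.tar.gz", repos = NULL, type = "source",
                 lib = "/usr/share/R/library")
```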
If the errors are transient, Hadoop may retry a specific task a number of times. If that number is exceeded, the Hadoop job will fail and so will the mapreduce call. The stderr from R is your friend. In Hadoop standalone mode, which is highly recommended for development, it simply shows up in the console, mixed with somewhat verbose Hadoop output. In pseudo-distributed and distributed modes it ends up in a file somewhere (the task attempt logs). This is a good resource about that.
See [Debugging rmr programs](user>rmr>Debugging rmr programs)
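For quick iteration it helps to develop against the local backend and to write diagnostics to standard error, which is what ends up in the console or the task logs. A minimal sketch, assuming the rmr2 API (rmr.options, mapreduce, keyval) and made-up data:

```r
library(rmr2)

# Develop against the local backend first; switch to "hadoop" once the job works.
rmr.options(backend = "local")

result <- mapreduce(
  input = to.dfs(1:100),
  map = function(k, v) {
    # Anything written to stderr shows up in the console (local/standalone)
    # or in the task logs (pseudo-distributed and distributed modes).
    cat("mapper received", length(v), "values\n", file = stderr())
    keyval(v %% 10, v)
  })

from.dfs(result)
```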
This isn't an easy question, but let me try. Understanding mapreduce is the first priority (the original Google paper is still the reference point). Then read a variety of papers with different applications, probably starting with the ones closest to one's problem domain. Cloudera's Hammerbacher has a collection on Mendeley, and another one is on the atbrox blog. Somewhat off topic, I would also recommend that people acquaint themselves with the parallel programming literature for architectures other than mapreduce.
In many instances this is caused by permissions on, or the location of, the package. When running in pseudo-distributed or distributed mode, the R processes running on the nodes are executed not as "you", the user who issued the mapreduce job, but as some dedicated user (which user depends on the specific distro). The surefire approach seems to be to install all the needed packages (rmr, its dependencies and any other packages the user code needs) in a common location such as /usr/share/R/library, installing them with
R> install.packages("rmr", lib = "/usr/share/R/library")
That path has to be accessible to R. You can check that with the function .libPaths(), but it's not guaranteed that it will return the same values when run as you and when run as the hadoop user in the context of a mapreduce job. For Revolution R the equivalent location is something like /usr/lib64/Revo-6.0/R-2.14.2/lib64/R/library, of course adapted to your specific version.
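One way to see what the task processes actually have on their search path is a throwaway map-only job that reports .libPaths() and the user it runs as; a minimal sketch, assuming the rmr2 API and that rmr itself is already loadable on the nodes:

```r
library(rmr2)

# Report the effective user and library search path as seen by the tasks,
# which need not match what you see in your interactive session.
from.dfs(
  mapreduce(
    input = to.dfs(1),
    map = function(k, v) keyval(Sys.info()[["user"]], list(.libPaths()))))
```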
Yes, for a map-only job just leave the reduce parameter at its default, NULL. This has changed in version 2.0; before, you needed to supply the following argument to mapreduce:
backend.parameters = list(hadoop = list(D = 'mapred.reduce.tasks=0'))
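A minimal sketch of a map-only job under 2.0 or later, with made-up data:

```r
library(rmr2)

# reduce is left at its default (NULL), so the map output is written out
# directly, with no reduce phase.
squares <- mapreduce(
  input = to.dfs(1:10),
  map = function(k, v) keyval(v, v^2))

from.dfs(squares)
```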
library(devtools)
install_github(repo = 'RHadoop', username = 'RevolutionAnalytics', branch = 'rmr-2.0.1', subdir = 'rmr2/pkg')
where branch is what you need to change (and the package was called rmr before 2.0.0). You still have to do this on every node of the cluster, no shortcuts there.
The likely cause is that RStudio does not read your environment. See this discussion on the RStudio support site.
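A common workaround is to set the environment variables rmr looks for, such as HADOOP_CMD and HADOOP_STREAMING, from within the R session before loading the package; the paths below are examples and depend on your Hadoop distribution:

```r
# Set the variables rmr expects before loading it; adjust the paths
# to your Hadoop installation.
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop",
           HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")

library(rmr2)
```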
Many thanks to Joe, Iver and the other users who provided or suggested content for this page.