Using two or more docker containers in cluster config #325
Comments
I'm sure Brian will probably reply with a better answer but; why not create your own dockerfile and build your image from scratch? That way you'll have full control of what's in the image, rather than trying to stack them through the doAzureParallel cluster boot process (which sounds a bit iffy to me). There's documentation available here to do this. It sounds to me that the above method you describe probably wouldn't work, but I'm only basing that on a hunch, rather than any real deep technical understanding of how doAzureParallel works! |
I don't have experience writing Dockerfiles, but if it is possible to get Selenium and R working together in a Dockerfile and run it in Batch, I will certainly start learning. |
Dockerfiles are actually pretty straightforward - if you're already coding with R, Dockerfiles are trivial in comparison :) I'm not sure what OS you use, but I'm assuming Linux. I've also never used Selenium, so I quickly Googled and found this: https://tecadmin.net/setup-selenium-with-firefox-on-ubuntu/ A Dockerfile takes standard Ubuntu commands (e.g. apt-get install), so maybe something like the below will work for you:
You'd have to substitute in your different dependencies of course but maybe something like that will do the trick? Also have a Google and see if anyone else has already created a Docker image containing what you need. Unless it's something really niche, I bet someone has already created it. Or at least will have a dockerfile that has pretty much everything you need, and you can add in the last bits yourself. |
Is this what you're after, maybe? https://rpubs.com/johndharrison/RSelenium-Docker |
simon-tarr, thanks for help. I didn't mention, I am on windows 7. As you pointed out, there are docker images for Selenium: https://hub.docker.com/u/selenium/
If I run the above code in Batch it won't work, since there are no Selenium images and they are not activated. Is it enough to just add the Selenium image to the Dockerfile:
# Load rocker/tidyverse:3.4.1
FROM rocker/tidyverse:3.4.1
# Install any dependencies required for the R packages
RUN apt-get update
# Add all your other dependencies here
# Install the R Packages from CRAN
RUN Rscript -e 'install.packages(c())'
|
I've just realised what you meant in your first message - sorry, I misread/misunderstood. I initially thought you were asking about loading two different docker images when booting an Azure pool (i.e. in the doAzureParallel configuration file). I now see that you've added FROM selenium/node-firefox into the dockerfile (which admittedly makes a lot more sense!). I'd try building the image from the dockerfile above but remove libxi6, libgconf and default-jdk... I should imagine they'd be included in selenium/node-firefox. The instructions on creating containers can be found in the link that I posted in my first reply. Give it a try and let me know how you get on. |
I will give it a try today to check if it is working. Thanks again. |
simon-tarr, if I add the RSelenium package to the Dockerfile, I get several dependency errors:
Should I also add Rcpp to the installed packages? How should I know which dependencies to include? |
The error message above says which packages/dependencies you are missing. It looks like you're missing: Rcpp, xml2, semver, binman and wdman. If these are R packages, they can be installed via the following within your dockerfile:
You might find that certain libraries are required for these packages (e.g. the R package xml2 needs the library libxml2-dev to be installed but as this is already contained in the dockerfile, this particular R package will install fine, assuming it's the only library it requires), but hopefully the base docker image will have all of them already. If not you'll have to look at the error logs and see which ones are missing and add a line to your dockerfile (closing the line off with a backslash if it's not the last line, as per the example dockerfile above). |
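As a rough way to check which of the packages named in the error are still absent from an image, something like the sketch below could be run in an R session inside the container. The helper name `missing_packages` is made up for illustration; it only compares names against `installed.packages()` and does not look at system libraries such as libxml2-dev.

```r
# Packages the error message in this thread reports as missing
wanted <- c("Rcpp", "xml2", "semver", "binman", "wdman")

# Hypothetical helper: return the subset of `pkgs` that is not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(wanted)
if (length(to_install) > 0) {
  message("Still missing: ", paste(to_install, collapse = ", "))
}
```

Whatever is reported can then be added to the install.packages() call in the dockerfile.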
I think the problem is in:
It can't install Rcpp. Why not use rocker/tidyverse, it should contain packages like Rcpp? |
I'm not sure which docker images will contain what packages; you'd need to search to find out. According to the dockerfile for rocker/tidyverse it doesn't look like Rcpp is installed within this image: https://hub.docker.com/r/rocker/tidyverse/~/dockerfile/ |
I have moved forward since yesterday. I have successfully installed Rcpp and RSelenium. I am trying to install Java now, since the java.check() function returns an error. |
Here is the final docker file that works in docker locally, hope it will work on batch:
|
Looks good! If it creates an image locally then it should run on Azure :) Let me know how it turns out. |
I would like to ask one more question. It is not related to the docker container, but to performance. Let's say I have 4 nested for loops like this:
Is it best to put the foreach loop in the innermost for loop? I should say the ranges are somewhat random: sometimes the third loop runs over 1:10, or the last over 1:1000. Is it best to parallelise the last (innermost) loop? Also, is it possible to use the clusterEvalQ function? |
I'll be honest - this is a little beyond me - it melts my brain thinking of more than two nested loops. I personally find it really difficult to read nested loops and only use them if I really have no other alternative...is there no way of vectorising your code at any stage? EDIT - I realise that sometimes it's unavoidable to use for loops (and nested loops) but if you don't already know about vectorisation, here are some helpful links: http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html You're probably not going to see vast performance gains but it would make reading your code easier in some situations. |
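One possible way to avoid deciding which nesting level to parallelise is to flatten the loops into a single table of index combinations and iterate over its rows. The sketch below uses made-up ranges and a placeholder function `process_one`; neither comes from the original code.

```r
# Illustrative loop ranges; the real ones would come from your data
grid <- expand.grid(a = 1:2, b = 1:3, c = 1:2, d = 1:4)

# Placeholder for the work done in the innermost loop body
process_one <- function(a, b, c, d) a * b + c * d

# One flat pass over all combinations instead of four nested loops
results <- vapply(
  seq_len(nrow(grid)),
  function(k) process_one(grid$a[k], grid$b[k], grid$c[k], grid$d[k]),
  numeric(1)
)
```

With a flat grid like this, a single foreach over seq_len(nrow(grid)) would spread all combinations evenly, regardless of which loop happens to be long.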
simon-tarr, thanks for the links, I managed to simplify the loop. I have one, hopefully last, question. I need to start Selenium in every instance on every VM. It doesn't make sense to do that for every step in the loop. A better way would be to start it once on every node and then run the for loop. I know how to do that using the clusterEvalQ function, as in this code: but there is no cl (cluster) object in doAzureParallel? |
Hi @MislavSag You should be able to use the doParallel package in doAzureParallel. We use it in the merge task (the task that combines all of your loop results together). You will need to make sure that two tasks are not assigned to the same VM, or else both tasks will be fighting over the same VM's resources. Thanks! |
My idea is to start a web driver on every VM, and then use driver sessions across VMs and across nodes on every VM. I can't do that inside the foreach loop, since it distributes processes across nodes. In a normal R session it would go something like:
but I am not sure how to achieve this using Batch, since I have more than one VM. |
I will try one more time again :) On my local machine I can implement Selenium testing using headless Firefox in the following way:
My problem is that I don't know how to implement this code in Azure Batch. I know how to do a foreach loop, but I am not sure how to start headless Firefox on each VM, and then start sessions on each node (using clusterEvalQ)? |
Hi @MislavSag There's no clusterEvalQ equivalent in doAzureParallel. The doAzureParallel package is closer to doParallel than to the 'parallel' R package. Unfortunately, I'm not that familiar with the RSelenium package. Do you need to start the Selenium server for all the tasks on each VM? Or do you need to start the Selenium driver on each VM? There is a start task command line in the pool cluster config. This command will be run on the host, not in the docker image. My thought is that you can start another docker image with the Selenium driver in the start task command line. Make sure that docker image persists even after the start task ends, then make sure the ports are open for both docker images. https://github.com/Azure/doAzureParallel/blob/master/docs/01-getting-started.md#cluster-settings Thanks, |
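For reference, the local "initialise once per worker, then reuse" pattern being asked about can be sketched with the base parallel package alone. Here a plain list named `session` stands in for an opened Selenium session; it is a placeholder for illustration, not RSelenium code.

```r
library(parallel)

cl <- makeCluster(2)

# One-time setup on each worker; `session` is a stand-in for a driver session
invisible(clusterEvalQ(cl, {
  session <- list(id = Sys.getpid())
  NULL
}))

# Later calls reuse the `session` object that already lives on each worker
ids <- clusterApply(cl, 1:4, function(i) session$id)

stopCluster(cl)
```

Four calls are served by at most two distinct worker processes, so the setup cost is paid once per worker rather than once per iteration.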
Hi @brnleehng, I thought it would be best to run 2 docker containers first, but I didn't know how to open a port in one container and use it from another. So I chose another way. I built a docker image that installs Firefox, Java and RSelenium, everything needed to run Selenium inside R. I ran a container from this image in docker on my local machine and it worked: I could start Selenium and run some tests. I would like to do that in parallel in Batch, which means I should execute the following command on every VM: This command would open port 4567L and make it available for RSelenium drivers. I can't send this command through the foreach function, since it is not possible to open the same port (4567L) more than once on one machine. |
Hi Mislav, If you have the dockerfile up on Docker Hub, can you share it with me? There is a way of parallelizing your tests with RSelenium and doAzureParallel. However, doAzureParallel does not support this scenario out of the box. Here's the example:
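If separate Selenium servers per worker were acceptable, one way around the port clash might be to give each worker on a VM its own port rather than sharing 4567L. The sketch below only computes the port numbers; `n_workers` is an assumed value, and each worker would then pass its own entry as the `port` argument of RSelenium::rsDriver.

```r
base_port <- 4567L
n_workers <- 4L  # assumption: e.g. one worker per core on a 4-core VM

# One distinct port per worker, starting from the default 4567L
ports <- base_port + seq_len(n_workers) - 1L
```

This trades the single shared server for several independent ones, so it uses more memory per VM.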
# Create a cluster with 5 VMs with 4 workers
cl <- doAzureParallel::makeCluster("cluster.json")
doAzureParallel::registerDoAzureParallel(cl)
# finally, do Selenium test in parallel
foreach_loop <- foreach(i = 1:nrow(df),
                        .packages = c("RSelenium", "doParallel", "parallel", "RCurl", "httr"),
                        # .combine = 'rbind',
                        .export = c("df")) %dopar% {
  cl <- parallel::makeCluster(4)
  registerDoParallel(cl)
  rD <- RSelenium::rsDriver(
    browser = "firefox",
    extraCapabilities = list(
      "moz:firefoxOptions" = list(
        args = list('--headless')
      )
    )
  )
  # export to nodes and start driver on every node
  # (rD is local to this task, so export it from the current environment)
  clusterExport(cl, "rD", envir = environment())
  clusterEvalQ(cl, {
    library(RSelenium)
    library(RCurl)
    library(httr)
    driver <- rD$client
    driver$open()
    Sys.sleep(2)
    driver$navigate("http://www.google.com")
  })
}
Thanks, |
Hi Brian, You can find my dockerfile here: Thanks for the help. Before I try it, I would like to check that I have got everything right. My cluster.json file should look like this:
Then start the cluster with:
Should I skip this part?:
Then I see you have started RSelenium on every VM inside the foreach loop. But this should be called only once, at the beginning of the loop. If I add one more line of code to your foreach loop, it will start the driver every time:
How to avoid cl |
Sorry, updated the example. You are running one task on each VM. Azure Batch schedules each task on a VM first, round-robin style. You can keep this section:
setVerbose(TRUE)
setAutoDeleteJob(FALSE)
generateCredentialsConfig("credentials.json")
setCredentials("credentials.json")
generateClusterConfig("cluster.json")
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)
getDoParWorkers()
opt <- list(wait = FALSE)
By using this method, you need the same number of tasks as VMs. In this case, it will be 2 tasks. The property
{
"name": "scraping",
"vmSize": "Standard_F4",
"maxTasksPerNode": 1,
"poolSize": {
"dedicatedNodes": {
"min": 2,
"max": 2
},
"lowPriorityNodes": {
"min": 0,
"max": 0
},
"autoscaleFormula": "QUEUE"
},
"containerImage": "theanswer0207/firefox-headless-r:latest",
"rPackages": {
"cran": [],
"github": [],
"bioconductor": []
},
"commandLine": [],
"subnetId": ""
}
# Create the cluster described in cluster.json (2 VMs, one task per VM)
cl <- doAzureParallel::makeCluster("cluster.json")
doAzureParallel::registerDoAzureParallel(cl)
number_of_vms <- 2
# finally, do Selenium test in parallel
foreach_loop <- foreach(i = 1:number_of_vms,
                        .packages = c("RSelenium", "doParallel", "parallel", "RCurl", "httr"),
                        # .combine = 'rbind',
                        .export = c("df")) %dopar% {
  # Build the URL with paste0(); download.file() also needs a destination file
  data_file <- paste0("sample_", i, ".csv")
  download.file(paste0("https://raw.githubusercontent.com/Test/Sample/", i, ".csv"),
                destfile = data_file)
  data <- read.csv(data_file)
  # Standard_F4 are 4 core machines. Use all cores, same as on your current workstation
  cl <- parallel::makeCluster(4)
  registerDoParallel(cl)
  rD <- RSelenium::rsDriver(
    browser = "firefox",
    extraCapabilities = list(
      "moz:firefoxOptions" = list(
        args = list('--headless')
      )
    )
  )
  # export to nodes and start driver on every node
  # (rD is local to this task, so export it from the current environment)
  clusterExport(cl, "rD", envir = environment())
  clusterEvalQ(cl, {
    library(RSelenium)
    library(RCurl)
    library(httr)
    driver <- rD$client
    driver$open()
    Sys.sleep(2)
    driver$navigate("http://www.google.com")
  })
}
Thanks, |
I am running something right now on Azure batch. After that, I will try your code immediately! |
I have one question. Why does this part:
have to be inside the foreach loop? |
I returned to this problem today. No success. The main problem is this function:
This function starts a Selenium server and browser. It should be started on each node (VM) separately. For example, on my local machine I start it once and then all cores can connect to that driver. If I just put this command inside the foreach loop, it would try to start the driver several times on the same machine, which doesn't make sense. |
Hi @MislavSag I don't think doAzureParallel works out of the box here, because the container is removed immediately after it's used (on every task). The workaround is the example shown above. Basically, we are running doParallel, which also starts the Selenium server and browser, on each VM as a single task. The caveat is that you need to run the foreach up to the number of VMs available (for example, foreach(i = 1:number_of_vms)), and you have to manage the data spread yourself. Thanks, |
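Managing the data spread could be as simple as splitting the input rows into one chunk per VM before the loop. The sketch below is one way to do it in base R; the data frame contents are made up, and `number_of_vms` is assumed to match the pool size in cluster.json.

```r
number_of_vms <- 2

# Made-up input: one row per page to test
df <- data.frame(url = paste0("http://example.com/page", 1:7))

# Assign rows to VMs round-robin, then split into one data frame per VM
chunk_id <- rep(seq_len(number_of_vms), length.out = nrow(df))
chunks <- split(df, chunk_id)
```

Inside foreach(i = 1:number_of_vms), task i would then work only on chunks[[i]].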
If I use
is the number of VMs equal to 2 or 4 (2 * 2)? |
You will have two VMs. In the above, 'min' refers to the minimum number of VMs you want in your cluster, and 'max' refers to the maximum. When min = max, your cluster won't autoscale and, in this example, you'd have 2 VMs. More information on autoscaling can be found in the documentation: https://github.com/Azure/doAzureParallel/blob/master/docs/32-autoscale.md |
I am new to the Azure Batch service and this package. I followed the instructions on the introduction page and successfully implemented Azure batching for a simple foreach loop.
I saw that in the configuration file there is a parameter "containerImage" with the default "rocker/tidyverse:3.4.1". I am not sure whether it is possible to add two or more images in "containerImage" and use both. More concretely, is it possible to put the "selenium/standalone-firefox" image in the containerImage parameter and pull it together with "rocker/tidyverse:3.4.1"? If the answer is yes, is it possible to run Selenium inside an R script in the usual way using the RSelenium package?