Draft: Arc runner computeenv #16750

Draft, wants to merge 26 commits into base: dev

@maikenp maikenp commented Sep 27, 2023

This PR is related to #16653

It is based on the same branch, but I now use tool_script.sh as the executable that ARC runs.
I also demonstrate a brute-force, simplistic way of using the compute_environment to rewrite the input paths, so that ARC gets the correct paths on the command line when running on a remote site that does not share a directory with the Galaxy server.
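
The path rewrite described above can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `RemotePathRewriter` and its methods are hypothetical names, and it assumes every staged input lands directly in the remote session directory under its base name.

```python
import posixpath


class RemotePathRewriter:
    """Hypothetical helper: maps Galaxy-side paths to paths valid on the remote site."""

    def __init__(self, remote_dir="."):
        # Assumption: all inputs are staged into one flat remote directory.
        self.remote_dir = remote_dir
        self.mapping = {}

    def rewrite(self, local_path):
        """Return the remote path for a Galaxy path and remember the mapping."""
        remote_path = posixpath.join(self.remote_dir, posixpath.basename(local_path))
        self.mapping[local_path] = remote_path
        return remote_path

    def rewrite_command(self, command):
        """Replace every known Galaxy path in a command line with its remote path."""
        # Longest paths first, so a path that is a prefix of another is not clobbered.
        for local_path in sorted(self.mapping, key=len, reverse=True):
            command = command.replace(local_path, self.mapping[local_path])
        return command


if __name__ == "__main__":
    rewriter = RemotePathRewriter()
    dataset = "/storage/galaxy/data/datasets/4/5/c/dataset_45cf1bcb-28b1-4167-878d-1fb17636064e.dat"
    rewriter.rewrite(dataset)
    print(rewriter.rewrite_command(f"/bin/bash '{dataset}' --test"))
```

A flat mapping like this breaks down as soon as two inputs share a base name or a tool expects a directory layout, which is part of why it is only a prototype approach.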

I am still using a custom ARC tool to avoid complications with complex path resolution. One example is this tool: https://github.com/galaxyproject/tools-iuc/blob/main/tools/minimap2/minimap2.xml, whose generated command line contains soft links:

ln -f -s '/storage/galaxy/data/datasets/4/5/c/dataset_45cf1bcb-28b1-4167-878d-1fb17636064e.dat' reference.fa && minimap2 --q-occ-frac 0.01 -t ${GALAXY_SLOTS:-4} reference.fa '/storage/galaxy/data/datasets/4/5/c/dataset_45cf1bcb-28b1-4167-878d-1fb17636064e.dat' -a | samtools view --no-PG -hT reference.fa | samtools sort -@${GALAXY_SLOTS:-2} -T "${TMPDIR:-.}" -O BAM -o '/storage/galaxy/data/jobs/000/228/outputs/dataset_03e29ceb-8733-4836-8981-90391c58105f.dat'

I do not currently know how to solve this.

The custom test prototyping tool is:

<tool id="hello_arc" name="ARC hello example">
  <description>is a simple hello world python script</description>

 <command detect_errors="exit_code"><![CDATA[  
 /bin/bash '$arcjob_exe' "$message" --test >> command_out.txt; 
 /bin/echo "Some post processing job"  >> command_out.txt
 ]]>
 </command>
  
  <inputs>
   <param name="message" type="text" value="Hi from galaxy job wrapper" label="Hello message"/>
   <param name="arcjob_exe" type="data" label="ARC executable"/>
   <param name="arcjob_remote_filelist" type="data"  label="Upload file with list of remote file source paths"/>
 </inputs>

 
 <outputs>
   <collection name="output_job" type="list"  label="ARC job outputs">
     <discover_datasets pattern="(?P&lt;designation&gt;.+.*\..*)\.*" directory="./"   recurse="true" />
   </collection>
   

   <collection name="output_all" type="list"  format="txt" label="ARC job logs">
     <discover_datasets pattern="(?P&lt;designation&gt;.+.*).*" directory="./gmlog/"   recurse="true" />
   </collection>

   <!-- This does not work, as Galaxy moves the data source to the job's workdir in Galaxy, and if I rewrite the path then Galaxy obviously does not find the correct path.
	Not sure how to disentangle the move from the data folder to the jobs folder from the path rewrite for the remote resource. -->
   <!--<data name="outputfile" format="txt"/>-->
 </outputs>
 
 <help>
   A tool used for testing the basic features for a job running on a remote site with ARC. 
  </help>
</tool>

This tool demonstrates uploading local Galaxy files to the remote ARC server, and downloading all output from the job on the ARC side back to Galaxy. Note, however, that the pattern for finding the ARC logs currently does not work (wrong pattern), although the files are present in the Galaxy job directory.
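
To see what the two `discover_datasets` patterns from the tool actually match, they can be tried against a few file names from the job directory. This is only a sketch: it assumes the pattern is applied to each bare file name with Python's `re.match`, which may not be exactly how Galaxy evaluates it, and `designation_for` is a hypothetical helper.

```python
import re

# The two patterns from the tool XML, with the XML escapes (&lt; and &gt;) undone.
OUTPUT_PATTERN = r"(?P<designation>.+.*\..*)\.*"
LOG_PATTERN = r"(?P<designation>.+.*).*"


def designation_for(pattern, filename):
    """Return the captured designation, or None if the name does not match."""
    match = re.match(pattern, filename)
    return match.group("designation") if match else None


if __name__ == "__main__":
    for name in ("arcout1.txt", "arc.out", "local", "errors"):
        print(name, designation_for(OUTPUT_PATTERN, name), designation_for(LOG_PATTERN, name))
```

Under that assumption the log pattern does match extension-less names such as `local`, so it may also be worth checking whether `./gmlog/` resolves at all, since the job directory listing shows `gmlog` inside a per-job subdirectory of `working`.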
In addition, purely for testing purposes, the tool expects a file with a list of remote input files that ARC will download on the remote server side, without involving Galaxy at all. This is not the solution we will end up with; it only demonstrates the data-staging capability of ARC.

The tool is tested using the following file-list file:

$ cat remote_list.txt
http://download.nordugrid.org/repos/6/centos/el7/source/updates/repodata/repomd.xml
http://download.nordugrid.org/repos/6/rocky/9/source/updates/repodata/repomd.xml
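
A file list like this has to be turned into (name, URI) pairs before it can be declared as job input. The sketch below shows one way to do that; `parse_remote_list` is a hypothetical helper, not part of the runner, and the rule for making names unique is an assumption. It matters here because both test URLs end in `repomd.xml`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def parse_remote_list(text):
    """Turn a newline-separated list of URIs into unique (name, uri) pairs."""
    entries = []
    seen = set()
    for line in text.splitlines():
        uri = line.strip()
        if not uri or uri.startswith("#"):
            continue  # skip blank lines and comments
        name = PurePosixPath(urlparse(uri).path).name
        unique = name
        counter = 1
        # Assumed naming rule: prefix a counter when the base name is already taken.
        while unique in seen:
            unique = f"{counter}_{name}"
            counter += 1
        seen.add(unique)
        entries.append((unique, uri))
    return entries


if __name__ == "__main__":
    listing = (
        "http://download.nordugrid.org/repos/6/centos/el7/source/updates/repodata/repomd.xml\n"
        "http://download.nordugrid.org/repos/6/rocky/9/source/updates/repodata/repomd.xml\n"
    )
    for name, uri in parse_remote_list(listing):
        print(name, uri)
```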

For testing, also upload a bash script to be called by tool_script.sh, for example runhello.sh:

#!/bin/bash

/bin/echo "Hello from runscript to arcout1.txt" >> ./arcout1.txt
/bin/echo "Running on compute node: $(hostname)" >> ./arcout1.txt
/bin/echo "Time is $(date)" >> ./arcout1.txt
/bin/echo "Sleeping for 60s " >> ./arcout1.txt

/bin/echo "Hello from runscript to arcout2.txt" >> ./arcout2.txt
/bin/echo "Running on compute node: $(hostname)" >> ./arcout2.txt
/bin/echo "Time is $(date)" >> ./arcout2.txt
/bin/echo "Sleeping for 60s " >> ./arcout2.txt

/bin/sleep 60

>&1 /bin/echo "Some output to stdout from the runhello.sh executable."
>&1 /bin/echo "More output to stdout from runhello.sh - this is from message tag in tool:  $1"

The tool_script.sh produced by Galaxy is:

/bin/bash './runhello.sh' "Hi from galaxy job wrapper" --test >> command_out.txt; /bin/echo "Some post processing job"  >> command_out.txt

Once the job is done, the working directory of the Galaxy job looks like this:

[root@galaxy-arc-test-fresh ~]# ls -lhrt /storage/galaxy/data/jobs/000/284
total 12K
drwxr-xr-x. 2 galaxy galaxy   6 Sep 27 15:03 inputs
-rwxr-xr-x. 1 galaxy galaxy 347 Sep 27 15:03 tool_script.sh
drwxr-xr-x. 3 galaxy galaxy  26 Sep 27 15:05 working
drwxr-xr-x. 2 galaxy galaxy  44 Sep 27 15:05 outputs
-rw-r--r--. 1 galaxy galaxy  59 Sep 27 15:05 galaxy_284.o
-rw-r--r--. 1 galaxy galaxy  59 Sep 27 15:05 galaxy_284.e
[root@galaxy-arc-test-fresh ~]# ls -lhrt /storage/galaxy/data/jobs/000/284/working/
total 0
drwxr-xr-x. 3 galaxy galaxy 151 Sep 27 15:05 2812507f8fca
[root@galaxy-arc-test-fresh ~]# ls -lhrt /storage/galaxy/data/jobs/000/284/working/2812507f8fca/
total 20K
drwxr-xr-x. 2 galaxy galaxy 154 Sep 27 15:05 gmlog
-rw-r--r--. 1 galaxy galaxy   0 Sep 27 15:05 arc.out
-rw-r--r--. 1 galaxy galaxy   0 Sep 27 15:05 arc.err
-rw-r--r--. 1 galaxy galaxy 160 Sep 27 15:05 arcout1.txt
-rw-r--r--. 1 galaxy galaxy 160 Sep 27 15:05 arcout2.txt
-rw-r--r--. 1 galaxy galaxy 183 Sep 27 15:05 command_out.txt
-rw-r--r--. 1 galaxy galaxy 648 Sep 27 15:05 runhello.sh
-rw-r--r--. 1 galaxy galaxy 347 Sep 27 15:05 tool_script.sh
[root@galaxy-arc-test-fresh ~]# ls -lhrt /storage/galaxy/data/jobs/000/284/working/2812507f8fca/gmlog/
total 176K
-rw-r--r--. 1 galaxy galaxy  781 Sep 27 15:05 local
-rw-r--r--. 1 galaxy galaxy 141K Sep 27 15:05 errors
-rw-r--r--. 1 galaxy galaxy 1.1K Sep 27 15:05 description
-rw-r--r--. 1 galaxy galaxy  317 Sep 27 15:05 diag
-rw-r--r--. 1 galaxy galaxy    8 Sep 27 15:05 status
-rw-r--r--. 1 galaxy galaxy 1.7K Sep 27 15:05 xml
-rw-r--r--. 1 galaxy galaxy    0 Sep 27 15:05 input
-rw-r--r--. 1 galaxy galaxy    2 Sep 27 15:05 output
-rw-r--r--. 1 galaxy galaxy   30 Sep 27 15:05 input_status
-rw-r--r--. 1 galaxy galaxy  367 Sep 27 15:05 statistics

The working directory thus contains all the ARC logs in addition to the job's output files.

This version of the ARC runner depends on pyarcrest 0.3: https://pypi.org/search/?q=pyarcrest (temporarily uploaded for testing purposes; later it will be included in the NorduGrid ARC distribution, http://www.nordugrid.org/arc/arc6/common/repos/repository.html).

Next steps towards making this production ready

  • How to sort out the path rewrites. In ARC all input and output files need to be tagged as such, and from Galaxy I can only identify the input and output files via `job_wrapper.get_job().get_input_datasets()` and `.get_output_datasets()`. Reference data that lives on the Galaxy server is a problem: if those files cannot be retrieved from the job_wrapper, ARC will not work for those tools.
  • Understand what is necessary in Galaxy to use the built-in ARC data-staging capabilities to the fullest.
    • If given the URI of a file, ARC downloads the file to the remote server before the job starts running on the server's compute nodes. If the file was already downloaded before (by this user or others), it is in the cache and no download is necessary.
  • Discussion with Marius on solving complex output resolution in Galaxy. A small snippet of that discussion was attached here as two screenshots (2023-09-27 at 16:54:59 and 16:52:38).
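
The caching behaviour mentioned in the data-staging item can be illustrated with a toy model. This is not ARC's implementation: `StagingCache` is a hypothetical class, the download function is injected by the caller, and keying the cache on a hash of the URI is an assumption made for the sketch.

```python
import hashlib


class StagingCache:
    """Toy illustration of URI-keyed caching; not ARC's implementation."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable taking a URI and returning the file content
        self._store = {}
        self.downloads = 0

    def get(self, uri):
        """Return the content for a URI, downloading only on the first request."""
        key = hashlib.sha256(uri.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._fetch(uri)
            self.downloads += 1
        return self._store[key]


if __name__ == "__main__":
    # A fake fetch function stands in for the real transfer.
    cache = StagingCache(lambda uri: f"content of {uri}".encode("utf-8"))
    uri = "http://download.nordugrid.org/repos/6/rocky/9/source/updates/repodata/repomd.xml"
    cache.get(uri)
    cache.get(uri)
    print(cache.downloads)
```

The point for Galaxy is that a job which declares its inputs by URI lets the remote side decide whether a transfer is needed at all.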

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

maikenp and others added 25 commits September 1, 2023 14:12
Adding pyarcrest conditional requirement for the ArcRESTJobRunner
ArcRESTJobRunner
Require version 0.2 which is compatible with the ARC job runner.
…cessary job_actions method. Further improvements to the action on jobs will come.
@github-actions github-actions bot added this to the 23.2 milestone Sep 27, 2023
@maikenp maikenp marked this pull request as draft September 27, 2023 18:39
@mvdbeek mvdbeek modified the milestones: 23.2, 24.0 Dec 19, 2023
return
""" prepare_job() calls prepare() but not allowing to pass a compute_environment object
As I need to define my own compute_environment for the remote compute I must call it here passing the compute_environment
TODO - not a good solution"""

I'm facing a similar problem with the DIRAC job runner and thought of modifying prepare_job to take an extra argument, compute_environment (default None), so that it keeps working with existing implementations but allows customisation.
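
The suggestion could look roughly like the sketch below. Both classes are hypothetical stand-ins reduced to the one call under discussion; Galaxy's real `prepare_job` takes more arguments and does more work, so this only illustrates the backward-compatible default.

```python
class JobWrapperStub:
    """Stand-in for Galaxy's job wrapper; it only records what prepare() received."""

    def __init__(self):
        self.prepared_with = "not called"

    def prepare(self, compute_environment=None):
        self.prepared_with = compute_environment


class RunnerSketch:
    """Hypothetical runner base class, reduced to the method under discussion."""

    def prepare_job(self, job_wrapper, compute_environment=None):
        if compute_environment is None:
            # Existing runners omit the argument and keep the current behaviour.
            job_wrapper.prepare()
        else:
            # Remote runners (ARC, DIRAC) pass their own environment through.
            job_wrapper.prepare(compute_environment=compute_environment)
        return True


if __name__ == "__main__":
    wrapper = JobWrapperStub()
    RunnerSketch().prepare_job(wrapper)
    print(wrapper.prepared_with)
```

With a default of None, runners that never pass the argument are unaffected, while a remote runner can call `prepare_job(job_wrapper, compute_environment=my_env)` instead of calling `prepare()` itself.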

@volodymyrss

Thanks for working on this, we need it so much in our astroparticle physics domains!

5 participants