- **Preparatory steps**

  - If you are not connected to the SWC network: connect to the SWC VPN.
- **Connect to the SWC HPC cluster**

  ```
  ssh <SWC-USERNAME>@ssh.swc.ucl.ac.uk
  ssh hpc-gw1
  ```

  It may ask for your password twice. To set up SSH keys for the SWC cluster, see this guide.
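  If you connect often, the two hops above can be captured in `~/.ssh/config`. A minimal sketch, assuming the host aliases `swc-gateway` and `swc-hpc` (both names are made up; substitute your own username for `<SWC-USERNAME>`):

  ```
  Host swc-gateway
      HostName ssh.swc.ucl.ac.uk
      User <SWC-USERNAME>

  Host swc-hpc
      HostName hpc-gw1
      User <SWC-USERNAME>
      ProxyJump swc-gateway
  ```

  With this in place, `ssh swc-hpc` performs both hops in a single command.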
- **Download the evaluation script from the 🦀 repository**

  To do so, run the following command, which will download a bash script called `run_evaluation_array.sh` to the current working directory:

  ```
  curl https://raw.githubusercontent.com/SainsburyWellcomeCentre/crabs-exploration/main/bash_scripts/run_evaluation_array.sh > run_evaluation_array.sh
  ```

  This bash script launches a SLURM array job that evaluates an array of trained models. The version of the bash script downloaded is the one at the tip of the `main` branch in the 🦀 repository.
  > **Tip**
  >
  > To retrieve a version of the file that is different from the one at the tip of `main`, edit the remote file path in the `curl` command:
  >
  > - For example, to download the version of the file at the tip of a branch called `<BRANCH-NAME>`, replace `main` with `<BRANCH-NAME>`:
  >
  >   ```
  >   https://raw.githubusercontent.com/SainsburyWellcomeCentre/crabs-exploration/<BRANCH-NAME>/bash_scripts/run_evaluation_array.sh
  >   ```
  >
  > - To download the version of the file at a specific commit, replace `main` with `<COMMIT-HASH>` (note that `raw.githubusercontent.com` URLs take the commit hash directly, with no `blob/` segment):
  >
  >   ```
  >   https://raw.githubusercontent.com/SainsburyWellcomeCentre/crabs-exploration/<COMMIT-HASH>/bash_scripts/run_evaluation_array.sh
  >   ```
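  The two cases in the tip differ only in the ref segment of the URL, so it can be parameterised. A minimal sketch; `REF` and `URL` are hypothetical variable names, not part of the downloaded script:

  ```shell
  # REF can be a branch name ("main", "<BRANCH-NAME>") or a commit hash.
  REF="main"
  URL="https://raw.githubusercontent.com/SainsburyWellcomeCentre/crabs-exploration/${REF}/bash_scripts/run_evaluation_array.sh"
  # Pass $URL to curl as in the command above; here we just print it.
  echo "$URL"
  ```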
- **Edit the evaluation bash script if required**

  Ideally, we won't make major edits to the bash scripts. If we find we do, we may want to consider moving the relevant parameters to the config file, or making them a CLI argument.

  When launching an array job, we may want to edit the following variables in the bash script:

  - The `MLFLOW_CKPTS_FOLDER` and `CKPT_FILENAME` variables, which define which trained models we would like to evaluate. See the examples in the bash script comments for the syntax.
  - The number of trained models to evaluate needs to match the number of jobs in the array. To change the number of jobs in the array job, edit the line that starts with `#SBATCH --array=0-n%m` and set `n` to the total number of jobs minus 1. The variable `m` refers to the number of jobs that can run at a time.
  - The `MLFLOW_FOLDER`. By default, we point to the "scratch" folder at `/ceph/zoo/users/sminano/ml-runs-all/ml-runs-scratch`. This folder holds runs that we don't need to keep. For runs we would like to keep, we instead point to the folder at `/ceph/zoo/users/sminano/ml-runs-all/ml-runs`.
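  To make the `--array=0-n%m` convention concrete, here is a small sketch of the arithmetic (the variable names `NUM_MODELS` and `MAX_PARALLEL` are illustrative, not taken from the script):

  ```shell
  NUM_MODELS=4      # trained models to evaluate (one array task each)
  MAX_PARALLEL=2    # the "m" in 0-n%m: tasks allowed to run at once
  n=$((NUM_MODELS - 1))   # array indices are 0-based, so n = jobs - 1
  ARRAY_SPEC="#SBATCH --array=0-${n}%${MAX_PARALLEL}"
  echo "$ARRAY_SPEC"
  ```

  For 4 models with at most 2 running concurrently, the directive would read `#SBATCH --array=0-3%2`.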
  Less frequently, one may need to edit:

  - The `EVALUATION_SPLIT`, to select whether we want to evaluate on the `test` set or the `validation` set.
  - The `GIT_BRANCH`, if we want to use a specific version of the 🦀 package. Usually we will run the version of the 🦀 package in `main`.

  Note that the dataset information and the config file used are retrieved from the MLflow logs of the corresponding training job. The experiment name for MLflow is created as the experiment name of the training job with the suffix `_evaluation`. If we would like a specific config file to be used (for example, to increase the number of workers in the evaluation), we can create a new config file and pass it to the evaluate command using the `--config_file` flag.
  > **Caution**
  >
  > If we launch a job and then modify the config file before the job has been able to read it, we may be using an undesired version of the config in our job! To avoid this, it is best to wait until you can verify in MLflow that the job has the expected config parameters (and then edit the file to launch a new job if needed).
- **Run the evaluation array job using the SLURM scheduler**

  To launch the evaluation array job, use the `sbatch` command and pass the path to the script:

  ```
  sbatch <path-to-run-evaluation-array-script>
  ```
- **Check the status of the evaluation job**

  To do this, we can:

  - Check the SLURM logs: these should be created automatically in the directory from which the `sbatch` command is run.
  - Run supporting SLURM commands (see below).
  - Check the MLflow logs. To do this, first create or activate an existing conda environment with `mlflow` installed, and then run the `mlflow` command from the login node.

    - Create and activate a conda environment:

      ```
      module load miniconda
      conda create -n mlflow-env python=3.10 mlflow -y
      conda activate mlflow-env
      ```

    - Run `mlflow` to visualise the results logged to the `ml-runs` folder:

      - If using the "scratch" folder:

        ```
        mlflow ui --backend-store-uri file:////ceph/zoo/users/sminano/ml-runs-all/ml-runs-scratch
        ```

      - If using the selected runs folder:

        ```
        mlflow ui --backend-store-uri file:////ceph/zoo/users/sminano/ml-runs-all/ml-runs
        ```
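  Since the two `mlflow ui` invocations differ only in the backend store path, the choice can be scripted. A sketch; `RUNS_KIND` and `STORE_URI` are hypothetical names, while the folder paths are the ones given above:

  ```shell
  RUNS_KIND="scratch"   # "scratch" for throwaway runs, anything else for kept runs
  if [ "$RUNS_KIND" = "scratch" ]; then
      STORE_URI="file:////ceph/zoo/users/sminano/ml-runs-all/ml-runs-scratch"
  else
      STORE_URI="file:////ceph/zoo/users/sminano/ml-runs-all/ml-runs"
  fi
  # Print the command rather than launching the UI.
  echo "mlflow ui --backend-store-uri ${STORE_URI}"
  ```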
- **Supporting SLURM commands**

  - To check the status of your jobs in the queue:

    ```
    squeue -u <username>
    ```

  - To show details of the latest jobs (including completed or cancelled jobs):

    ```
    sacct -X -u <username>
    ```

  - To specify columns to display, use `--format` (e.g., `Elapsed`):

    ```
    sacct -X --format="JobID, JobName, Partition, Account, State, Elapsed" -u <username>
    ```

  - To check specific jobs by ID:

    ```
    sacct -X -j 3813494,3813184
    ```

  - To check the time limit of the jobs submitted by a user (for example, `sminano`):

    ```
    squeue -u sminano --format="%i %P %j %u %T %l %C %S"
    ```

  - To cancel a job:

    ```
    scancel <jobID>
    ```