This repository documents how structural MRI brain imaging data for a large number of subjects is preprocessed using the sMRIprep pipeline, which is provided as a Docker image. The approach described here scales up the preprocessing by using a scheduler that runs multiple Docker instances of the pipeline in parallel.
The documentation is divided into 5 sections:
- Downloading data
- Setting up Docker
- Setting up Task Spooler
- Monitoring & error handling
- Retrieving preprocessed data
Besides the documentation, this repository also contains useful shell scripts. The procedures and tools described here have been applied to T1w images from more than 3600 subjects of the Healthy Brain Network (HBN) dataset. This how-to uses the HBN dataset as an example case (referred to as the HBN project).
For functional MRI data, the closely related fMRIprep pipeline can be used; the same procedures and tools described here should apply.
The MRI and EEG data of the HBN dataset is stored on an AWS S3 bucket, where it is organized into folders by participant. All MRI data is compliant with the BIDS specification.
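For orientation, the anatomical data of a single subject is organized roughly as follows (the subject label and file names are illustrative; other modality folders, e.g. func and dwi, are present as well):
sub-NDARAB123CDE/
└── anat/
    ├── sub-NDARAB123CDE_T1w.nii.gz
    └── sub-NDARAB123CDE_T1w.json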
To visually inspect the data on the S3 bucket, a GUI-based file browser like Cyberduck can be used. For the actual download, it is more convenient to use a CLI-based tool like Rclone.
In the case of the HBN project, the MRI data was downloaded to the directory `/scratch/hbnetdata/MRI`.
After installation, first create a new configuration for a remote resource with the command `rclone config` and then select option `n) New remote`. For the HBN S3 bucket, the following configuration details need to be specified (the name can be chosen arbitrarily):
name = remote
type = s3
provider = AWS
region = us-east-1
acl = private
endpoint = s3.amazonaws.com
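After completing the interactive dialog, the resulting entry in the Rclone configuration file (typically `~/.config/rclone/rclone.conf`) should look roughly like this:
[remote]
type = s3
provider = AWS
region = us-east-1
acl = private
endpoint = s3.amazonaws.com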
To copy files from the remote resource, use the command `rclone copy`. This command can be used with the filtering option `--filter-from` to link to a filter file. This file is a simple text file where patterns to be excluded are indicated by lines starting with `-`, and patterns to be included by lines starting with `+`.
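As an example, a filter file that restricts the download to the anatomical T1w images could look like the following sketch (the patterns are illustrative and may need to be adapted to the actual file naming in the dataset):
# filter.txt: include T1w images and their JSON sidecars, exclude everything else
+ sub-*/anat/*T1w.nii.gz
+ sub-*/anat/*T1w.json
- *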
# Rclone copy
rclone copy remote:/fcp-indi/data/Projects/HBN/MRI/Site-SI /scratch/hbnetdata/MRI/
# Rclone copy with --filter-from option
rclone copy --filter-from=~/filter.txt remote:/fcp-indi/data/Projects/HBN/MRI/Site-SI /scratch/hbnetdata/MRI/
In the case of the HBN project, the shell script `download_data.sh` was used to start 4 concurrent `rclone copy` processes, one for each site.
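A minimal sketch of such a script, assuming the four HBN site folders are named `Site-SI`, `Site-RU`, `Site-CBIC` and `Site-CUNY`:
#!/bin/bash
# Sketch of download_data.sh: one background rclone copy per acquisition site.
# The site names are assumptions and may need to be adjusted to the bucket layout.
for site in Site-SI Site-RU Site-CBIC Site-CUNY; do
    rclone copy --filter-from="$HOME/filter.txt" \
        "remote:/fcp-indi/data/Projects/HBN/MRI/${site}" \
        /scratch/hbnetdata/MRI/ &
done
wait  # block until all four transfers have finished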
The sMRIprep pipeline is provided as a Docker image, so Docker needs to be installed in order to run the pipeline. In some cases, primarily for security reasons, the rootless version of Docker needs to be installed. In that case, it might be necessary to adjust the proxy settings in Docker's default configuration so that it can connect to the internet, which is required for pulling images from Docker Hub.
Once Docker is running, the image of the pipeline needs to be downloaded with the command `docker pull nipreps/smriprep:latest`. Then the pipeline can be started as a Docker container instance using the `docker run` command:
docker run --rm \
-v /scratch/hbnetdata/MRI:/data:ro \
-v /scratch/hbnetdata/derivatives:/output \
-v /scratch/hbnetdata/work:/work \
-v $HOME/license.txt:/opt/freesurfer/license.txt \
nipreps/smriprep:latest \
/data /output \
participant \
--participant-label sub-NDARxxxxxxxx \
--nprocs 4 \
--omp-nthreads 8 \
--mem-gb 8 \
--work-dir /work \
--output-spaces MNIPediatricAsym:cohort-2
The `-v` option of the `docker run` command is used to bind-mount a volume, which allows the container to access directories outside of it. For example, `-v /scratch/hbnetdata/MRI:/data:ro` maps the host directory `/scratch/hbnetdata/MRI` to the directory `/data` inside the container. The suffix `:ro` denotes that `/data` is mounted read-only.
The lines following `nipreps/smriprep:latest \` are the command and arguments for the actual container instance, i.e. the pipeline. There are 3 positional arguments that need to be specified: `bids_dir`, `output_dir` and `analysis_level`. Here the `bids_dir` is set to `/data`, which is mapped to `/scratch/hbnetdata/MRI` containing the raw participant data. The `output_dir` is set to `/output`, which is mapped to `/scratch/hbnetdata/derivatives`. The `analysis_level` is set to `participant`, the standard per-subject preprocessing mode.
A container instance should only be assigned a single subject. Assigning multiple subjects results in performance drops and a longer duration per subject, and can even block the entire pipeline, for example when one subject in the queue takes extremely long to preprocess or does not terminate at all.
The options `--nprocs`, `--omp-nthreads` and `--mem-gb` specify which hardware resources the pipeline is allowed to use within the container. As part of the HBN project, 4 (logical) CPU cores, 8 threads and 8 GB of RAM were used. For future projects, configurations with more threads (16 or 32) should also be compared.
A working directory can be specified with the `--work-dir` option and also needs to be bind-mounted beforehand with the `-v` option as part of the `docker run` command. In practice, when working with a large number of subjects to be preprocessed, the working directory grows rapidly in size and should be emptied periodically.
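For example, the working directory can be emptied with a command like the following (only do this while no containers are running, since active jobs keep intermediate files there):
# Remove all contents of the working directory (run only when no jobs are active)
rm -rf /scratch/hbnetdata/work/*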
Please consult the sMRIprep usage page for more details on additional arguments.
Useful Docker commands and options:
# Get system-wide information
docker info
# Automatically remove the container when it exits
docker run --rm
# Keep stdin open and allocate a pseudo-TTY, e.g. for an interactive `bash` shell in the container
docker run -it
# Run container in background and print container ID; results in stdout stream from container not being shown in shell
docker run --detach
# List containers
docker ps
docker ps -f "status=exited"
# Display a live stream of container(s) resource usage statistics
docker stats
# Stop and remove container
docker container stop [OPTIONS] CONTAINER [CONTAINER...]
docker rm [OPTIONS] CONTAINER [CONTAINER...]
# Remove image
docker image rm [OPTIONS] IMAGE [IMAGE...]
Task Spooler is a lightweight scheduler for UNIX-based systems that can be used to run multiple container instances of the sMRIprep pipeline in parallel. After installation, use the command `ts` to start the scheduler.
In the case of the HBN project, the workstation provided had a CPU with 128 cores (equaling 256 CPU threads with Hyper-Threading enabled) and 1024 GB of RAM. Given that a single container requires 4 CPU threads, it is possible to run 64 instances in parallel. To keep a safety buffer, 60 parallel jobs were chosen instead of 64.
A new job can be added to the queue with the command `ts [command]`. As soon as a job finishes, the next one is fetched from the queue and executed until the queue is empty. An individual job terminates with an exit code: `0` indicates successful termination, `1` indicates termination with an error, and `-119` is returned when the job is stopped manually. The latter is the case when the command `docker stop [container]` is used to stop a container instance and, as a result, the superordinate scheduler job.
The shell script `add-jobs.sh` was used to add jobs to the queue automatically. The script iterates over a list of subject IDs and executes `ts docker run [...]` for every ID.
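A minimal sketch of such a script, assuming a text file `subject-ids.txt` with one subject ID per line (without the `sub-` prefix):
#!/bin/bash
# Sketch of add-jobs.sh: queue one sMRIprep container per subject with Task Spooler.
# subject-ids.txt is assumed to contain one ID per line, e.g. NDARAB123CDE.
while read -r subject; do
    ts docker run --rm \
        -v /scratch/hbnetdata/MRI:/data:ro \
        -v /scratch/hbnetdata/derivatives:/output \
        -v /scratch/hbnetdata/work:/work \
        -v "$HOME/license.txt":/opt/freesurfer/license.txt \
        nipreps/smriprep:latest \
        /data /output participant \
        --participant-label "$subject" \
        --nprocs 4 --omp-nthreads 8 --mem-gb 8 \
        --work-dir /work \
        --output-spaces MNIPediatricAsym:cohort-2
done < subject-ids.txt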
Useful Task Spooler commands:
# Start task spooler application and/or show overview
ts
# Specify number of jobs running in parallel
ts -S 60
# Get information for specific job with id
ts -i 1
# Clear list of finished tasks (does not reset index)
ts -C
# Kill ts scheduler (does reset index when starting ts again)
ts -K
The preprocessing operation can be monitored using the `ts` command, which shows a list of the jobs currently running and those still left in the queue. For details on a specific job, use the `ts -i` command. In addition, the `docker ps` command can be used to list all active containers and see how long they have been running.
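A formatted `docker ps` call, for example, shows the running time per container at a glance:
# Show container names together with their running time and status
docker ps --format "table {{.Names}}\t{{.RunningFor}}\t{{.Status}}"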
While a single subject took on average 8-10 hours to preprocess with the given hardware as part of the HBN project, some subjects can take very long (multiple days) to finish, which blocks computing resources for an extended period and slows down the overall operation. The usual reason is low-quality imaging data, e.g. due to head motion or other acquisition-related factors. To mitigate this problem, a policy can be introduced and implemented either manually or automatically.
One such policy is to discard jobs that have been running for more than a fixed amount of time. The `docker ps` command can be used to detect affected containers, which are then discarded with the `docker stop [container]` command, resulting in the corresponding job terminating with an exit code of `-119`. This procedure can be automated with a cron job.
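A sketch of such an automation, assuming a maximum runtime of 48 hours (the threshold and the hourly cron schedule are arbitrary choices):
#!/bin/bash
# Sketch of a watchdog script that stops sMRIprep containers running longer than MAX_HOURS.
# Example crontab entry (path is hypothetical): 0 * * * * /home/user/stop_long_jobs.sh
MAX_HOURS=48
now=$(date +%s)
for id in $(docker ps -q --filter "ancestor=nipreps/smriprep:latest"); do
    started=$(docker inspect --format '{{.State.StartedAt}}' "$id")
    hours=$(( (now - $(date -d "$started" +%s)) / 3600 ))
    if [ "$hours" -ge "$MAX_HOURS" ]; then
        echo "Stopping container $id (running for ${hours}h)"
        docker stop "$id"
    fi
done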
Other useful shell commands:
# Find all subjects that have (smriprep report) html file and store ids in txt file
ls *.html | awk -F '.' '{print $1}' | awk -F '-' '{print $2}' > html-subjects.txt
# Delete subject directories given file with subject ids
cat subjects_2_exclude.txt | awk '{print "/scratch/hbnetdata/MRI/sub-" $0}' | xargs -I{} rm -r {}
# List subjects where ts job terminated with return code 1 (assuming jobs are listed in ts queue)
ts | grep -E '^.*\s{3}1\s{3}.*$' | awk -F ' ' '{print $1}' | xargs -I{} ts -i {} | grep -oP '(--participant-label )\w+' | awk -F ' ' '{print $2}'
The derivatives for all subjects are stored in two folders: `freesurfer` and `smriprep`.
FreeSurfer-related statistics can be generated as CSV files with the shell script `fs_stats_generation.sh`. The files generated by the script can be combined into a single dataframe using the notebook `merge_fs_stats_files.ipynb`.
Imaging files located within the `smriprep` folder can be copied to a specified target location using the script `copy_images.sh`. To check whether the copied files are intact, use the `integrity_check.sh` script.
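For reference, a checksum-based integrity check could look like the following sketch (this is a generic illustration, not necessarily how `integrity_check.sh` is implemented; the target path is a placeholder):
#!/bin/bash
# Compare MD5 checksums of the copied files against the originals.
SRC=/scratch/hbnetdata/derivatives/smriprep   # source folder from the example above
DST=/mnt/archive/smriprep                     # placeholder target location
(cd "$SRC" && find . -type f -exec md5sum {} + > /tmp/src_checksums.md5)
(cd "$DST" && md5sum --quiet -c /tmp/src_checksums.md5) && echo "All copied files are intact"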