Processes (steps) of a pipeline are executed independently of each other and communicate via channels (asynchronous FIFO queues).
Each process can have one or more channels as input and output.
A single Nextflow script defines every process together with its inputs & outputs. The order the processes are written in does not determine the order they are executed in.
The communication links (channels) between processes set the order in which processes are executed.
The configuration file defines the target execution platform (local computer, HPC or cloud).
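For reference, a minimal nextflow.config sketch showing how the configuration selects the execution platform (the values below are illustrative, not from this project):
process.executor = 'slurm'   // or 'local', 'awsbatch', ...
process.queue    = 'cpu'     // hypothetical queue name
process.memory   = '8 GB'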
submit with screen or tmux:
screen -S nextflow -m bash -c 'sh run.sh; exec sh'
https://github.com/chris-cheshire/linux-cheat-sheet
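A rough tmux equivalent of the screen command above (the session name is just an example):
tmux new-session -d -s nextflow 'sh run.sh'   # start the run in a detached session
tmux attach -t nextflow                       # reattach later to check progress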
submit nextflow pipelines as an sbatch:
mlnf # alias ml Nextflow/19.10.0 Singularity/2.6.0-foss-2016b Graphviz
cd $baseDir
sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --wrap="nextflow run main-test.nf -resume" --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk
https://nf-co.re/usage/configuration
https://github.com/nf-core/configs/blob/master/docs/crick.md
https://www.dropbox.com/s/fv4uucjxm4ehckj/Phil%20Ewels%20-%20nf-core%20London%202020%20tips%20tricks.pdf?dl=0
mlnf
cd /camp/home/ziffo/home/projects/astrocyte-fractionation-bulk-rnaseq/nf-core-rnaseq
sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk --wrap=\
"nextflow run nf-core/rnaseq \
-profile crick \
--reads '/camp/home/ziffo/home/projects/astrocyte-fractionation-bulk-rnaseq/reads/*_R{1,2}.fastq.gz' \
--fasta '/camp/home/ziffo/home/genomes/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa' \
--transcript_fasta '/camp/home/ziffo/home/genomes/ensembl/Homo_sapiens.GRCh38.cdna.all.fa' \
--gtf '/camp/home/ziffo/home/genomes/annotation/Homo_sapiens.GRCh38.99.gtf' \
--star_index '/camp/home/ziffo/home/genomes/ensembl/GRCh38.99_ensembl_STAR_index' \
--saveReference \
--reverseStranded \
--aligner star --saveAlignedIntermediates \
--pseudoaligner salmon \
--outdir '/camp/home/ziffo/home/projects/astrocyte-fractionation-bulk-rnaseq/nextflow' \
-with-singularity /camp/apps/misc/stp/babs/nf-core/singularity/rnaseq/1.4.2/nfcore-rnaseq-1.4.2.img \
-resume"
10292126
with additions: VASTools, kallisto
sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk --wrap=\
"nextflow run rnaseq-master/main.nf \
-profile crick \
--reads '/camp/home/ziffo/home/projects/nextflow-test/reads/*_R{1,2}.fastq.gz' \
--fasta '/camp/home/ziffo/home/genomes/sequences/human/Homo_sapiens.GRCh38.dna.alt.fa' \
--transcript_fasta '/camp/home/ziffo/home/genomes/index/kallisto/Homo_sapiens.GRCh38.cdna.all.fa' \
--gtf '/camp/home/ziffo/home/genomes/annotation/Homo_sapiens.GRCh38.99.gtf' \
--saveReference \
--reverseStranded \
--aligner star --saveAlignedIntermediates \
--pseudoaligner salmon \
--outdir '/camp/home/ziffo/home/nextflow/nf-core/rnaseq' \
-with-singularity /camp/apps/misc/stp/babs/nf-core/singularity/rnaseq/1.4.2/nfcore-rnaseq-1.4.2.img \
-resume"
- Fork the GitHub page & clone it to the nf-core working directory on CAMP
- Edit the Conda env by adding to the environment.yml file
add kallisto & vcftools bioconda packages & specify versions:
- kallisto=0.46.2
- vcftools=0.1.16
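For context, a minimal environment.yml sketch with the two extra packages added (the name and the other entries are placeholders, not the real nf-core file):
name: nfcore-rnaseq-custom
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  # ...existing pipeline dependencies stay as they are...
  - kallisto=0.46.2
  - vcftools=0.1.16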
- Edit the Dockerfile
add vast-tools with
RUN git clone https://github.com/vastgroup/vast-tools.git && cd vast-tools/ && echo -ne "y\nHsa" | ./install.R
ENV PATH ~/bin/vast-tools:$PATH
RUN echo 'export PATH=~/bin/vast-tools:$PATH' >> ~/.bashrc
These commands are from https://github.com/vastgroup/vast-tools
- Build Docker Image:
Make a login account on docker hub online https://hub.docker.com/?ref=login
Ask Chris Cheshire to add you to luslab on docker
Install Docker on mac https://hub.docker.com/?ref=login NB big file 675MB!
Log into your account on luslab on docker desktop
Open Terminal on local Mac
mkdir -p bioinformatics/docker/rnaseq && cd bioinformatics/docker/rnaseq
rsync -aP camp-ext:/camp/home/ziffo/home/nextflow/nf-core/rnaseq/rnaseq/environment.yml .
rsync -aP camp-ext:/camp/home/ziffo/home/nextflow/nf-core/rnaseq/rnaseq/Dockerfile .
Use Chris's cheat sheet for the commands to build the docker image: https://github.com/chris-cheshire/linux-cheat-sheet
docker build -t luslab/oli-nfcore-rnaseq-dev:1.0 /Users/ziffo/bioinformatics/docker/rnaseq # build new image from docker file
docker images # View local images
docker run -it luslab/oli-nfcore-rnaseq-dev:1.0 /bin/bash # Run a container with an interactive shell
docker push luslab/oli-nfcore-rnaseq-dev:1.0 # Pushes local image to luslab docker hub
- Edit main.nf on the GitHub repo luslab/oz-bulk-rnaseq
Add processes at the end of script
edit STAR memory
edit vast-tools merge file & point main.nf vast-tools process to this
commit changes & push to Github on Github desktop
- Pull the docker .img (container) with singularity to the appropriate folder
https://singularity.lbl.gov/docs-pull
mlnf
cd /camp/home/ziffo/home/nextflow/singularity
sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk --wrap="singularity pull --name luslab-oli-nfcore-rnaseq-dev-1.0.img docker://luslab/oli-nfcore-rnaseq-dev:1.0"
sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk --wrap="docker pull luslab/oli-nfcore-rnaseq-dev:1.0"
- Run the Nextflow command as an sbatch:
mlnf
cd /camp/home/ziffo/home/projects/astrocyte-fractionation-bulk-rnaseq/oli-nfcore-rnaseq-dev
nextflow pull ojziff/rnaseq ## UPDATE PIPELINE
## RUN PIPELINE
sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk --wrap=\
"nextflow run ojziff/rnaseq \
-with-singularity /camp/home/ziffo/home/nextflow/singularity/luslab-oli-nfcore-rnaseq-dev-1.0.img \
-profile crick \
--reads '/camp/home/ziffo/home/projects/astrocyte-fractionation-bulk-rnaseq/reads/*_R{1,2}.fastq.gz' \
--reverseStranded \
--fasta '/camp/home/ziffo/home/genomes/sequences/human/Homo_sapiens.GRCh38.dna.alt.fa' \
--transcript_fasta '/camp/home/ziffo/home/genomes/index/kallisto/Homo_sapiens.GRCh38.cdna.all.fa' \
--gtf '/camp/home/ziffo/home/genomes/annotation/Homo_sapiens.GRCh38.99.gtf' \
--bed12 '/camp/home/ziffo/home/genomes/annotation/Homo_sapiens.GRCh38.99.bed' \
--saveReference \
--aligner star --saveAlignedIntermediates \
--pseudoaligner salmon \
--outdir '/camp/home/ziffo/home/projects/astrocyte-fractionation-bulk-rnaseq/oli-nfcore-rnaseq-dev' \
--email oliver.ziff@crick.ac.uk \
-resume"
export NXF_WORK="/camp/home/ziffo/home/nextflow/projects/scrnaseq-repro"
Containers & images are huge so need to be pruned regularly.
docker container prune # removes all stopped containers
docker image prune -a # removes all unused images; use docker image rm <image ID> to remove an individual image
https://github.com/nf-core/scrnaseq/blob/master/main.nf
mlnf
cd /camp/home/ziffo/home/nextflow/nf-core/scrnaseq
sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --wrap="nextflow run nf-core/scrnaseq \
--reads '/camp/stp/babs/outputs/gandhi-patani/doaa.taha/fastq_pooled/**/*_R{1,2}_001.fastq.gz' \
--fasta '/camp/home/ziffo/home/genomes/sequences/human/Homo_sapiens.GRCh38.dna.alt.fa' \
--gtf '/camp/home/ziffo/home/genomes/annotation/Homo_sapiens.GRCh38.99.gtf' \
--transcriptome_fasta '/camp/home/ziffo/home/genomes/index/kallisto/Homo_sapiens.GRCh38.cdna.all.fa' \
--save_reference \
--email oliver.ziff@crick.ac.uk \
-profile crick" --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk
To use a pre-pulled container, add to the wrapped nextflow command:
-with-singularity /camp/home/ziffo/home/nextflow/nf-core/scrnaseq/work/singularity/nfcore-scrnaseq-1.0.0.img
Pull the image first with:
singularity pull --name nfcore-scrnaseq-1.0.0.img docker://nfcore/scrnaseq:1.0.0
-profile
nextflow run main-test.nf -resume
crick profile config
https://github.com/nf-core/configs/blob/master/conf/crick.config
more .nextflow.log
more .command.sh
March 2020
Define processes within the pipeline.
params.str = 'Hello world!'

process splitLetters {
    output:
    file 'chunk_*' into letters

    """
    printf '${params.str}' | split -b 6 - chunk_
    """
}

process convertToUpper {
    input:
    file x from letters.flatten()

    output:
    stdout result

    """
    cat $x | tr '[a-z]' '[A-Z]'
    """
}

result.view { it.trim() }
Save script as filename.nf
nextflow run filename.nf
It will output something similar to:
N E X T F L O W ~ version 19.04.0
executor > local (3)
[69/c8ea4a] process > splitLetters [100%] 1 of 1 ✔
[84/c8b7f1] process > convertToUpper [100%] 2 of 2 ✔
HELLO
WORLD!
Processes are executed in parallel where possible.
The [69/c8ea4a] numbers refer to each unique process. These are the prefixes of the directories where each process is executed. You can inspect them in $PWD/work.
## RUN PIPELINE
nextflow run nf-core/atacseq \
-r 1.1.0 \
--input "design-atac.csv" \
-profile crick \
-with-singularity /camp/apps/misc/stp/babs/nf-core/singularity/atacseq/1.1.0/nfcore-atacseq-1.1.0.img \
--email oliver.ziff@crick.ac.uk \
--genome BDGP6 \
--save_reference \
-resume
Only processes that have changed are re-executed. For processes that have not changed, the cached result is used, which saves time rerunning the whole pipeline.
After editing re-execute the script with:
nextflow run tutorial.nf -resume
It will print output similar to this:
N E X T F L O W ~ version 19.04.0
executor > local (2)
[69/c8ea4a] process > splitLetters [100%] 1 of 1, cached: 1 ✔
[d0/e94f07] process > convertToUpper [100%] 2 of 2 ✔
olleH
!dlrow
$PWD/work caches all results, which takes up a lot of space - clean the folder regularly.
Set parameters with -- followed by the parameter name and then the value:
nextflow run tutorial.nf --str 'Bonjour le monde'
The nextflow.config file needs to be in the pipeline execution directory. The file defines which executor to use, environmental variables, and pipeline parameters.
nextflow pull nf-core/atacseq
Course notes for DAY 1 https://www.seqera.io/training/day1/index.html
2nd March 2020
ssh ubuntu@ec2-34-253-38-90.eu-west-1.compute.amazonaws.com
Singularity
Docker containers
Executors: cloud or own infrastructure
Error recovery: resume jobs
New features with DSL 2.0: modular components
Automation - push the button pipeline
embarrassingly parallel: spawn 100s of jobs over a distributed cluster
mash up different tools & scripts (dependencies)
dependency trees can be very complex: processes (circles) connected by arrows. E.g. 70 process tasks.
Reproducibility in computational biology - different libraries, different software & execution environments, rounding errors
Challenges: reproducible; portable; scalable; usable; traceable.
Fundamentals for scalable workflows:
takes code in any language: R, bash
define software dependencies: conda, docker
define the dataflow programme
version control: github
deploy in a cloud: google cloud, aws, platform computing
Fast prototyping: custom DSL. Enables task composition. Written in Groovy (like Python for Java).
Easy parallelisation: determined by what is in the channels
Self-contained: each task is independent - separate sections can run in different environments.
Portable deployments
// example alignment process (the process name is illustrative)
process align {
    input:
    file 'ref.fa' from genome_channel
    file 'sample.fq' from reads_ch

    output:
    file 'sample.bam' into bam_ch

    script:
    """
    bwa mem ref.fa sample.fq | samtools sort -o sample.bam
    """
}
output of one process is the input of the next process.
dataflow: input > splitting > tasks > map > filter > collect > result
channel = contains elements
process = executed for each element; each execution = a task
implicit parallelism: a process runs as many times as there are files delivered to it (e.g. with a wildcard *)
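A minimal sketch of implicit parallelism (paths and process name are illustrative): the process below runs once per fastq file emitted by the channel.
reads_ch = Channel.fromPath('data/*.fq')

process countLines {
    input:
    file x from reads_ch

    """
    wc -l $x
    """
}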
Channel.fromFilePairs("*_{1,2}.fq")
( iPSC, [ipsc_1.fq, ipsc_2.fq])
local execution: nextflow - docker/singularity container - OS - local storage
centralised orchestration: nextflow command > each task becomes a cluster job.
distributed orchestration: a single job is submitted to the cluster, and nextflow splits jobs as required.
process {
    executor = ...
    queue = ...
}
Docker or Singularity (also Podman)
faster startup. lighter.
layers. can be built upon.
transparent - see how they are built
always use containers. makes things much easier.
CPU usage, time >> optimise pipeline
QC look at tasks. shows what takes longer.
DAG visualisation: shows the pipeline ordering.
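These reports can be generated with Nextflow's built-in tracing options, e.g.:
nextflow run main.nf -with-report report.html -with-trace -with-timeline timeline.html -with-dag flowchart.png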
Text editors: Atom, Visual studio code - has ssh client & nextflow highlight & docker
nf-core: a community effort to collect production-ready analysis pipelines built with Nextflow
Files are saved to directories for each task. The naming of each directory is noted in the Nextflow output in the terminal.
Sub-task files are put into sub-directories.
Errors can be investigated by going to the work directory of the failed task; get the directory ID from the Nextflow output.
more .command.sh
nextflow run script.nf
ubuntu@ip-172-31-40-29:~/nf-course/day1$ nextflow run simple.nf
N E X T F L O W ~ version 20.01.0
Launching `simple.nf` [agitated_cantor] - revision: d5b8b09fe6
executor > local (3)
[a8/4a74ce] process > splitLetters [100%] 1 of 1 ✔
[81/bb9bde] process > convertToUpper [100%] 2 of 2 ✔
olleH
!dlrow
The working directories are a8/4a74ce and 81/bb9bde.
letters.flatten().view()
at the end of the script views the content of the channel - helpful for debugging.
letters.flatten().dump()
prints contents of all channels
Nextflow scripts are written in Groovy language (like python). http://docs.groovy-lang.org/latest/html/groovy-jdk/java/util/List.html
print with println
comments // (not #)
assign variables with =
e.g. x = 1
define local variables with def
lists are with [ list ]
list = [10,20,30]
access element of list with [ ] or list.get(0) // the first element is index 0 in Groovy
maps are like lists but the key can be any value (not just an integer index).
map = [a:0, b:1, c:2]
strings defined in single or double quotes
multiline strings use triple double quotes
"""
Hello
Here here
"""
if statement:
if( <statement> ) {
    // true branch
} else {
    // false branch
}
Checking whether a list is empty:
if( list ) {
    println list
} else {
    println 'The list is empty'
}
// or equivalently
println( list ?: "The list is empty" )
for loop syntax
for (int i = 0; i <3; i++) {
println("Hello World $i")
}
Closures: define a function like a variable.
square = { it * it }
square function
println square.call(5)
prints the result of 5 * 5
list = [1,2,3]
list.each { println (it * it) }
z = list.collect { it * it }   // saves the result of the closure to variable z
nextflow run script.nf -resume
avoids rerunning the whole script from scratch
params.greeting = 'Hello World'
params.outdir = 'some/path'
Change greeting: nextflow run simple.nf --greeting "Hola Mundo"
Note that pipeline parameters such as --greeting use a double dash, whereas Nextflow options such as -resume use a single dash.
greeting_ch = Channel.from(params.greeting)
Channels logically connect tasks to each other.
A task doesn't need to wait for the completion of all other tasks; it will run as soon as its input files are available.
Only 1 task can consume a channel.
2 types of channels:
- Queue = asynchronous unidirectional FIFO connecting 2 processes
- Value = singleton channel. Bound to single value. Can be read unlimited times. Not consumed.
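A quick sketch of the two channel types:
queue_ch = Channel.from(1, 2, 3)        // queue channel: each value consumed once
value_ch = Channel.value('GRCh38')      // value channel: bound to a single value, read many times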
Channel.fromPath creates a queue channel. Use a wildcard to take in all matching files:
ch = Channel.fromPath('data/ggal/*.fq')
ch.view()
Can take in all subdirectories:
ch = Channel.fromPath('data/**/*.fq')
ch.view()
fromFilePairs handles pairs of files:
Channel
.fromFilePairs('/my/data/SRR*_{1,2}.fastq')
.println()
The {1,2} wildcard takes files ending in 1 or 2.
Channel
.fromFilePairs('data/ggal/*_{1,2}.fq')
.println()
returns all fastq files from a single SRA ID
Channel
.fromSRA('SRP043510')
.view()
The script block is the command that runs the process. Each process has one script block.
The default is a bash script but it can be changed to Python, R, etc.
process example {
    script:
    """
    blastp -db /data/blast -query query.fa -outfmt 6 > blast_result
    cat blast_result | head -n 10 | cut -f 2 > top_hits
    blastdbcmd -db /data/blast -entry_batch top_hits > sequences
    """
}

process pyStuff {
    script:
    """
    #!/usr/bin/env python
    x = 'Hello'
    y = 'world!'
    print "%s - %s" % (x, y)
    """
}
Print the task stdout by adding echo true to the process, or add -process.echo true on the command line.
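A minimal sketch (DSL1-style syntax) of a process printing its stdout:
process sayHello {
    echo true

    script:
    """
    echo Hello from inside the task
    """
}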
Bash variables can be set inside the script block (e.g. FASTQ=.fastq), but the $ must be escaped with \ so Nextflow does not interpret it as a Nextflow variable:
"""
X=1
echo Hello \$X
"""
}
The default aligner in this example is kallisto, but it can be changed to salmon with --aligner salmon
params.aligner = 'kallisto'
process foo {
    script:
    if( params.aligner == 'kallisto' )
        """
        kallisto --reads /some/data.fastq
        """
    else if( params.aligner == 'salmon' )
        """
        salmon --reads /some/data.fastq
        """
    else
        throw new IllegalArgumentException("Unknown aligner $params.aligner")
}
syntax:
input:
<input qualifier> <input name> from <source channel>
val is the input qualifier
num = Channel.from( 1, 2, 3 )
process basicExample {
    input:
    val x from num

    """
    echo process job $x
    """
}
reads = Channel.fromPath( 'data/ggal/*.fq' )
process foo {
input:
file 'sample.fastq' from reads
script:
"""
echo your_command --reads sample.fastq
"""
}
See the links to these files in work directory:
tree work/
With the tuple qualifier a sample_id can be passed along with the files, so the channel carries both the metadata & the actual data, as sketched below.
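A sketch of a tuple input carrying the sample ID alongside the read files (paths and process name are illustrative):
pairs_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')

process foo {
    input:
    tuple val(sample_id), file(reads) from pairs_ch

    script:
    """
    echo your_command --sample $sample_id --reads $reads
    """
}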
4 processes:
Index a transcriptome file
QC with FastQC
Quantification
MultiQC report
Index
params.reads = "$baseDir/data/ggal/*_{1,2}.fq"
params.transcriptome = "$baseDir/data/ggal/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"
println "reads: $params.reads"
nextflow run script1.nf --reads
cat .nextflow.log
nextflow run script2.nf -with-docker
-with-docker brings in the container from the config:
process.container = 'nextflow/rnaseq-nf'
docker.runOptions='-u $(id -u):$(id -g)'
Add the following to the config so that -with-docker is not needed every time:
docker.enabled = true
tree work/
shows all files in directories in work/
$task.cpus can be specified outside of the script, which makes it easy to change the number of CPUs.
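For example (a sketch, the tool call is illustrative), declare cpus as a directive and reference it inside the script:
process index {
    cpus 4

    script:
    """
    salmon index --threads $task.cpus -t transcriptome.fa -i index
    """
}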
.collect
will group all the files from a channel into a single item so that one task receives them all in the same working directory
.mix
will combine channels
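Quick sketches of both operators:
Channel.from(1, 2, 3).collect().view()                  // emits [1, 2, 3] as a single item
Channel.from(1, 2).mix(Channel.from('a', 'b')).view()   // emits 1, 2, a, b (order not guaranteed)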
mkdir bin/
add the long R script to bin/
chmod +x script.sh # make executable
Nextflow will automatically add bin/ to the PATH
so the script can be removed from the Nextflow file & called as an external script - just pass it the input (see the sketch below).
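A sketch of calling an external R script that lives in bin/ (the script name and channel are hypothetical):
process makePlot {
    input:
    file counts from counts_ch

    output:
    file 'plot.pdf'

    script:
    """
    my_plot.R $counts plot.pdf    # my_plot.R sits in bin/ and must be executable (chmod +x)
    """
}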
repository
data/
docker
nextflow.config
main.nf
nextflow run nextflow-io/rnaseq-nf -with-docker
# pulls the code & docker
nextflow run -r v1.2
can be a version, revision, commit ID.
Docker client is a service that runs in the background. Finds image by name. If not on local machine it can pull it from remote Docker hub >> creates a container from the image >> runs command
NB security concerns with docker permissions
docker run -it debian:jessie
opens an interactive shell inside the container
exit
to return to the host shell
Script to create an image (Dockerfile):
FROM debian:jessie
MAINTAINER <your name>
RUN apt-get update && apt-get install -y curl cowsay
ENV PATH=$PATH:/usr/games/
RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
 && mv /salmon-*/bin/* /usr/bin/ \
 && mv /salmon-*/lib/* /usr/lib/
Always build containers in empty directories.
mkdir docker  # create the Dockerfile inside this directory (e.g. open it with: code Dockerfile)
Build the docker image:
docker build -t my-image .
The image tag defaults to latest: Successfully tagged my-image:latest
Run interactively inside of container:
docker run -it my-image bash
Use the container's software & files to run a command. All files needed by the command must be accessible inside the container, so mount the current file system into the container and run as the current user with -u $(id -u):$(id -g):
docker run --volume $HOME:$HOME --workdir $PWD -u $(id -u):$(id -g) my-image salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index
publish container publically to docker hub account: tag it with Docker username, then push
docker tag my-image <user-name>/my-image
docker push <user-name>/my-image
download a container with pull
docker pull <user-name>/my-image
nextflow run script2.nf -with-docker my-image
See the nextflow commands that were run:
cd /work/nextflow_ID_directory/
more .command.run
Singularity is a secure version of Docker for use on HPCs: permissions come from the user, not from root.
Bootstrap: docker
From: debian:jessie

%environment
    export PATH=$PATH:/usr/games/

%labels
    AUTHOR <your name>

%post
    apt-get update && apt-get install -y locales-all curl cowsay
    curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
    && mv /salmon-*/bin/* /usr/bin/ \
    && mv /salmon-*/lib/* /usr/lib/
Build the image from the definition file with:
sudo singularity build my-image.sif Singularity
Run container:
singularity exec my-image.sif cowsay 'Hello Singularity'
Run interactively from inside container:
singularity shell my-image.sif
exit
to return to the host shell
It is best to write images in Docker rather than Singularity; Singularity will convert Docker images automatically:
run with nextflow
nextflow run script7.nf -with-singularity nextflow/rnaseq-nf
this creates a singularity directory if not specified
https://sylabs.io/
https://cloud.sylabs.io/home
Conda environment is defined in a YAML file:
# CONDA ENVIRONMENT FILE
name: nf-tutorial
channels:
- defaults
- bioconda
- conda-forge
dependencies:
# Default bismark
- salmon=0.8.2
- fastqc=0.11.5
- multiqc=1.5
create the conda environment:
conda env create --file env.yml
activate conda environment:
conda activate nf-tutorial
run nextflow with conda:
nextflow run script7.nf -with-conda /home/ubuntu/miniconda2/envs/nf-tutorial
Best to put the conda environment inside the Dockerfile with COPY conda.yml:
FROM continuumio/miniconda:4.7.12
MAINTAINER Paolo Di Tommaso <paolo.ditommaso@gmail.com>
RUN apt-get -y install ttf-dejavu
COPY conda.yml .
RUN conda env update -n root -f conda.yml \
 && conda clean -a
RUN apt-get install -y procps
This docker image has miniconda >> add conda environment
Nextflow looks for nextflow.config in:
- the current directory
- then in the script base directory
- then in $HOME/.nextflow/config
propertyOne = 'world'
anotherProp = "Hello $propertyOne"
customPath = "$PATH:/my/app/folder"
access any variable defined in the host environment such as $PATH, $HOME, $PWD, etc.
Add Comments:
// comment a single config file
/* a comment spanning multiple lines */
Define properties either with dot prefix:
alpha.x = 1
alpha.y = 'string value…'
Or group together:
beta {
    p = 2
    q = 'another string …'
}
Define params:
// workflow script
params.foo = 'Hello'
params.bar = 'world!'

// print both params
println "$params.foo $params.bar"
Can change params from the command line eg change foo to Hola
nextflow run params.nf --foo Hola
The env scope defines variables that are exported to the environment where the workflow tasks are executed:
env.ALPHA = 'some value'
env.BETA = "$HOME/some/path"
Settings for the task execution such as cpus, memory, container and other resources can be defined in the pipeline script:
process {
cpus = 10
memory = 8.GB
container = 'biocontainers/bamtools:v2.4.0_cv3'
}
better to define these in the config than the nextflow script
process { memory = { 4.GB * task.cpus } }
specify container image:
process.container = 'nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266'
docker.enabled = true
or with singularity
process.container = '/some/singularity/image.sif'
singularity.enabled = true
Use singularity to pull docker image:
process.container = 'nextflow/rnaseq-nf'
singularity.enabled = true
The following protocols are supported:
- library:// downloads the container image from the Singularity Library service
- shub:// downloads the container image from Singularity Hub
- docker:// downloads the container image from Docker Hub and converts it to the Singularity format
- docker-daemon:// pulls the container image from a local Docker installation and converts it to a Singularity image file
A Conda environment can be provided in the configuration file by adding the following setting in nextflow.config:
process.conda = "/home/ubuntu/miniconda2/envs/nf-tutorial"
path is to either the conda directory or the YAML file
To run your pipeline with a batch scheduler, modify the nextflow.config file specifying the target executor and the required computing resources if needed. For example:
process.executor = 'slurm'
Specify the amount of resources (i.e. cpus, memory, execution time, etc.) required by each task. Use the process scope to define the resource requirements for all processes in your workflow applications. For example:
process {
executor = 'slurm'
queue = 'short'
memory = '10 GB'
time = '30 min'
cpus = 4
}
Different tasks need different amounts of computing resources. It is possible to define the resources for a specific task using the selector withName: followed by the process name:
process {
    executor = 'slurm'
    queue = 'short'
    memory = '10 GB'
    time = '30 min'
    cpus = 4

    withName: foo {
        cpus = 4
        memory = '20 GB'
        queue = 'short'
    }
    withName: bar {
        cpus = 8
        memory = '32 GB'
        queue = 'long'
    }
}
When a workflow is composed of many processes it can be arduous to list all the process names in the config to specify the resources for each of them. It is better to annotate the processes with a label, then specify the resources in the config using withLabel for all processes having the same label.
workflow script:
process task1 {
    label 'long'

    """
    first_command --here
    """
}

process task2 {
    label 'short'

    """
    second_command --here
    """
}
config:
process {
    executor = 'slurm'

    withLabel: 'short' {
        cpus = 4
        memory = '20 GB'
        queue = 'alpha'
    }
    withLabel: 'long' {
        cpus = 8
        memory = '32 GB'
        queue = 'omega'
    }
}
different container for each process in your workflow:
process {
    withName: foo { container = 'some/image:x' }
    withName: bar { container = 'other/image:y' }
}
docker.enabled = true
Configuration files can contain the definition of one or more profiles. A profile is a set of configuration attributes that can be activated/chosen when launching a pipeline execution by using the -profile command line option. Configuration profiles are defined by using the special scope profiles, which groups the attributes that belong to the same profile using a common prefix. For example:
profiles {
    standard {
        params.genome = '/local/path/ref.fasta'
        process.executor = 'local'
    }
    cluster {
        params.genome = '/data/stared/ref.fasta'
        process.executor = 'sge'
        process.queue = 'long'
        process.memory = '10GB'
        process.conda = '/some/path/env.yml'
    }
    cloud {
        params.genome = '/data/stared/ref.fasta'
        process.executor = 'awsbatch'
        process.container = 'cbcrg/imagex'
        docker.enabled = true
    }
}
set different process configuration strategies depending on the target runtime platform. By convention the standard
profile is implicitly used when no other profile is specified by the user.
To enable a specific profile use the -profile option followed by the profile name:
nextflow run <your script> -profile cluster
Nextflow manages workflow execution intermediate results independently from the pipeline's expected outputs. Task output files are created in the task-specific execution directory, which is considered a temporary directory that can be deleted upon completion. The pipeline result files need to be marked explicitly using the publishDir directive in the process that creates the file. For example:
process makeBams {
    publishDir "/some/directory/bam_files", mode: 'copy'

    input:
    file index from index_ch
    tuple val(name), file(reads) from reads_ch

    output:
    tuple val(name), file('*.bam') into star_aligned

    """
    STAR --genomeDir $index --readFilesIn $reads
    """
}
This copies all the bam files created by the star task into /some/directory/bam_files.
Use more than one publishDir to keep different outputs in separate directories.
params.reads = 'data/reads/*_{1,2}.fq.gz'
params.outdir = 'my-results'

Channel
    .fromFilePairs(params.reads, flat: true)
    .set{ samples_ch }

process foo {
    publishDir "$params.outdir/$sampleId/", pattern: '*.fq'
    publishDir "$params.outdir/$sampleId/counts", pattern: "*_counts.txt"
    publishDir "$params.outdir/$sampleId/outlooks", pattern: '*_outlook.txt'

    input:
    set sampleId, file('sample1.fq.gz'), file('sample2.fq.gz') from samples_ch

    output:
    file "*"

    script:
    """
    < sample1.fq.gz zcat > sample1.fq
    < sample2.fq.gz zcat > sample2.fq

    awk '{s++}END{print s/4}' sample1.fq > sample1_counts.txt
    awk '{s++}END{print s/4}' sample2.fq > sample2_counts.txt

    head -n 50 sample1.fq > sample1_outlook.txt
    head -n 50 sample2.fq > sample2_outlook.txt
    """
}
This creates an output structure in the directory my-results, which contains a separate sub-directory for each given sample ID, each of which contains the folders counts and outlooks.
The Nextflow caching mechanism works by assigning a unique ID to each task, which is used to create a separate execution directory where the task is executed and the results stored. The task unique ID is generated as a 128-bit hash number obtained by composing the task input values and files and the command string. View this structure with tree work/
The -resume command line option allows the continuation of a pipeline execution from the last step that was successfully completed:
nextflow run <script> -resume
Nextflow uses the task unique ID to check if the work directory already exists and it contains a valid command exit status and the expected output files. If this condition is satisfied the task execution is skipped and previously computed results are used as the process results. The first task, for which a new output is computed, invalidates all downstream executions in the remaining DAG.
The task work directories are created in the folder work/ in the launching path by default. This is supposed to be a scratch storage area that can be cleaned up once the computation is completed. The final outputs are supposed to be stored in a different location specified using one or more publishDir directives.
A different location for the execution work directory can be specified using the command line option -w, e.g.
nextflow run <script> -w /some/scratch/dir