Nextflow

Concepts

Processes (steps) of a pipeline are executed independently of each other and communicate via channels (asynchronous FIFO queues).
Each process can have one or more channels as input and output.
A single Nextflow script contains each process with its inputs & outputs. The order in which processes are written does not determine the order in which they are executed.
The communication links between processes set the order in which processes are executed.
The configuration file defines the target execution platform (local computer, HPC or cloud).
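
A minimal sketch of these concepts (DSL1 syntax, matching the course examples later in these notes; process and channel names are illustrative):

nums = Channel.from(1, 2, 3)      // a channel: an asynchronous FIFO queue

process sayNum {

    input:
    val x from nums               // one task is spawned per channel element

    output:
    stdout result                 // stdout goes into the 'result' channel

    """
    echo got number $x
    """
}

result.view { it.trim() }         // channel links, not script order, drive execution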

Cluster

submit with screen or tmux:
screen -S nextflow -m bash -c 'sh run.sh; exec sh'

https://github.com/chris-cheshire/linux-cheat-sheet

submit nextflow pipelines as an sbatch:

mlnf # alias for: ml Nextflow/19.10.0 Singularity/2.6.0-foss-2016b Graphviz
cd $baseDir
sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --wrap="nextflow run main-test.nf -resume" --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk

nf-core

https://nf-co.re/usage/configuration
https://github.com/nf-core/configs/blob/master/docs/crick.md
https://www.dropbox.com/s/fv4uucjxm4ehckj/Phil%20Ewels%20-%20nf-core%20London%202020%20tips%20tricks.pdf?dl=0

nf-core/rnaseq

mlnf
cd /camp/home/ziffo/home/projects/astrocyte-fractionation-bulk-rnaseq/nf-core-rnaseq
sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk --wrap=\
"nextflow run nf-core/rnaseq \
	-profile crick \
	--reads '/camp/home/ziffo/home/projects/astrocyte-fractionation-bulk-rnaseq/reads/*_R{1,2}.fastq.gz' \
	--fasta '/camp/home/ziffo/home/genomes/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa' \
	--transcript_fasta '/camp/home/ziffo/home/genomes/ensembl/Homo_sapiens.GRCh38.cdna.all.fa' \
	--gtf '/camp/home/ziffo/home/genomes/annotation/Homo_sapiens.GRCh38.99.gtf' \
	--star_index '/camp/home/ziffo/home/genomes/ensembl/GRCh38.99_ensembl_STAR_index' \
	--saveReference \
	--reverseStranded \
	--aligner star --saveAlignedIntermediates \
	--pseudoaligner salmon \
	--outdir '/camp/home/ziffo/home/projects/astrocyte-fractionation-bulk-rnaseq/nextflow' \
	-with-singularity /camp/apps/misc/stp/babs/nf-core/singularity/rnaseq/1.4.2/nfcore-rnaseq-1.4.2.img \
	-resume"

Submitted job ID: 10292126

with additions: VASTools, kallisto

sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk --wrap=\
"nextflow run rnaseq-master/main.nf \
	-profile crick \
	--reads '/camp/home/ziffo/home/projects/nextflow-test/reads/*_R{1,2}.fastq.gz' \
	--fasta '/camp/home/ziffo/home/genomes/sequences/human/Homo_sapiens.GRCh38.dna.alt.fa' \
	--transcript_fasta '/camp/home/ziffo/home/genomes/index/kallisto/Homo_sapiens.GRCh38.cdna.all.fa' \
	--gtf '/camp/home/ziffo/home/genomes/annotation/Homo_sapiens.GRCh38.99.gtf' \
	--saveReference \
	--reverseStranded \
	--aligner star --saveAlignedIntermediates \
	--pseudoaligner salmon \
	--outdir '/camp/home/ziffo/home/nextflow/nf-core/rnaseq' \
	-with-singularity /camp/apps/misc/stp/babs/nf-core/singularity/rnaseq/1.4.2/nfcore-rnaseq-1.4.2.img \
	-resume"

Edit nf-core pipeline

  1. Fork & clone the GitHub repo
    Clone it to the nf-core working directory on CAMP

  2. Edit the Conda env by adding to the environment.yml file
    add kallisto & vcftools bioconda packages & specify versions:

  - kallisto=0.46.2
  - vcftools=0.1.16
  3. Edit the Dockerfile
    add vast-tools with:
RUN git clone https://github.com/vastgroup/vast-tools.git && cd vast-tools/ &&  echo -ne "y\Hsa" | ./install.R
ENV PATH ~/bin/vast-tools:$PATH
RUN echo 'export PATH=~/bin/vast-tools:$PATH' >> ~/.bashrc

These commands are from https://github.com/vastgroup/vast-tools

  4. Build Docker Image:

Make a login account on docker hub online https://hub.docker.com/?ref=login
Ask Chris Cheshire to add you to luslab on docker
Install Docker on mac https://hub.docker.com/?ref=login NB big file 675MB!
Log into your account on luslab on docker desktop
Open Terminal on local Mac
mkdir & cd into bioinformatics/docker/rnaseq

rsync -aP camp-ext:/camp/home/ziffo/home/nextflow/nf-core/rnaseq/rnaseq/environment.yml .
rsync -aP camp-ext:/camp/home/ziffo/home/nextflow/nf-core/rnaseq/rnaseq/Dockerfile .

Use Chris cheat sheet for commands on how to build the docker image: https://github.com/chris-cheshire/linux-cheat-sheet

docker build -t luslab/oli-nfcore-rnaseq-dev:1.0 /Users/ziffo/bioinformatics/docker/rnaseq # build new image from docker file
docker images  # View local images
docker run -it luslab/oli-nfcore-rnaseq-dev:1.0 /bin/bash  # Run a container with an interactive shell
docker push luslab/oli-nfcore-rnaseq-dev:1.0  # Pushes local image to luslab docker hub 
  5. Edit main.nf on the GitHub repo luslab/oz-bulk-rnaseq

Add processes at the end of the script
edit STAR memory
edit the vast-tools merge file & point the main.nf vast-tools process to it
commit changes & push to GitHub with GitHub Desktop

  6. Pull the docker .img (container) with singularity to the appropriate folder
    https://singularity.lbl.gov/docs-pull
mlnf
cd /camp/home/ziffo/home/nextflow/singularity
sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk --wrap="singularity pull  --name luslab-oli-nfcore-rnaseq-dev-1.0.img docker://luslab/oli-nfcore-rnaseq-dev:1.0" 

sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk --wrap="docker pull luslab/oli-nfcore-rnaseq-dev:1.0"

  7. Run the Nextflow command as an sbatch:
mlnf
cd /camp/home/ziffo/home/projects/astrocyte-fractionation-bulk-rnaseq/oli-nfcore-rnaseq-dev
nextflow pull ojziff/rnaseq ## UPDATE PIPELINE

## RUN PIPELINE
sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk --wrap=\
"nextflow run ojziff/rnaseq \
	-with-singularity /camp/home/ziffo/home/nextflow/singularity/luslab-oli-nfcore-rnaseq-dev-1.0.img \
	-profile crick \
	--reads '/camp/home/ziffo/home/projects/astrocyte-fractionation-bulk-rnaseq/reads/*_R{1,2}.fastq.gz' \
	--reverseStranded \
	--fasta '/camp/home/ziffo/home/genomes/sequences/human/Homo_sapiens.GRCh38.dna.alt.fa' \
	--transcript_fasta '/camp/home/ziffo/home/genomes/index/kallisto/Homo_sapiens.GRCh38.cdna.all.fa' \
	--gtf '/camp/home/ziffo/home/genomes/annotation/Homo_sapiens.GRCh38.99.gtf' \
	--bed12 '/camp/home/ziffo/home/genomes/annotation/Homo_sapiens.GRCh38.99.bed' \
	--saveReference \
	--aligner star --saveAlignedIntermediates \
	--pseudoaligner salmon \
	--outdir '/camp/home/ziffo/home/projects/astrocyte-fractionation-bulk-rnaseq/oli-nfcore-rnaseq-dev' \
	--email oliver.ziff@crick.ac.uk \
	-resume"

export NXF_WORK="/camp/home/ziffo/home/nextflow/projects/scrnaseq-repro"

Remove Containers & Images from local Mac

Containers & images are huge so need to be pruned regularly.

docker container prune # removes all stopped containers
docker image prune -a # removes all unused images - or docker image rm <image ID> for individual images

nf-core/scrnaseq

https://github.com/nf-core/scrnaseq/blob/master/main.nf

mlnf
cd /camp/home/ziffo/home/nextflow/nf-core/scrnaseq
sbatch -N 1 -c 4 --mem=32G -t 48:00:00 --mail-type=ALL,ARRAY_TASKS --mail-user=oliver.ziff@crick.ac.uk --wrap=\
"nextflow run nf-core/scrnaseq \
	--reads '/camp/stp/babs/outputs/gandhi-patani/doaa.taha/fastq_pooled/**/*_R{1,2}_001.fastq.gz' \
	--fasta '/camp/home/ziffo/home/genomes/sequences/human/Homo_sapiens.GRCh38.dna.alt.fa' \
	--gtf '/camp/home/ziffo/home/genomes/annotation/Homo_sapiens.GRCh38.99.gtf' \
	--transcriptome_fasta '/camp/home/ziffo/home/genomes/index/kallisto/Homo_sapiens.GRCh38.cdna.all.fa' \
	--save_reference \
	--email oliver.ziff@crick.ac.uk \
	-profile crick \
	-with-singularity /camp/home/ziffo/home/nextflow/nf-core/scrnaseq/work/singularity/nfcore-scrnaseq-1.0.0.img"

Container

singularity pull --name nfcore-scrnaseq-1.0.0.img docker://nfcore/scrnaseq:1.0.0

Profiles

Select a configuration profile with -profile, e.g.:

nextflow run main-test.nf -profile crick -resume

Configuration

crick profile config
https://github.com/nf-core/configs/blob/master/conf/crick.config

Debugging

more .nextflow.log
more .command.sh

Nextflow course

March 2020

Write Script

Define processes within the pipeline.

  
params.str = 'Hello world!'

process splitLetters {

    output:
    file 'chunk_*' into letters

    """
    printf '${params.str}' | split -b 6 - chunk_
    """
}

process convertToUpper {

    input:
    file x from letters.flatten()

    output:
    stdout result

    """
    cat $x | tr '[a-z]' '[A-Z]'
    """
}

result.view { it.trim() }

Save script as filename.nf

Execute Script

nextflow run filename.nf

It will output something similar to:

N E X T F L O W  ~  version 19.04.0
executor >  local (3)
[69/c8ea4a] process > splitLetters   [100%] 1 of 1 ✔
[84/c8b7f1] process > convertToUpper [100%] 2 of 2 ✔
HELLO
WORLD!

Processes are executed in parallel where possible.
The [69/c8ea4a] numbers refer to each unique process execution. These are the prefixes of the directories where each process is executed. You can inspect them in $PWD/work.

## RUN PIPELINE
nextflow run nf-core/atacseq \
	-r 1.1.0 \
	--input "design-atac.csv" \
	-profile crick \
	-with-singularity /camp/apps/misc/stp/babs/nf-core/singularity/atacseq/1.1.0/nfcore-atacseq-1.1.0.img \
	--email oliver.ziff@crick.ac.uk \
	--genome BDGP6 \
	--save_reference \
	-resume

Modify and Resume

Only processes that have changed are re-executed. For unchanged processes the cached result is used, which saves rerunning the whole pipeline.

After editing re-execute the script with:
nextflow run tutorial.nf -resume

It will print output similar to this:

N E X T F L O W  ~  version 19.04.0
executor >  local (2)
[69/c8ea4a] process > splitLetters   [100%] 1 of 1, cached: 1 ✔
[d0/e94f07] process > convertToUpper [100%] 2 of 2 ✔
olleH
!dlrow

$PWD/work caches all results which takes up a lot of space - clean the folder regularly.
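
Two built-in Nextflow commands help with this housekeeping:

nextflow log       # list previous runs
nextflow clean -f  # delete the work files of the last run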

Pipeline parameters

Set parameters with -- followed by the parameter name and its value:
nextflow run tutorial.nf --str 'Bonjour le monde'

Configuration file

The nextflow.config file needs to be in the pipeline execution directory. The file defines which executor to use, environment variables and pipeline parameters.
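
A minimal sketch of such a file (values are illustrative):

// nextflow.config
process.executor = 'slurm'       // which executor to use
env.MY_VAR = 'some value'        // environment variable made available to tasks
params.str = 'Hello world!'      // default value for a pipeline parameter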


scRNAseq nfcore pipeline

https://nf-co.re/scrnaseq

UPDATE PIPELINE

nextflow pull nf-core/atacseq

Course

Course notes for DAY 1 https://www.seqera.io/training/day1/index.html
2nd March 2020

ssh ubuntu@ec2-34-253-38-90.eu-west-1.compute.amazonaws.com

Singularity
Docker containers
Executors: cloud or own infrastructure
Error recovery: resume jobs
New features with DSL 2.0: modular components

Workflows

Automation - push the button pipeline
embarrassingly parallel: spawn 100s of jobs over a distributed cluster
mash up different tools & scripts (dependencies)
dependency trees can be very complex: processes (circles) connected by arrows, e.g. 70 process tasks.
Reproducibility in computational biology - different libraries, software versions & execution environments, rounding errors
Challenges: reproducible; portable; scalable; usable; traceable.

Fundamentals for scalable workflows:
takes code in any language: R, bash
define software dependencies: conda, docker
define the dataflow programme
version control: github
deploy in a cloud: google cloud, aws, platform computing

Fast prototyping: custom DSL enables task composition. Written in Groovy (like Python for Java).
Easy parallelisation: driven by what is in the channels
Self contained: each task is independent - separate sections can run in different environments.
Portable deployments

Example:

input:
file ref.fa from genome_channel
file sample.fq from reads_ch

output:
file sample.bam into bam_ch

script:
"""
bwa mem ref.fa sample.fq | samtools sort -o sample.bam
"""

output of one process is the input of the next process.

Dataflow

input > splitting > tasks (map) > filter > collect > result
channel = contains elements
process = executed for each element; each execution is a task.

implicit parallelism: a process runs as many times as there are files delivered to it (e.g. with wildcard *)

Channel.fromFilePairs("*_{1,2}.fq")
// emits: ( ipsc, [ipsc_1.fq, ipsc_2.fq] )
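
A sketch of implicit parallelism (hypothetical process; one task is spawned per matched file):

reads_ch = Channel.fromPath('data/*.fq')   // one channel element per file

process countLines {

    input:
    file f from reads_ch                   // runs once per file, in parallel

    """
    wc -l $f
    """
}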

Deployment scenarios

local execution: nextflow - docker/singularity container - OS - local storage
centralised orchestration: the nextflow command turns each task into a cluster job.
distributed orchestration: nextflow is submitted as a single cluster job, then splits off further jobs as required.

process {
    executor = 'slurm'   // e.g.
    queue = 'short'
}

Containers

Docker or Singularity (also Podman)
faster startup. lighter.
layers. can be built upon.
transparent - see how they are built
always use containers. makes things much easier.

Execution reports

CPU usage & time >> optimise the pipeline
QC: look at the tasks - shows what takes longest.
DAG visualisation: shows the pipeline ordering.
Text editors: Atom, Visual Studio Code - has an ssh client, nextflow highlighting & docker support

nf-core

community effort to collect production ready analysis pipelines built with nextflow

Directories

Files are saved to a directory for each task. The name of each directory is noted in the nextflow output in the terminal.
Sub-task files are put into subdirectories.

Debug

Errors can be investigated by going to the work directory of the failed task. Get the directory ID from the nextflow output.
more .command.sh

nextflow run script.nf

ubuntu@ip-172-31-40-29:~/nf-course/day1$ nextflow run simple.nf
N E X T F L O W  ~  version 20.01.0
Launching `simple.nf` [agitated_cantor] - revision: d5b8b09fe6
executor >  local (3)
[a8/4a74ce] process > splitLetters   [100%] 1 of 1 ✔
[81/bb9bde] process > convertToUpper [100%] 2 of 2 ✔
olleH
!dlrow

The working directories are a8/4a74ce and 81/bb9bde

View

letters.flatten().view() at the end of the script shows the content of the channel. Helpful for debugging.

Dump

letters.flatten().dump() prints the contents of the channel (shown when running with -dump-channels)

Syntax

Nextflow scripts are written in the Groovy language (similar to Python). http://docs.groovy-lang.org/latest/html/groovy-jdk/java/util/List.html
print with println
comments with // (not #)
assign variables with =, e.g. x = 1
define local variables with def
lists use [ ]: list = [10,20,30]
access an element of a list with [ ] or list.get(0) // first element is index 0 in Groovy
maps are like lists, but the key can be any value (not just an integer index):
map = [a:0, b:1, c:2]
strings are defined in single or double quotes
multiline strings use triple double quotes

"""
Hello
Here here
"""

if statement:


if( < statement > ) {
//true branch
}
else {
//false branch
}

Testing a list:

if( list ) {
    println list
}
else {
    println 'The list is empty'
}

// or more concisely:
println( list ?: "The list is empty" )

for loop syntax

for (int i = 0; i <3; i++) {
   println("Hello World $i")
}

Closures: define a function like a variable.

square = { it * it }              // define the closure
println square.call(5)            // prints the result of 5 * 5

list = [1,2,3]
list.each { println(it * it) }    // run the closure for each element
z = list.collect { it * it }      // saves the results of the closure to variable z

Modifying scripts

nextflow run script.nf -resume
avoids rerunning the whole script from scratch

Define parameters at start

params.greeting = 'Hello World'
params.outdir = 'some/path'

Change greeting: nextflow run simple.nf --greeting "Hola Mundo"
Note: pipeline parameters take -- (e.g. --greeting); Nextflow's own options such as -resume take a single -.

Define channels at start

greeting_ch = Channel.from(params.greeting)

Channels

Channels logically connect tasks to each other.
A task doesn't need to wait for other tasks to complete: it runs as soon as its input files are available.
Only one process can consume a given queue channel.

2 types of channels:

  1. Queue = asynchronous unidirectional FIFO connecting two processes
  2. Value = singleton channel. Bound to single value. Can be read unlimited times. Not consumed.
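
For example:

queue_ch = Channel.from(1, 2, 3)        // queue channel: FIFO, consumed once
value_ch = Channel.value('GRCh38')      // value channel: read any number of times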

fromPath

creates a queue channel.
Uses a wildcard to take in all matching files.

ch = Channel.fromPath('data/ggal/*.fq')
ch.view()

Can take in all subdirectories:

ch = Channel.fromPath('data/**/*.fq')
ch.view()

fromFilePairs handles pairs of files:

Channel
    .fromFilePairs('/my/data/SRR*_{1,2}.fastq')
    .println()

{1,2} wildcard takes files ending 1 or 2.

Channel
.fromFilePairs('data/ggal/*_{1,2}.fq')
.println()

fromSRA

returns all fastq files from a single SRA ID

Channel
.fromSRA('SRP043510')
.view()

Processes

Script block is the command that runs the process. Each process has 1 script block.
The default is a bash script, but it can be changed to e.g. Python or R:

process example {
    script:
    """
    blastp -db /data/blast -query query.fa -outfmt 6 > blast_result
    cat blast_result | head -n 10 | cut -f 2 > top_hits
    blastdbcmd -db /data/blast -entry_batch top_hits > sequences
    """
}

process pyStuff {
    script:
    """
    #!/usr/bin/env python

    x = 'Hello'
    y = 'world!'
    print "%s - %s" % (x,y)
    """
}

Verbose using echo

echo true in script
or add -process.echo true
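
A minimal example (hypothetical process):

process sayHello {
    echo true                  // print the task's stdout to the terminal

    """
    echo 'Hello from the task'
    """
}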

escape bash variables

Bash variables set inside the script block (e.g. FASTQ=.fastq) must be escaped with \ so Nextflow does not interpolate them:

  """
X = 1
echo Hello \$X
  """
}

Conditional Script

The default aligner is kallisto, but it can be changed to salmon with --aligner salmon

params.aligner = 'kallisto'

process foo {
    script:
    if( params.aligner == 'kallisto' )
        """
        kallisto --reads /some/data.fastq
        """
    else if( params.aligner == 'salmon' )
        """
        salmon --reads /some/data.fastq
        """
    else
        throw new IllegalArgumentException("Unknown aligner $params.aligner")
}

Input

syntax:

input: 
	<input qualifier> <input name> from <source channel>

val is the input qualifier:

num = Channel.from( 1, 2, 3 )

process basicExample {
    input:
    val x from num

    """
    echo process job $x
    """
}

reads = Channel.fromPath( 'data/ggal/*.fq' )
process foo {
input:
file 'sample.fastq' from reads
script:
"""
echo your_command --reads sample.fastq
"""
}

See the links to these files in the work directory:
tree work/

Output

With the tuple qualifier, an output can carry a sample_id together with the actual data, keeping metadata and files paired, as sketched below.
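
A sketch in DSL1 syntax (channel and variable names are illustrative; the same form appears in the publishDir example later in these notes):

output:
tuple val(sample_id), file('*.bam') into bam_ch   // metadata & files stay paired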

Write a pipeline

4 processes:
1. Index a transcriptome file
2. QC with FastQC
3. Quantification
4. MultiQC report

Index

params.reads = "$baseDir/data/ggal/*_{1,2}.fq"
params.transcriptome = "$baseDir/data/ggal/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"
println "reads: $params.reads"

nextflow run script1.nf --reads

Log

cat .nextflow.log

Containers have software

nextflow run script2.nf -with-docker
-with-docker brings in the container from the config:

nextflow.config

process.container = 'nextflow/rnaseq-nf'
docker.runOptions='-u $(id -u):$(id -g)'

add to the config so that -with-docker is not needed every time:
docker.enabled = true

tree work/ shows all files in the directories under work/
$task.cpus can be specified outside of the script, which makes it easy to change the number of CPUs, as sketched below.
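
For example (hypothetical process):

process align {
    cpus 4                     // can be overridden from the config without editing the script

    """
    echo running with $task.cpus cpus
    """
}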

.collect gathers all the items of a channel into a single list, so a downstream process receives them in one task
.mix combines the items of several channels into one channel
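
For example:

Channel.from(1, 2, 3).collect().view()               // emits a single item: [1, 2, 3]
Channel.from(1, 2).mix(Channel.from('a')).view()     // emits 1, 2, a (order not guaranteed)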

Long R scripts

mkdir bin/
move the long R script into bin/
chmod +x bin/script.R # make executable
nextflow automatically adds bin/ to the PATH
the R code can then be removed from the nextflow script and called as an external script - just pass the input.

GitHub

repository

data/
docker
nextflow.config
main.nf

nextflow run nextflow-io/rnaseq-nf -with-docker # pulls the code & docker

Specify revision / version

nextflow run nextflow-io/rnaseq-nf -r v1.2
-r can be a version tag, revision (branch) or commit ID.

Dependencies

Docker

The Docker client talks to a service (daemon) that runs in the background. It finds the image by name; if it is not on the local machine it pulls it from the remote Docker Hub >> creates a container from the image >> runs the command
NB security concerns with docker permissions

docker run -it debian:jessie

this opens a shell inside the container
exit to return to the host shell

Dockerfile

Script to create image

FROM debian:jessie

MAINTAINER <your name>

RUN apt-get update && apt-get install -y curl cowsay

ENV PATH=$PATH:/usr/games/

RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
 && mv /salmon-*/bin/* /usr/bin/ \
 && mv /salmon-*/lib/* /usr/lib/

Always build containers in an empty directory containing only the build files:
mkdir docker # put the code & Dockerfile in here
The default image tag is latest: Successfully tagged my-image:latest

Build the docker container:
docker build -t my-image .

Run interactively inside of container:

docker run -it my-image bash

Use container software & files to run a command. All files to run with container need to be accessible - mount the current file system into the container:

docker run --volume $HOME:$HOME --workdir $PWD -u $(id -u):$(id -g) my-image salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index

-u $(id -u):$(id -g) runs the container as your own user & group, so output files are not owned by root

Docker hub

publish a container publicly to your Docker Hub account: tag it with your Docker username, then push

docker tag my-image <user-name>/my-image

docker push <user-name>/my-image

download a container with pull

docker pull <user-name>/my-image

Docker hub with nextflow

nextflow run script2.nf -with-docker my-image

See the nextflow commands that were run:

cd /work/nextflow_ID_directory/
more .command.run

Singularity

a secure alternative to Docker for use on HPCs: containers run with the user's own permissions, not as root.

Singularity file:

Bootstrap: docker
From: debian:jessie

%environment
export PATH=$PATH:/usr/games/

%labels
AUTHOR <your name>

%post
apt-get update && apt-get install -y locales-all curl cowsay
curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
 && mv /salmon-*/bin/* /usr/bin/ \
 && mv /salmon-*/lib/* /usr/lib/

Build the image from the file with:

sudo singularity build my-image.sif Singularity

Run container:

singularity exec my-image.sif cowsay 'Hello Singularity'

Run interactively from inside container:

singularity shell my-image.sif

exit to return to the host shell

It is best to write images as Dockerfiles rather than Singularity files; Singularity converts Docker images automatically:

singularity pull docker://debian:jessie

run with nextflow

nextflow run script7.nf -with-singularity nextflow/rnaseq-nf

this creates a singularity directory (if no other location is specified) where the converted image is cached

Singularity library: SyLabs

https://sylabs.io/
https://cloud.sylabs.io/home

Conda/Bioconda

Conda environment is defined in a YAML file:

# CONDA ENVIRONMENT FILE 
name: nf-tutorial
channels:
  - defaults
  - bioconda
  - conda-forge
dependencies:
  # Default bismark
  - salmon=0.8.2
  - fastqc=0.11.5
  - multiqc=1.5

create the conda environment:

conda env create --file env.yml

activate conda environment:

conda activate nf-tutorial

run nextflow with conda:

nextflow run script7.nf -with-conda /home/ubuntu/miniconda2/envs/nf-tutorial

Best to put the conda environment inside the Dockerfile by copying in the conda YAML file:

FROM continuumio/miniconda:4.7.12

MAINTAINER Paolo Di Tommaso <paolo.ditommaso@gmail.com>

RUN apt-get -y install ttf-dejavu

COPY conda.yml .
RUN conda env update -n root -f conda.yml \
 && conda clean -a

RUN apt-get install -y procps

This base image ships miniconda >> the conda environment is added on top

Nextflow Configuration

Nextflow looks for nextflow.config in:

  1. current directory
  2. then in script base directory
  3. then $HOME/.nextflow/config

Config Syntax

propertyOne = 'world'
anotherProp = "Hello $propertyOne"
customPath = "$PATH:/my/app/folder"

access any variable defined in the host environment such as $PATH, $HOME, $PWD, etc.

Add Comments:

// comment a single line

/* a comment spanning multiple lines */

Define properties either with a dot prefix:

alpha.x = 1
alpha.y = 'string value…'

Or group them together:

beta {
    p = 2
    q = 'another string …'
}

Define params:

// workflow script
params.foo = 'Hello'
params.bar = 'world!'

// print both params
println "$params.foo $params.bar"

Can change params from the command line, e.g. change foo to Hola:

nextflow run params.nf --foo Hola

Config env

defines the environment variables exported to the environment where workflow tasks are executed

env.ALPHA = 'some value'
env.BETA = "$HOME/some/path"

Config process

settings for the task execution such as cpus, memory, container and other resources in the pipeline script:

process {
    cpus = 10
    memory = 8.GB
    container = 'biocontainers/bamtools:v2.4.0_cv3'
}

better to define these in the config than the nextflow script

process { memory = { 4.GB * task.cpus } }

Docker execution

specify container image:

process.container = 'nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266'
docker.enabled = true

or with singularity

process.container = '/some/singularity/image.sif'
singularity.enabled = true

Use singularity to pull docker image:

process.container = 'nextflow/rnaseq-nf'
singularity.enabled = true

following protocols are supported:

  • library:// download the container image from the Singularity Library service
  • shub:// download the container image from the Singularity Hub.
  • docker:// download the container image from the Docker Hub and convert it to the Singularity format.
  • docker-daemon:// pull the container image from a local Docker installation and convert it to a Singularity image file.

Conda execution

a Conda environment can be provided in the configuration file by adding the following setting in nextflow.config:

process.conda = "/home/ubuntu/miniconda2/envs/nf-tutorial"

path is to either the conda directory or the YAML file

Deployment on HPC

To run your pipeline with a batch scheduler modify the nextflow.config file specifying the target executor and the required computing resources if needed. For example:

process.executor = 'slurm'

specify the amount of resources i.e. cpus, memory, execution time, etc. required by each task. Use the scope process to define the resource requirements for all processes in your workflow applications. For example:

process {
    executor = 'slurm'
    queue = 'short'
    memory = '10 GB'
    time = '30 min'
    cpus = 4
}

Different tasks need different amounts of computing resources. It is possible to define the resources for a specific task using the process selector withName: followed by the process name:

process {
    executor = 'slurm'
    queue = 'short'
    memory = '10 GB'
    time = '30 min'
    cpus = 4

    withName: foo {
        cpus = 4
        memory = '20 GB'
        queue = 'short'
    }

    withName: bar {
        cpus = 8
        memory = '32 GB'
        queue = 'long'
    }
}

Label processes

When a workflow is composed of many processes, it can be arduous to list every process name in the config to specify its resources. It is better to annotate processes with a label, then specify the resources once in the config for all processes sharing that label.

workflow script:

process task1 {
    label 'long'

    """
    first_command --here
    """
}

process task2 {
    label 'short'

    """
    second_command --here
    """
}

config:

process {
    executor = 'slurm'

    withLabel: 'short' {
        cpus = 4
        memory = '20 GB'
        queue = 'alpha'
    }

    withLabel: 'long' {
        cpus = 8
        memory = '32 GB'
        queue = 'omega'
    }
}

Configure multiple containers in a single workflow

different container for each process in your workflow:

process {
  withName: foo {
    container = 'some/image:x'
  }
  withName: bar {
    container = 'other/image:y'
  }
}

docker.enabled = true

Config profiles

Configuration files can contain the definition of one or more profiles. A profile is a set of configuration attributes that can be activated when launching a pipeline with the -profile command line option. Profiles are defined using the special scope profiles, which groups the attributes belonging to the same profile under a common prefix. For example:

profiles {
    standard {
        params.genome = '/local/path/ref.fasta'
        process.executor = 'local'
    }

    cluster {
        params.genome = '/data/stared/ref.fasta'
        process.executor = 'sge'
        process.queue = 'long'
        process.memory = '10GB'
        process.conda = '/some/path/env.yml'
    }

    cloud {
        params.genome = '/data/stared/ref.fasta'
        process.executor = 'awsbatch'
        process.container = 'cbcrg/imagex'
        docker.enabled = true
    }
}

This sets different process configuration strategies depending on the target runtime platform. By convention, the standard profile is implicitly used when the user specifies no other profile.

To enable a specific profile, use the -profile option followed by the profile name:

nextflow run <your script> -profile cluster

Organise Outputs

Nextflow manages the intermediate results of workflow execution separately from the pipeline's expected outputs. Task output files are created in the task-specific execution directory, which is treated as a temporary directory that can be deleted upon completion. Pipeline result files must be marked explicitly using the publishDir directive in the process that creates them. For example:

process makeBams {
    publishDir "/some/directory/bam_files", mode: 'copy'

    input:
    file index from index_ch
    tuple val(name), file(reads) from reads_ch

    output:
    tuple val(name), file('*.bam') into star_aligned

    """
    STAR --genomeDir $index --readFilesIn $reads
    """
}

this copies all BAM files created by the STAR task into /some/directory/bam_files

Subdirectories

use more than one publishDir to keep different outputs in separate directories

params.reads = 'data/reads/*_{1,2}.fq.gz'
params.outdir = 'my-results'

Channel
    .fromFilePairs(params.reads, flat: true)
    .set { samples_ch }

process foo {
    publishDir "$params.outdir/$sampleId/", pattern: '*.fq'
    publishDir "$params.outdir/$sampleId/counts", pattern: '*_counts.txt'
    publishDir "$params.outdir/$sampleId/outlooks", pattern: '*_outlook.txt'

    input:
    set sampleId, file('sample1.fq.gz'), file('sample2.fq.gz') from samples_ch

    output:
    file "*"

    script:
    """
    < sample1.fq.gz zcat > sample1.fq
    < sample2.fq.gz zcat > sample2.fq

    awk '{s++}END{print s/4}' sample1.fq > sample1_counts.txt
    awk '{s++}END{print s/4}' sample2.fq > sample2_counts.txt

    head -n 50 sample1.fq > sample1_outlook.txt
    head -n 50 sample2.fq > sample2_outlook.txt
    """
}

This creates an output structure in the directory my-results with a separate subdirectory for each sample ID, each containing the folders counts and outlooks.

Cache

Nextflow's caching mechanism works by assigning a unique ID to each task, which is used to create a separate execution directory where the task is executed and its results stored. The unique ID is a 128-bit hash composed from the task's input values, input files and command string. View this structure with tree work/

Resume

The -resume command line option allows a pipeline execution to continue from the last step that completed successfully:

nextflow run <script> -resume

Nextflow uses the task unique ID to check whether the work directory already exists and contains a valid command exit status and the expected output files. If this condition is satisfied, the task execution is skipped and the previously computed results are used. The first task for which a new output is computed invalidates all downstream executions in the remaining DAG.

Work/

The task work directories are created under the folder work/ in the launching path by default. This is intended as scratch storage that can be cleaned up once the computation is complete. The final outputs should be stored in a different location, specified using one or more publishDir directives.

A different location for the execution work directory can be specified using the command line option -w e.g.

nextflow run <script> -w /some/scratch/dir