Shell scripts and workflows for working with Nanopore data. Submits jobs to CDC's Aspen HPC using qsub
.
There are 2 main workflows:
run_basecall-w-gpu.sh
- Guppy GPU basecalling, demultiplexing, and trimming. Followed by NanoPlot for generating seq run stats and graphs.workflow-after-gpu-basecalling.sh
- Assembly with Flye and polishing with Racon and Medaka
Download the repository from the latest release (v0.5.0 is latest as of March 2021) and uncompress.
$ wget https://github.com/kapsakcj/nanoporeWorkflow/archive/v0.5.0.tar.gz
$ tar -xzf v0.5.0.tar.gz
Optional - add the workflows to your $PATH (edit the PATH below to wherever you downloaded the repo). Refresh your environment by source
'ing your .bashrc
file.
# Be careful with this command - make sure the PATH is properly edited!
$ echo 'export PATH=$PATH:/path/to/nanoporeWorkflow-0.5.0/workflows' >> ~/.bashrc
$ source ~/.bashrc
run_basecall-w-gpu.sh
- Guppy GPU basecalling, demultiplexing, and adapter/barcode trimming. Followed by NanoPlot for generating seq run stats and graphs.
- Must be logged into a server with the ability to run
qsub
for submitting jobs to Aspen:- Aspen head node
- Aspen interactive node (run
qlogin
from aspen head node) - Monoliths 1-3 (cannot submit jobs from M4)
- Currently supported MinION data:
- R9.4.1 flowcell (FLO-MIN106)
- Rapid barcoding kit (RBK-004)
- Ligation sequencing kit (SQK-LSK109) + native barcodes 1-24 (NBD103/104/114)
- R10 flowcell
- Ligation sequencing kit (SQK-LSK109) + native barcodes 1-24 (NBD103/104/114)
- R9.4.1 flowcell (FLO-MIN106)
- Unsupported combos (want to add in the future!):
- R9.4.1 + rapid or ligation sequencing kit without barcoding (RAD-004)
- R10 + ligation without barcoding
- R10.3 + ligation with & without barcoding
- Takes in 5 arguments (a double pipe
||
is the same as OR):-i path/to/fast5files/
- the input directory containing raw fast5 files-o path/to/outputDirectory/
- an output directory-b y || yes || n || no
- were barcodes used?-f r941 || r10
- flowcell type used?-k rapid || ligation
- sequencing kit used?
- copies fast5s from input directory to
/scicomp/scratch/$USER/guppy.gpu.XXXXXX
- runs
guppy_basecaller
in high-accuracy mode on a GPU - Demultiplexes using
guppy_basecaller
and additionally trims adapter and barcode sequences (using--trim_barcodes ; --barcode_kits "EXP-NBD104 EXP-NBD114" or "SQK-RBK004"
options) - Compresses the demultiplexed reads (
--compress_fastq
option) - Copies demultiplexed, trimmed, compressed reads into subdirectories in
$OUTDIR/demux/barcodeXX
- Runs
NanoPlot
to generate sequencing run stats on the entire run, as well as individual barcodes.
Pull up help/usage statement by running run_basecall-w-gpu.sh
or run_basecall-w-gpu.sh -h
Usage: /path/to/nanoporeWorkflow-0.5.0/workflows/run_basecall-w-gpu.sh
-i path/to/fast5files/ searches recursively for fast5 files
-o path/to/outputDirectory/ output directory
-b y || yes || n || no barcodes used?
-f r941 || r10 flowcell type used?
-k rapid || ligation sequencing kit used?
example: /path/to/nanoporeWorkflow-0.5.0/workflows/run_basecall-w-gpu.sh -i fast5s/ -o output/ -b y -f r941 -k rapid
# EXAMPLE OUTPUT (reduced for brevity)
$OUTDIR
├── demux
│ ├── barcode01
│ │ └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz (there will be many .fastq.gz files per barcode)
│ ├── barcode02
│ │ └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz
│ ├── barcode03
│ │ └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz
│ ├── guppy-logs
│ └── guppy_basecaller_log-2020-04-17_09-45-00.log (there will be many guppy-logs)
│ ├── nanoplot
│ └── NanoPlot-report.html # additionally all images and other files produced by NanoPlot
│ ├── nanoplot-barcoded
│ └── NanoPlot-report.html # additionally all images and other files produced by NanoPlot, but for each barcode
│ ├── sequencing_summary.txt
│ ├── sequencing_telemetry.js
│ └── unclassified
│ └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz
└── log # qsub logs
└── guppy.log
└── nanoplot.log
workflow-after-gpu-basecalling.sh
- Assembly with Flye and polishing with Racon and Medaka
- Must have previously run the above workflow
run_basecall-w-gpu.sh
- Must be logged into a server with the ability to
qsub
(Aspen, Monoliths 1-3). OUTDIR
argument must be the same directory as theOUTDIR
from therun_basecall-w-gpu.sh
workflow
- Takes in 1 argument:
$outdir
- The output directory from runningrun_basecall-w-gpu.sh
, which containdemux/barcodeXX/
subdirectories
- Prepares a barcoded sample - concatenates all fastq files into one, compresses, and counts read lengths
- Runs
filtlong
to remove reads <500bp and downsample reads to 600 Mb (roughly 120X for a 5 Mb genome) - Assembles downsampled/filtered reads using
flye
(--plasmids
and-g 5M
options used) - Polishes flye draft assembly using racon 4 times
- Polishes racon polished assembly using Medaka (specific to r9.4.1 flowcell, high accuracy basecaller model, and guppy version 3.6.x,
--m r941_min_high_g360
option used) - Final, polished assembly for each barcode can be found in each barcode subdirectory
demux/barcodeXX/final.asm.barcodeXX.fasta
Pull up help/usage statement by running workflow-after-gpu-basecalling.sh
or workflow-after-gpu-basecalling.sh -h
# note: ensure that the outdir supplied in this command is the exact same as the outdir you
# supplied when you ran the run_basecall-w-gpu.sh script
Usage: /path/to/nanoporeWorkflow-0.5.0/workflows/workflow-after-gpu-basecalling.sh outdir/
This workflow runs the following on barcodes 01-24:
filtlong removes reads <500bp and downsamples to 600Mb (roughly 120X for 5Mb genome)
flye assembles reads. --plasmids and -g 5M options used
racon polishes 4X with Racon
medaka polishes once with Medaka using r9.4.1 pore and HAC guppy basecaller profile
# EXAMPLE OUTPUT - only showing one barcode for brevity
$OUTDIR/
├── demux
│ ├── barcode01
│ │ ├── all.fastq.gz
│ │ ├── flye
| | ├── final.asm.barcodeXX.fasta
│ │ ├── log # qsub logs for each barcode
│ │ │ ├── assemble-d64ffbc5-4012-44c5-8191-1a57d4a7d15c.log
│ │ │ ├── polish-medaka-00e52c16-0bd3-460d-b955-3a532be958b1.log
│ │ │ ├── polish-racon-d7ebc124-d100-43e0-b347-1e60bbc0bf18.log
│ │ │ └── prepSample-7ecc6f51-4937-40d1-a6bd-d83e66078984.log
│ │ ├── medaka
│ │ ├── racon
│ │ ├── readlengths.txt.gz
│ │ └── reads.minlen1000.600Mb.fastq.gz
│ ├── guppy-logs
│ ├── nanoplot
│ ├── nanoplot-barcoded
│ ├── sequencing_summary.txt
│ ├── sequencing_telemetry.js
│ └── unclassified
└── log # qsub logs
└── guppy.log
└── nanoplot.log
- It will check for the following files, to determine if it should skip any of the steps. Helps if one part doesn't run correctly and you don't want to repeat a certain step, e.g. re-assembling.
np_filter_filtlong.sh
looks for./demux/barcodeXX/reads.minlen500.600Mb.fastq.gz
np_assemble_flye.sh
looks for./demux/barcodeXX/flye/assembly.fasta
np_consensus_racon.sh
looks for./demux/barcodeXX/racon/ctg.consensus.iteration4.fasta
np_polish_medaka.sh
looks for./demux/barcodeXX/medaka/consensus.fasta
If you are interested in contributing to nanoporeWorkflow, please take a look at the contribution guidelines. We welcome issues or pull requests!
- Add support for passing in a config file to
workflow-after-gpu-basecalling.sh
that contains:- Sample ID
- barcode number (RBK or NBD)
- estimated genome size (to be used as input parameter in various places)
- Allow users to specify a read length for removing reads w/
filtlong
- Test and add support for
rasusa
for randomly subsampling and filtering reads (filtlong
is biased towards reads with highest q-scores) - add flags/options for other sequencing kits, barcoding kits, flowcells (direct RNAseq?)
- R10.3 + ligation with native barcodes 1-24 (R10 flowcell discontinued)
- https://github.com/fenderglass/Flye
- https://github.com/isovic/racon
- https://github.com/nanoporetech/medaka
- https://github.com/nanoporetech/qcat
- How to set Guppy parameters (requires Nanopore Community login credentials) https://community.nanoporetech.com/protocols/Guppy-protocol/v/gpb_2003_v1_revl_14dec2018/how-to-configure-guppy-parameters