nanoporeWorkflow

Shell scripts and workflows for working with Nanopore data. Submits jobs to CDC's Aspen HPC using qsub.

⚠️ Don't bother reading if you aren't working on CDC's servers ⚠️

There are 2 main workflows:

run_basecall-w-gpu.sh - Guppy GPU basecalling, demultiplexing, and trimming. Followed by NanoPlot for generating seq run stats and graphs.
workflow-after-gpu-basecalling.sh - Assembly with Flye and polishing with Racon and Medaka

Install

Download the repository from the latest release (v0.5.0 is latest as of March 2021) and uncompress.

$ wget https://github.com/kapsakcj/nanoporeWorkflow/archive/v0.5.0.tar.gz 
$ tar -xzf v0.5.0.tar.gz

Optional - add the workflows to your $PATH (edit the path below to wherever you downloaded the repo). Refresh your environment by sourcing your .bashrc file.

# Be careful with this command - make sure the PATH is properly edited!
$ echo 'export PATH=$PATH:/path/to/nanoporeWorkflow-0.5.0/workflows' >> ~/.bashrc
$ source ~/.bashrc

Workflows

Guppy GPU basecalling, demultiplexing, trimming, and NanoPlot

run_basecall-w-gpu.sh - Guppy GPU basecalling, demultiplexing, and adapter/barcode trimming. Followed by NanoPlot for generating seq run stats and graphs.

Requirements

Must be logged into a server with the ability to run qsub for submitting jobs to Aspen:
- Aspen head node
- Aspen interactive node (run qlogin from aspen head node)
- Monoliths 1-3 (cannot submit jobs from M4)
Currently supported MinION data:
- R9.4.1 flowcell (FLO-MIN106)
  - Rapid barcoding kit (RBK-004)
  - Ligation sequencing kit (SQK-LSK109) + native barcodes 1-24 (NBD103/104/114)
- R10 flowcell
  - Ligation sequencing kit (SQK-LSK109) + native barcodes 1-24 (NBD103/104/114)
Unsupported combos (want to add in the future!):
- R9.4.1 + rapid or ligation sequencing kit without barcoding (RAD-004)
- R10 + ligation without barcoding
- R10.3 + ligation with & without barcoding

This workflow does the following:

Takes in 5 arguments (a double pipe || is the same as OR):
1. -i path/to/fast5files/ - the input directory containing raw fast5 files
2. -o path/to/outputDirectory/ - an output directory
3. -b y || yes || n || no - were barcodes used?
4. -f r941 || r10 - flowcell type used?
5. -k rapid || ligation - sequencing kit used?
Copies fast5s from the input directory to /scicomp/scratch/$USER/guppy.gpu.XXXXXX
Runs guppy_basecaller in high-accuracy mode on a GPU
Demultiplexes using guppy_basecaller and additionally trims adapter and barcode sequences (using the --trim_barcodes option, plus --barcode_kits "EXP-NBD104 EXP-NBD114" or --barcode_kits "SQK-RBK004")
Compresses the demultiplexed reads (--compress_fastq option)
Copies demultiplexed, trimmed, compressed reads into subdirectories in $OUTDIR/demux/barcodeXX
Runs NanoPlot to generate sequencing run stats on the entire run, as well as individual barcodes.
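
The choice of --barcode_kits value described above could be expressed as a small shell function. This is only a sketch: the function name barcode_kits_for is hypothetical and is not taken from the workflow scripts. The kit strings are the ones listed above.

```shell
# Hypothetical helper: map the -k value (rapid or ligation) to the
# --barcode_kits string named in this README.
barcode_kits_for() {
    case "$1" in
        rapid)    printf '%s\n' "SQK-RBK004" ;;
        ligation) printf '%s\n' "EXP-NBD104 EXP-NBD114" ;;
        *)        printf 'unknown kit: %s\n' "$1" >&2; return 1 ;;
    esac
}
```

The result would then be passed, quoted, as the single argument to --barcode_kits.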

USAGE:

Pull up help/usage statement by running run_basecall-w-gpu.sh or run_basecall-w-gpu.sh -h

Usage: /path/to/nanoporeWorkflow-0.5.0/workflows/run_basecall-w-gpu.sh
                 -i path/to/fast5files/        searches recursively for fast5 files
                 -o path/to/outputDirectory/   output directory
                 -b y || yes || n || no        barcodes used?
                 -f r941 || r10                flowcell type used?
                 -k rapid || ligation          sequencing kit used?

example: /path/to/nanoporeWorkflow-0.5.0/workflows/run_basecall-w-gpu.sh -i fast5s/ -o output/ -b y -f r941 -k rapid
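
The five flags in the usage statement could be parsed and validated with POSIX getopts. The sketch below is not the script's actual code: the function name parse_basecall_args and the variable names are assumptions chosen for illustration.

```shell
# Sketch of flag parsing for -i, -o, -b, -f, -k (names are illustrative).
parse_basecall_args() {
    OPTIND=1
    fast5_dir="" out_dir="" barcoded="" flowcell="" kit=""
    while getopts "i:o:b:f:k:h" opt; do
        case "$opt" in
            i) fast5_dir=$OPTARG ;;
            o) out_dir=$OPTARG ;;
            b) case "$OPTARG" in
                   y|yes) barcoded=yes ;;
                   n|no)  barcoded=no ;;
                   *) echo "-b must be y, yes, n, or no" >&2; return 1 ;;
               esac ;;
            f) case "$OPTARG" in
                   r941|r10) flowcell=$OPTARG ;;
                   *) echo "-f must be r941 or r10" >&2; return 1 ;;
               esac ;;
            k) case "$OPTARG" in
                   rapid|ligation) kit=$OPTARG ;;
                   *) echo "-k must be rapid or ligation" >&2; return 1 ;;
               esac ;;
            *) return 1 ;;   # -h or an unknown flag: caller prints usage
        esac
    done
    # All five values are required; the function fails if any is missing.
    [ -n "$fast5_dir" ] && [ -n "$out_dir" ] && [ -n "$barcoded" ] \
        && [ -n "$flowcell" ] && [ -n "$kit" ]
}
```

Normalising y/yes and n/no to a single value up front keeps later checks simple.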

# EXAMPLE OUTPUT (reduced for brevity)
$OUTDIR
├── demux
│   ├── barcode01
│   │   └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz (there will be many .fastq.gz files per barcode)
│   ├── barcode02
│   │   └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz
│   ├── barcode03
│   │   └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz
│   ├── guppy-logs
│   │   └── guppy_basecaller_log-2020-04-17_09-45-00.log (there will be many guppy-logs)
│   ├── nanoplot
│   │   └── NanoPlot-report.html # additionally all images and other files produced by NanoPlot
│   ├── nanoplot-barcoded
│   │   └── NanoPlot-report.html # additionally all images and other files produced by NanoPlot, but for each barcode
│   ├── sequencing_summary.txt
│   ├── sequencing_telemetry.js
│   └── unclassified
│       └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz
└── log # qsub logs
    ├── guppy.log
    └── nanoplot.log
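
As a quick sanity check on the layout above, reads per barcode can be counted straight from the compressed files. This is a minimal sketch, assuming standard 4-line FASTQ records; the function names count_reads and count_reads_per_barcode are hypothetical and not part of the workflow.

```shell
# Hypothetical helper: count reads across all .fastq.gz files in one directory.
# Each FASTQ record is assumed to span exactly 4 lines.
count_reads() {
    gzip -cd "$1"/*.fastq.gz | awk 'END { printf "%d\n", NR / 4 }'
}

# Print "barcodeXX<TAB>read count" for every barcode directory under $1/demux.
count_reads_per_barcode() {
    for dir in "$1"/demux/barcode*/; do
        [ -d "$dir" ] || continue
        printf '%s\t%s\n' "$(basename "$dir")" "$(count_reads "$dir")"
    done
}
```

Calling count_reads_per_barcode with the same directory that was passed to -o would print one line per barcode.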

Assembly with Flye and polishing with Racon and Medaka

workflow-after-gpu-basecalling.sh - Assembly with Flye and polishing with Racon and Medaka

Requirements

Must have previously run the above workflow run_basecall-w-gpu.sh
Must be logged into a server with the ability to qsub (Aspen, Monoliths 1-3).
OUTDIR argument must be the same directory as the OUTDIR from the run_basecall-w-gpu.sh workflow

This workflow does the following:

Takes in 1 argument:
1. $outdir - The output directory from running run_basecall-w-gpu.sh, which contains the demux/barcodeXX/ subdirectories
Prepares a barcoded sample - concatenates all fastq files into one, compresses, and counts read lengths
Runs filtlong to remove reads <500bp and downsample reads to 600 Mb (roughly 120X for a 5 Mb genome)
Assembles downsampled/filtered reads using flye (--plasmids and -g 5M options used)
Polishes flye draft assembly using racon 4 times
Polishes the racon-polished assembly once using Medaka (specific to the R9.4.1 flowcell, the high-accuracy basecaller model, and guppy version 3.6.x; -m r941_min_high_g360 option used)
Final, polished assembly for each barcode can be found in each barcode subdirectory demux/barcodeXX/final.asm.barcodeXX.fasta
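
The sample-preparation step (concatenate, compress, count read lengths) can be sketched as follows. This is an illustration under assumptions, not the workflow's own script: the function name prep_sample is hypothetical, the input glob assumes the fastq_runid_* naming shown in the example output, and records are assumed to be 4-line FASTQ.

```shell
# Sketch of sample preparation for one barcode directory.
# Output names all.fastq.gz and readlengths.txt.gz follow this README's example output.
prep_sample() {
    dir=$1
    # Concatenated gzip streams form a valid gzip file, so the inputs
    # can be joined without decompressing and recompressing them.
    cat "$dir"/fastq_*.fastq.gz > "$dir/all.fastq.gz"
    # The sequence is the 2nd line of each 4-line record; print its length.
    gzip -cd "$dir/all.fastq.gz" \
        | awk 'NR % 4 == 2 { print length($0) }' \
        | gzip -c > "$dir/readlengths.txt.gz"
}
```

Matching on the fastq_ prefix rather than *.fastq.gz keeps all.fastq.gz from being concatenated into itself on a rerun.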

USAGE

Pull up help/usage statement by running workflow-after-gpu-basecalling.sh or workflow-after-gpu-basecalling.sh -h

# note: ensure that the outdir supplied in this command is the exact same as the outdir you
# supplied when you ran the run_basecall-w-gpu.sh script
Usage: /path/to/nanoporeWorkflow-0.5.0/workflows/workflow-after-gpu-basecalling.sh outdir/

This workflow runs the following on barcodes 01-24:

filtlong     removes reads <500bp and downsamples to 600Mb (roughly 120X for 5Mb genome)
flye         assembles reads. --plasmids and -g 5M options used
racon        polishes 4X with Racon
medaka       polishes once with Medaka using r9.4.1 pore and HAC guppy basecaller profile
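
The "roughly 120X" figure is the downsampling target divided by the assumed genome size. A one-line check, with a hypothetical function name expected_coverage and both arguments given in bases:

```shell
# Expected depth = bases kept after downsampling / genome size (integer division).
expected_coverage() {
    echo $(( $1 / $2 ))
}

expected_coverage 600000000 5000000   # 600 Mb of reads over a 5 Mb genome
```

For a genome much larger or smaller than 5 Mb, the same 600 Mb target gives proportionally lower or higher depth.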

# EXAMPLE OUTPUT - only showing one barcode for brevity
$OUTDIR/
├── demux
│   ├── barcode01
│   │   ├── all.fastq.gz
│   │   ├── flye
│   │   ├── final.asm.barcodeXX.fasta
│   │   ├── log  # qsub logs for each barcode
│   │   │   ├── assemble-d64ffbc5-4012-44c5-8191-1a57d4a7d15c.log
│   │   │   ├── polish-medaka-00e52c16-0bd3-460d-b955-3a532be958b1.log
│   │   │   ├── polish-racon-d7ebc124-d100-43e0-b347-1e60bbc0bf18.log
│   │   │   └── prepSample-7ecc6f51-4937-40d1-a6bd-d83e66078984.log
│   │   ├── medaka
│   │   ├── racon
│   │   ├── readlengths.txt.gz
│   │   └── reads.minlen500.600Mb.fastq.gz
│   ├── guppy-logs
│   ├── nanoplot
│   ├── nanoplot-barcoded
│   ├── sequencing_summary.txt
│   ├── sequencing_telemetry.js
│   └── unclassified
└── log # qsub logs
    ├── guppy.log
    └── nanoplot.log
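
To see which barcodes produced a final assembly, the paths can be listed with a short loop. This is a convenience sketch, not part of the workflow; the function name list_final_assemblies is hypothetical and the file pattern comes from the layout above.

```shell
# Hypothetical helper: print the path of every final assembly under an output directory.
list_final_assemblies() {
    for asm in "$1"/demux/barcode*/final.asm.barcode*.fasta; do
        # Skip the unexpanded pattern when a barcode has no final assembly.
        [ -f "$asm" ] && printf '%s\n' "$asm"
    done
    return 0
}
```

Barcodes missing from the output either had no reads or failed at an earlier step; the per-barcode log directory is the place to look.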

Notes on assembly and polishing workflow

Each step's script checks for its own output file and skips the step if that file already exists. This helps when one part fails and you don't want to repeat an earlier step, e.g. re-assembling.
- np_filter_filtlong.sh looks for ./demux/barcodeXX/reads.minlen500.600Mb.fastq.gz
- np_assemble_flye.sh looks for ./demux/barcodeXX/flye/assembly.fasta
- np_consensus_racon.sh looks for ./demux/barcodeXX/racon/ctg.consensus.iteration4.fasta
- np_polish_medaka.sh looks for ./demux/barcodeXX/medaka/consensus.fasta
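
The skip-if-present checks listed above follow a common pattern, sketched here as a generic wrapper. The function name run_unless_exists is hypothetical, and the real scripts may implement the check differently.

```shell
# Hypothetical wrapper: run a command only if its expected output file is missing.
# Usage: run_unless_exists <expected-output-file> <command> [args...]
run_unless_exists() {
    expected=$1
    shift
    if [ -e "$expected" ]; then
        echo "found $expected, skipping" >&2
        return 0
    fi
    "$@"
}
```

One consequence of this pattern: to force a step to rerun, delete its output file first.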

Contributing

If you are interested in contributing to nanoporeWorkflow, please take a look at the contribution guidelines. We welcome issues or pull requests!

Future plans

Add support for passing in a config file to workflow-after-gpu-basecalling.sh that contains:
- Sample ID
- barcode number (RBK or NBD)
- estimated genome size (to be used as input parameter in various places)
Allow users to specify a minimum read length for filtering reads with filtlong
Test and add support for rasusa for randomly subsampling and filtering reads (filtlong is biased towards reads with the highest q-scores)
Add flags/options for other sequencing kits, barcoding kits, and flowcells (direct RNAseq?)
- R10.3 + ligation with native barcodes 1-24 (R10 flowcell discontinued)

Resources

https://github.com/fenderglass/Flye
https://github.com/isovic/racon
https://github.com/nanoporetech/medaka
https://github.com/nanoporetech/qcat
How to set Guppy parameters (requires Nanopore Community login credentials) https://community.nanoporetech.com/protocols/Guppy-protocol/v/gpb_2003_v1_revl_14dec2018/how-to-configure-guppy-parameters
