Skip to content

A Snakemake workflow to basecall, quality control and assemble Nanopore data.

Notifications You must be signed in to change notification settings

alipirani88/Nanosake

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Nanosake

A Snakemake workflow to quality control and assemble Nanopore data using Long reads first and Hybrid approach.

Summary

The Nanosake workflow can be split into two parts - based on the type of data it uses first to assemble the reads (Long reads first Flye assembly vs Hybrid Unicycler assembly)

In short, it performs the following steps:

  • Runs Filtlong to remove low quality reads and discards reads less than 1000 Bp. (--min_length 1000 --keep_percent 95)
  • Generates pre and post-Filtlong QC plots using Nanoplot.
  • Assemble clean filtlong nanopore reads with Flye assembler.
  • Flye-only assembly is then polished with long reads using Medaka, followed by polishing with clean trimmed illumina reads using Polypolish - to generate Flye+Medaka+Polypolish assembly.
  • For the short illumina read first approach, Long reads are assembled in hybrid mode using Unicycler - medium mode, followed by polishing with short reads using Polypolish.
  • The Flye-only, Flye+Medaka+Polypolish, Unicycler+Polypolish assembles the passes through Prokka for annoatation, BUSCO for assembly completeness statistics and QUAST for assembly statistics.

All the final assembly and annotation are finally placed in prokka directory.

Installation

Clone the github directory onto your system.

git clone https://github.com/alipirani88/Nanosake.git

The workflow makes use of singularity containers available from State Public Health - Bioinformatics group. If you are working on UMICH Great Lakes - you can load snakemake and singularity module in the following manner. If you are trying to run it on your local or other computing platform, make sure you have snakemake and singularity installed.

Load bioinformatics and snakemake module from Great Lakes modules.

module load Bioinformatics
module load snakemake singularity

As an input, the snakemake file takes a config file where you can set the path to sample.tsv sample lookup file, path to your long and short reads, path to adapter fasta file etc. sample.tsv should be a comma seperated file, consisting of four columns - barcode_id(barcode id outputted by Minion or can be same as sample_id in case you dont have barcode information),sample_id (genome ID - this should be prefix that should be extracted from your illumina fastq reads),illumina_r1 (forward read 1),illumina_r2 (reverse read 2)

You can create sample.tsv file using the following for loop. It assumes that you are running for loop from the folder that contains your illumina short reads.


echo "barcode_id,sample_id,illumina_r1,illumina_r2" > sample.tsv

for read1 in *_R1.fastq.gz; do sample_id=`echo $i | sed 's/_R1.fastq.gz//g'`; read2=`echo $i | sed 's/_R1.fastq.gz/_R2.fastq.gz/g'`; echo $sample_id,$sample_id,$read1,$read2; done >> sample.tsv 

Customize config.yaml and set tool specific parameters.

Set path to your sample.tsv, ONT long reads and illumina short reads path in config.yaml file.

Quick start

Run Nanosake on a set of samples.

snakemake -s Nanosake.smk -p --use-conda -j 999 --cluster "sbatch -A {cluster.account} -p {cluster.partition} -N {cluster.nodes}  -t {cluster.walltime} -c {cluster.procs} --mem-per-cpu {cluster.pmem}" --conda-frontend conda --cluster-config config/cluster.json --configfile config/config.yaml --latency-wait 1000

Alt text

The workflow generates all the output in the output prefix folder set in the config file. Each workflow steps gets its own individual folder as shown below:

results/2024-05-01_Test_singularity/
├── busco
├── filtlong
├── flye
├── medaka
├── nanoplot
├── polypolish
├── polypolish_unicycler
├── prokka
├── quast
├── trimmomatic
└── unicycler

You can generate a MultiQC report by running it on prokka, quast, busco and nanoplot folder.

multiqc --filename multiqc_report_filename -o multiqc_output_dir -f prokka/ busco/ quast/ nanoplot/

About

A Snakemake workflow to basecall, quality control and assemble Nanopore data.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published