A Snakemake workflow to quality control and assemble Nanopore data using Long reads first and Hybrid approach.
The Nanosake workflow can be split into two parts - based on the type of data it uses first to assemble the reads (Long reads first Flye assembly vs Hybrid Unicycler assembly)
In short, it performs the following steps:
- Runs Filtlong to remove low quality reads and discards reads less than 1000 Bp. (--min_length 1000 --keep_percent 95)
- Generates pre and post-Filtlong QC plots using Nanoplot.
- Assemble clean filtlong nanopore reads with Flye assembler.
- Flye-only assembly is then polished with long reads using Medaka, followed by polishing with clean trimmed illumina reads using Polypolish - to generate Flye+Medaka+Polypolish assembly.
- For the short illumina read first approach, Long reads are assembled in hybrid mode using Unicycler - medium mode, followed by polishing with short reads using Polypolish.
- The Flye-only, Flye+Medaka+Polypolish, Unicycler+Polypolish assembles the passes through Prokka for annoatation, BUSCO for assembly completeness statistics and QUAST for assembly statistics.
All the final assembly and annotation are finally placed in prokka directory.
Clone the github directory onto your system.
git clone https://github.com/alipirani88/Nanosake.git
The workflow makes use of singularity containers available from State Public Health - Bioinformatics group. If you are working on UMICH Great Lakes - you can load snakemake and singularity module in the following manner. If you are trying to run it on your local or other computing platform, make sure you have snakemake and singularity installed.
Load bioinformatics and snakemake module from Great Lakes modules.
module load Bioinformatics
module load snakemake singularity
As an input, the snakemake file takes a config file where you can set the path to sample.tsv sample lookup file, path to your long and short reads, path to adapter fasta file etc. sample.tsv should be a comma seperated file, consisting of four columns - barcode_id(barcode id outputted by Minion or can be same as sample_id in case you dont have barcode information),sample_id (genome ID - this should be prefix that should be extracted from your illumina fastq reads),illumina_r1 (forward read 1),illumina_r2 (reverse read 2)
You can create sample.tsv file using the following for loop. It assumes that you are running for loop from the folder that contains your illumina short reads.
echo "barcode_id,sample_id,illumina_r1,illumina_r2" > sample.tsv
for read1 in *_R1.fastq.gz; do sample_id=`echo $i | sed 's/_R1.fastq.gz//g'`; read2=`echo $i | sed 's/_R1.fastq.gz/_R2.fastq.gz/g'`; echo $sample_id,$sample_id,$read1,$read2; done >> sample.tsv
Set path to your sample.tsv, ONT long reads and illumina short reads path in config.yaml file.
snakemake -s Nanosake.smk -p --use-conda -j 999 --cluster "sbatch -A {cluster.account} -p {cluster.partition} -N {cluster.nodes} -t {cluster.walltime} -c {cluster.procs} --mem-per-cpu {cluster.pmem}" --conda-frontend conda --cluster-config config/cluster.json --configfile config/config.yaml --latency-wait 1000
The workflow generates all the output in the output prefix folder set in the config file. Each workflow steps gets its own individual folder as shown below:
results/2024-05-01_Test_singularity/
├── busco
├── filtlong
├── flye
├── medaka
├── nanoplot
├── polypolish
├── polypolish_unicycler
├── prokka
├── quast
├── trimmomatic
└── unicycler
You can generate a MultiQC report by running it on prokka, quast, busco and nanoplot folder.
multiqc --filename multiqc_report_filename -o multiqc_output_dir -f prokka/ busco/ quast/ nanoplot/