Metagenomic assembly

WDL Workflow for metagenome assembly: Python script to generate a mapping between non-redundant gene catalog and MAGS

How does this work?

The wrapper scripts in Python (located in src) will prepare files and send them to Cromwell. Cromwell executes instructions written in Workflow definition Language (WDL; located in src/wdl). To avoid dependency conflicts Cromwell runs Docker containers with preinstalled software (dockerfiles located in docker).

Introduction to WDL workflow

This pipeline will perform:

Pre-processing of reads with Kneaddata
Metagenomics assembly with Megahit
Gene prediction
Mapping of reads against the contigs
Metagenome binning using MetaBAT2
Quality assessment of genome bins
Taxonomic classifications
Gene clustering with CD-HIT-EST
Mapping of reads to gene clusters and computing gene counts
Functional annotation of proteins using both eggNOG-mapper and DeepFRI
Mapping between MAGs and functionally annotated gene catalogue

Requirements

Docker
conda for building the environment
Python

1. Installation

1. Clone the repository

git clone www.github.com/crusher083/metagenome_assembly

2. Create a conda environment

conda env create -f pipeline.yml

3. Install Cromwell

Use the setup_cromwell.py script to download and install it. - python src/setup_cromwell.py --save_path SAVE_PATH

2. Run the pipeline!

Attention: The pipeline was tested on two samples with 4.3 Gb and 2.7 Gb in 2 parallel jobs, 16 CPU cores and 64 Gb RAM each. Time may vary significantly depending on your system and sequencing depth

1. Quality control and assembly

This step will perform quality control of your reads with Kneaddata and assemble quality-controlled reads into contigs using MegaHIT.

Requirements
- input_folder - path to directory with paired shotgun sequencing files. (fastq.gz, fastq, fq.gz, fq)
- bt2_index - path to a directory with a Bowite2 index. In case the folder doesn't contain an index, the user would be proposed to download GRCh38 index used for the decontamination of metagenomic samples from human DNA.
- output_folder - path to a directory where the results will be saved.
Optional arguments
- thread_num - number of threads to use. (default: 1)
- concurrent_jobs - number of concurrent jobs to run. (default: 1)
Output
- quality controlled .fastq.gz files in OUTPUT_FOLDER
- assembled contigs in OUTPUT_FOLDER/assemble.
- count table with read counts per sample OUTPUT_FOLDER/kneaddata_count_table.tsv.

# Qualirty control raw reads and assemble contigs 
python src/qc_and_assemble.py -i INPUT_FOLDER -o OUTPUT_FOLDER \
-bt2_index FOLDER_WITH_BT2_INDEX \
-t 8 -c 3

INFO:root:Treating /storage/TomaszLab/vbez/metagenomic_gmhi/metagenomome_assembly/databases/GRCh38_bt2 as directory with bowtie2 index.
INFO:root:I inferred that _1 and _2 distinguish paired end reads.
INFO:root:Found samples: 2
DEBUG:root:Creating output directory: qc_assemble_out
DEBUG:root:Creating output directory: qc_assemble_out/system
[17:09:48] Workflow started succesfully. Please, be patient.  
[17:16:09] Workflow finished successfully.

Then pipeline forks into two branches - taxonomical and functional

T - Taxonomical annotation

T1 - MAG assembly and taxonomic classification

This step will bin contigs using MetaBAT2, check bins for quality and contamination using CheckM and assign taxonomical classification for MAGs using GTDB.

Requirements
- input_folder_reads - a path to a directory with QCed reads (located in OUTPUT_FOLDER/ of qc and assembly step).
- input_folder_contigs - a path to a directory with assembled contigs (located in OUTPUT_FOLDER/assemble of qc and assembly step).
- gtdbtk_data - a path to a directory with a GTDB-Tk database release. In case the folder doesn't contain the data, it will be downloaded automatically.
- output_folder - a path to a directory where the results will be saved.
Optional arguments
- thread_num - number of threads to use. (default: 1)
- concurrent_jobs - number of concurrent jobs to run. (default: 1)
- suffix - suffix, that helps to identify contigs and preserve consistent filenames. (default: .min500.contigs.fa)
- suffix1 - suffix, that helps to identify forward reads. (default: _paired_1.fastq.gz)
- suffix2 - suffix, that helps to identify reverse reads. (default: _paired_2.fastq.gz)
Output
- SAMPLE_NAME.gff - feature table in Genbank table.
- SAMPLE_NAME.fna - nucleotide sequences for genes in FASTA.
- SAMPLE_NAME.faa - protein translations for genes in FASTA.

# Bin, check and taxonomically classify MAGs
python src/t1_predict_mags.py -ir INPUT_FOLDER_READS -s1 _paired_1.fastq.gz -s2 -s1 _paired_2.fastq.gz \ 
-ic INPUT_FOLDER_CONTIGS -s .min500.contigs.fa \
-gtdb ../databases/gtdbtk-data/ -o OUTPUT_FOLDER \
-t 24 -c 2

DEBUG:root:Creating output directory: out_mags
DEBUG:root:Creating output directory: out_mags/system
[17:23:29] Workflow started succesfully. Please, be patient.  
[17:40:13] Workflow finished successfully.

F - Functional annotation

F1 - Gene prediction

This step will perform gene recognition using Prodigal.

Requirements
- input_folder - a path to a directory with assembled contigs (located in OUTPUT_FOLDER/assemble of the qc and assembly step).
- output_folder - a path to a directory where the results will be saved.
Optional arguments
- concurrent_jobs - number of concurrent jobs to run. (default: 1)
- suffix - suffix, that helps to identify contigs and preserve consistent filenames. (default: .min500.contigs.fa)
Output
- SAMPLE_NAME.gff - feature table in Genbank table.
- SAMPLE_NAME.fna - nucleotide sequences for genes in FASTA.
- SAMPLE_NAME.faa - protein translations for genes in FASTA.

# Predict genes
python src/f1_predict_genes.py -i INPUT_FOLDER -s .min500.contigs.fa -o OUTPUT_FOLDER   \
-c 3

DEBUG:root:Creating output directory: OUTPUT_FOLDER
DEBUG:root:Creating output directory: OUTPUT FOLDER/system
[15:19:43] Workflow started succesfully. Please, be patient.
[15:21:29] Workflow finished successfully.

F2 - Gene clustering

This step will cluster genes using CD-HIT and sequence similarity threshold.

Requirements
- input_folder - a path to a directory with predicted nucleotide sequences of genes (OUTPUT_FOLDER/*.fna of the previous step).
- output_folder - a path to a directory where the results will be saved.
Optional arguments
- thread_num - number of threads. (default: 1)
- suffix - suffix, that helps to identify contigs and preserve consistent filenames. (default: .fna)
Output
- gene_catalogue_split - gene cataloge split in chunks of 10,000 sequences for further analysis.
- combined_genepredictions.sorted.fna - combined predictions of complete genes sorted by length.
- nr.fa - full gene catalogue.
- nr.fa.clstr - clustered genes.
- kma_db.tar.gz - KMA database - required for quantification of gene copies in bacterial genomes (next step).

# Cluster genes
python src/f2_generate_gene_catalog.py -i INPUT_FOLDER -s .fna -o OUTPUT_FOLDER \
-t 16

DEBUG:root:Creating output directory: OUTPUT_FOLDER
DEBUG:root:Creating output directory: OUTPUT FOLDER/system
[15:23:26] Workflow started succesfully. Please, be patient.
[15:24:56] Workflow finished successfully.

F3 - Map to gene clusters

This step will quantify the number of gene clusters in sequenced reads aligning it to the reference using KMA.

Requirements
- input_folder - a path to a directory with quality-controlled reads (from the qc_and_assembly step).
- database - a path to a KMA database. (from F2 - Gene clustering step)
- output_folder - a path to a directory where the results will be saved.
Optional arguments
- suffix1 - suffix, that helps to identify forward reads. (default: _paired_1.fastq.gz)
- suffix2 - suffix, that helps to identify reverse reads. (default: _paired_2.fastq.gz)
- thread_num - number of threads. (default: 1)
Output
- SAMPLE_NAME.kma.res - KMA full output.
- SAMPLE_NAME.geneCPM.txt - table with extracted and normalized gene counts (count per million).

# Quantify gene clusters
python src/f3_generate_gene_catalog.py -i INPUT_FOLDER -s1 _paired_1.fastq.gz -s2 _paired_2.fastq.gz \
-db F2_OUTPUT_FOLDER/kma_db.tar.gz \
-o OUTPUT_FOLDER \
-t 16

DEBUG:root:Creating output directory: OUTPUT_FOLDER
DEBUG:root:Creating output directory: OUTPUT FOLDER/system
[15:26:48] Workflow started succesfully. Please, be patient.
[15:29:08] Workflow finished successfully.

F4 - Annotate gene catalog

This step will provide functional annotation of gene clusters from both eggNOG-mapper and DeepFRI.

Requirements
- input_folder - a path to a directory with gene catalog split into chunks of 10,000 reads (from F2 - gene clustering step).
- eggnog_database - a path to an eggNOG-mapper database. If it is not located in the folder, the necessary files will be downloaded automatically.
- output_folder - a path to a directory where the results will be saved.
Optional arguments
- suffix - suffix, that helps to gene catalog chunks. (default: .fa)
- thread_num_ - number of threads. (default: 1)
- concurrent_jobs - number of jobs to run in parallel. A single DeepFRI job requires 55GB of RAM, too many jobs may result in an out-of-memory error.
Output
- deepfri_annotations.csv - DeepFRI functional annotation for a gene catalog.
- nr-eggnog.emapper.annotations - eggNOG-mapper functional annotation for a gene catalog.
- nr-eggnog.emapper.seed_orthologs- a file with the results from parsing the hits. Each row links a query with a seed ortholog.

# Annotate gene catalog
python src/f4_annotate_gene_catalog.py 
-i F2_OUTPUT_FOLDER/gene_catalog_split/ -s .fa \
-db eggNOG-DATABASE \
-o OUTPUT_FOLDER \
-t 16 -c 2

INFO:root:Treating /storage/TomaszLab/vbez/metagenomic_gmhi/metagenomome_assembly/databases/eggnog-data as directory with eggNOG.
INFO:root:Treating /storage/TomaszLab/vbez/metagenomic_gmhi/metagenomome_assembly/databases/eggnog-data as directory with Diamond.
DEBUG:root:Creating output directory: OUTPUT_FOLDER
DEBUG:root:Creating output directory: OUTPUT_FOLDER/system
[15:56:27] Workflow started succesfully. Please, be patient.                                            
[16:47:31] Workflow finished successfully.

Generate final output

This step will collect all the output into one table.

Requirements
- contig_folder - a path to a directory with contigs from the qc_and_assembly step.
- bins_folder - a path to a directory with bins from the T1 - MAGs binning step.
- gtdbtk_folder - a path to a directory with GTDB-Tk results from the T1 - MAGs binning step.
- checkm_folder - a path to a directory with CheckM results from the T1 - MAGs binning step.
- gene_catalog - a path to a gene catalog file from the F2 - gene clustering step.
- gene_cluster_file - a path to a file with gene clusters.
- eggnog_annotations - a path to a file with eggNOG-mapper annotations.
- deepfri_annotations - a path to a file with DeepFRI annotations.
- output_folder - a path to a directory where the results will be saved.
Outputs
- _individual_mapped_genes.tsv - genes clusters mapped to MAGs.
- _MAGS.tsv - MAGs summary from GTDB-tk and CheckM.
- _mapped_genes_cluster.tsv - eggNOG-mapper annotations for gene clusters.
- merged_eggnog_output.tsv - eggNOG-mapper annotations for gene clusters.
- merged_deepfri_output.tsv - DeepFRI annotations for gene clusters.

# Generate final output
python ../src/generate_table.py \
-c qc_assemble_out/assemble \
-b t1_output \
-g t1_output \
-cm t1_output \
-gcf f2_output/nr.fa.clstr \
-gc f2_output/nr.fa \
-ea f4_output \
-dfa f4_output \
-o final_out

Name		Name	Last commit message	Last commit date
Latest commit History 246 Commits
docker		docker
docs		docs
example_output		example_output
src		src
tests/final_out		tests/final_out
.gitignore		.gitignore
README.md		README.md
general-principle.drawio		general-principle.drawio
metagenomics-pipeline.drawio		metagenomics-pipeline.drawio
metagenomics-pipeline.drawio.png		metagenomics-pipeline.drawio.png
pipeline.yml		pipeline.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Metagenomic assembly

How does this work?

Introduction to WDL workflow

This pipeline will perform:

Requirements

1. Installation

1. Clone the repository

2. Create a conda environment

3. Install Cromwell

2. Run the pipeline!

Attention: The pipeline was tested on two samples with 4.3 Gb and 2.7 Gb in 2 parallel jobs, 16 CPU cores and 64 Gb RAM each. Time may vary significantly depending on your system and sequencing depth

1. Quality control and assembly

Then pipeline forks into two branches - taxonomical and functional

T - Taxonomical annotation

T1 - MAG assembly and taxonomic classification

F - Functional annotation

F1 - Gene prediction

F2 - Gene clustering

F3 - Map to gene clusters

F4 - Annotate gene catalog

Generate final output

About

Releases

Packages

Languages

bioinf-mcb/metagenome_assembly-1

Folders and files

Latest commit

History

Repository files navigation

Metagenomic assembly

How does this work?

Introduction to WDL workflow

This pipeline will perform:

Requirements

1. Installation

1. Clone the repository

2. Create a conda environment

3. Install Cromwell

2. Run the pipeline!

Attention: The pipeline was tested on two samples with 4.3 Gb and 2.7 Gb in 2 parallel jobs, 16 CPU cores and 64 Gb RAM each. Time may vary significantly depending on your system and sequencing depth

1. Quality control and assembly

Then pipeline forks into two branches - taxonomical and functional

T - Taxonomical annotation

T1 - MAG assembly and taxonomic classification

F - Functional annotation

F1 - Gene prediction

F2 - Gene clustering

F3 - Map to gene clusters

F4 - Annotate gene catalog

Generate final output

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages