EGAP Pipeline

Overview

EGAP (Entheome Genome Assembly Pipeline) is a versatile bioinformatics pipeline for hybrid genome assembly from Oxford Nanopore, Illumina, and PacBio data. It supports multiple input modes and assembly methods and determines the best based on multiple metrics: BUSCO Completeness (Single + Duplicated), Assembly Contig Count, Assembly N50, Assembly L50, and Assembly GC-content.

Preprocess & QC Reads
- Merge or detect multiple FASTQ files.
- Perform read trimming and adapter removal (Trimmomatic, BBDuk).
- Deduplicate reads (Clumpify).
- Filter and correct ONT reads (Filtlong, Ratatosk).
- Generate read metrics (FastQC, NanoPlot, BBMap-based insert-size checks).
Assembly
- MaSuRCA - Will be performed with Illumina Only or with Hybrid (ONT or PacBio).
- Flye - Will be performed with ONT only or PacBio only.
- SPAdes - Will be performed with Illumina only or with Hybrid (ONT or PacBio).
- hifiasm - Will be performed with PacBio only.
Assembly Polishing
- Polishes assemblies with Racon (in ONT) and Pilon (if Illumina).
- Removes haplotigs with purge_dups.
Assembly Curation
- Scaffolds and patches using RagTag against a reference genome (if provided).
- Performs final gap-closing with either TGS-GapCloser (if ONT) or ABySS-Sealer (if Illumina-only).
Quality Assessments & Classification
- Runs Compleasm (BUSCO) on two lineages to measure completeness.
- Runs QUAST for contiguity statistics (N50, L50, GC%, etc.).
- Rates the final assembly as AMAZING, GREAT, OK, or POOR based on combined metrics.
- (Optional) Integrates coverage calculations against a reference or final assembly size.

Though optimized for fungal genomes, EGAP can be adapted to many other organisms by switching up lineages, reference sequences, or default thresholds.

Note on Supported Input Modes:
Currently, the pipeline supports only the following combinations:

Illumina input only (SRA, DIR, or RAW FASTQ files)
Illumina input + Reference sequence (either GCA accession or a path to a FASTA file)
Illumina input + ONT input (SRA, DIR, or RAW FASTQ files)
Illumina input + ONT input + Reference sequence
PacBio input only (SRA, DIR, or RAW FASTQ files)
PacBio input only + Reference sequence
Illumina input + PacBio input
Illumina input + PacBio input + Reference sequence
Assembly Only Input (for Assembly QC analysis)

NOTE: it is typically not recommended to use a Reference Sequence when assembling Fungal genomes as it can mask rearrangements.

Future developments include support for:

ONT input only
ONT input + Reference sequence

Installation

The following tools are installed:

The available shell script (e.g., EGAP_setup.sh) can install a;; dependencies (Python 3.8+, Conda, and the main bioinformatics tools):

bash /path/to/EGAP_setup.sh

Alternatively, you can install the Entheome Ecosystem via Docker. Open a terminal in the directory where the Dockerfile is located.

docker build -t entheome_ecosystem .

Run the container (adjust the path accordingly):

docker run -it -v /path/to/data:/path/to/data entheome_ecosystem bash

Inside the Docker container, load the pre-generated EGAP environment:

source /EGAP_env/bin/activate

Anaconda installation in a dedicated environment through the Bioconda channel with the following command:

conda create -y EGAP_env python=3.8 && conda activate EGAP_env && conda install -y -c bioconda egap

Pipeline Flow

Command-Line Usage

Parameters:

--input_csv, -csv (str): Path to a CSV containing multiple sample data. (default = None)
--ont_sra, -osra (str): Oxford Nanopore Sequence Read Archive (SRA) Accession number. (if -csv = None; else None)
--raw_ont_dir, -odir (str): Path to a directory containing all Raw ONT Reads. (if -csv = None; else REQUIRED)
--raw_ont_reads, -i0 (str): Path to the combined Raw ONT FASTQ reads. (if -csv = None; else REQUIRED)
--illu_sra, -isra (str): Illumina Sequence Read Archive (SRA) Accession number. (if -csv = None; else None)
--raw_illu_dir, -idir (str): Path to a directory containing all Raw Illumina Reads. (if -csv = None; else None)
--raw_illu_reads_1, -i1 (str): Path to the Raw Forward Illumina Reads. (if -csv = None; else REQUIRED)
--raw_illu_reads_2, -i2 (str): Path to the Raw Reverse Illumina Reads. (if -csv = None; else REQUIRED)
--pacbio_sra, -psra (str): PacBio Sequence Read Archive (SRA) Accession number. (if -csv = None; else None)
--raw_pacbio_dir, -pdir (str): Path to a directory containing all Raw PacBio Reads. (if -csv = None; else None)
--raw_pacbio_reads, -preads (str): Path to the combined Raw PacBio FASTQ reads. (if -csv = None; else REQUIRED)
--species_id, -ID (str): Species ID formatted as <2-letters of Genus>_<full species name>-<other identifiers>. (if -csv = None; else REQUIRED)
--organism_kingdom, -Kg (str): Kingdom the current organism data belongs to. (default: None)
--organism_karyote, -Ka (str): Karyote type of the organism. (default: None)
--compleasm_1, -c1 (str): Name of the first organism compleasm/BUSCO database to compare to. (default: None)
--compleasm_2, -c2 (str): Name of the second organism compleasm/BUSCO database to compare to. (default: None)
--est_size, -es (str): Estimated size of the genome in Mbp (e.g., 60m). (default: None)
--ref_seq_gca, -rgca (str): Curated Genome Assembly (GCA) Accession number. (default = None)
--ref_seq, -rf (str): Path to the reference genome for assembly. (default: None)
--percent_resources, -R (float): Percentage of resources for processing. (default: 0.90)
--cpu_threads, -T (float): Exact number of CPU threads to use. (default: None)
--ram_gb, -ram, (float): Exact amount of RAM (in GB) to use. (default: None)

Example Command:

EGAP --raw_ont_reads /path/to/ont_reads.fq.gz \ # A combined reads file of all ONT raw reads
     --raw_illu_dir /path/to/illumina_reads/ \ # A folder containing an md5 checksum file and individual Illumina reads files
     --species_id AB_speciesname \
     --organism_kingdom Funga \
     --organism_karyote Eukaryote \
     --compleasm_1 basidiomycota \
     --compleasm_2 agaricales \
     --est_size 60m \
     --percent_resources 0.8

Or, providing SRA numbers (which will download the files into the current working directory):

EGAP --ont_sra SRR######## \
     --illu_sra SRR######## \
     --species_id AB_speciesname \
     --organism_kingdom Funga \
     --organism_karyote Eukaryote \
     --compleasm_1 basidiomycota \
     --compleasm_2 agaricales \
     --est_size 60m \
     --cpu_threads 10 \ # Providing a specific number for CPUs and NOT using percent resources requires the next line
     --ram_gb 32 # Required if NOT using percent resources

Note: Do not use multiple inputs for the same data type (e.g., do NOT use illu_sra and raw_illu_dir simultaneously).

Alternatively, using a CSV file for multiple samples:

EGAP --input_csv /path/to/samples.csv

CSV Generation

To run EGAP with multiple samples, provide a CSV file containing the necessary information for each sample.

CSV Format

The CSV file should have the following header and columns:

ONT_SRA	ONT_RAW_DIR	ONT_RAW_READS	ILLUMINA_SRA	ILLUMINA_RAW_DIR	ILLUMINA_RAW_F_READS	ILLUMINA_RAW_R_READS	PACBIO_SRA	PACBIO_RAW_DIR	PACBIO_RAW_READS	SPECIES_ID	ORGANISM_KINGDOM	ORGANISM_KARYOTE	COMPLEASM_1	COMPLEASM_2	EST_SIZE	REF_SEQ_GCA	REF_SEQ
None	None	None	SRA00000001	None	None	None	None	None	None	Ab_sample1	Funga	Eukaryote	basidiomycota	agaricales	55m	GCA00000001.1	None
None	None	/path/to/ONT/sample1.fq.gz	None	None	/path/to/Illumina/sample1_1.fq.gz	/path/to/Illumina/sample1_2.fq.gz	None	None	None	Ab_sample2	Funga	Eukaryote	basidiomycota	agaricales	60m	None	None
None	None	None	None	None	None	None	None	None	/path/to/pacbio.fastq.gz	Ab_sample3	Funga	Eukaryote	basidiomycota	agaricales	55m	None	None
None	None	None	SRA00000002	None	None	None	None	None	/path/to/pacbio.fastq.gz	Ab_sample4_sub-name	Funga	Eukaryote	basidiomycota	agaricales	55m	GCA00000002.1	None

Column Descriptions

ONT_SRA: Oxford Nanopore Sequence Read Archive (SRA) Accession number. Use None if specifying individual files.
ONT_RAW_DIR: Path to the directory containing all Raw ONT Reads. Use None if specifying individual files.
ONT_RAW_READS: Path to the combined Raw ONT FASTQ reads (e.g., /path/to/ONT/sample1.fq.gz).
ILLUMINA_SRA: Illumina Sequence Read Archive (SRA) Accession number. Use None if specifying individual files.
ILLUMINA_RAW_DIR: Path to the directory containing all Raw Illumina Reads. Use None if specifying individual files.
ILLUMINA_RAW_F_READS: Path to the Raw Forward Illumina Reads (e.g., /path/to/Illumina/sample1_R1.fq.gz).
ILLUMINA_RAW_R_READS: Path to the Raw Reverse Illumina Reads (e.g., /path/to/Illumina/sample1_R2.fq.gz).
PACBIO_SRA: PacBio Sequence Read Archive (SRA) Accession number. Use None if specifying individual files.
PACBIO_RAW_DIR: Path to the directory containing all Raw PacBio Reads. Use None if specifying individual files.
PACBIO_RAW_READS: Path to the combined Raw PacBio FASTQ reads (e.g., /path/to/PACBIO/sample1.fq.gz).
SPECIES_ID: Species ID formatted as <2-letters of Genus>_<full species name>-<other identifiers> (e.g., Ab_sample1, optionally Ab_sample4-sub-name).
ORGANISM_KINGDOM: Kingdom of the organism (default: None).
ORGANISM_KARYOTE: Karyote type of the organism (default: None).
COMPLEASM_1: Name of the first compleasm/BUSCO database (default: None).
COMPLEASM_2: Name of the second compleasm/BUSCO database (default: None).
EST_SIZE: Estimated genome size in Mbp (e.g., None).
REF_SEQ_GCA: Curated Genome Assembly (GCA) Accession number (or None).
REF_SEQ: Path to the reference genome for assembly (or None).

Notes

If you provide a value for ILLUMINA_RAW_DIR, set ILLUMINA_RAW_F_READS and ILLUMINA_RAW_R_READS to None. EGAP will automatically detect and process all paired-end reads within that directory. The same applies for ONT_RAW_DIR.
Ensure that all file paths are correct and accessible.
The CSV file should not contain extra spaces or special characters in the headers.
If you just want to perform QC analysis for an already built assembly: provide the path for the assembly or GCA Accession number to download, in the REF_SEQ or REF_SEQ_GCA field respectively, provide ORGANISM_KARYOTE, and the two compleasm databases (COMPLEASM_1, COMPLEASM_2) to use; DO NOT PROVIDE ESTIMATED SIZE (EST_SIZE).

Example Data & Instructions

Create the Folder Structure

First, create the main processing folder with the required sub-folders. Adjust "EGAP_Processing" as needed:

mkdir -p /path/to/EGAP/EGAP_Processing/

Example CSV File

EGAP_test.csv is included in this repository to run test examples, to run all four files takes about 24hours on a 16 thread, 64GB system.

ONT_SRA,ONT_RAW_DIR,ONT_RAW_READS,ILLUMINA_SRA,ILLUMINA_RAW_DIR,ILLUMINA_RAW_F_READS,ILLUMINA_RAW_R_READS,PACBIO_SRA,PACBIO_RAW_DIR,PACBIO_RAW_READS,SPECIES_ID,ORGANISM_KINGDOM,ORGANISM_KARYOTE,COMPLEASM_1,COMPLEASM_2,EST_SIZE,REF_SEQ_GCA,REF_SEQ
None,None,None,None,None,None,None,SRP093873,None,None,Pa_papilionaceus,Funga,eukaryote,basidiomycota,agaricales,60m,None,None
None,None,None,SRR13870683,None,None,None,None,None,None,Ps_cubensis-B+,Funga,eukaryote,basidiomycota,agaricales,60m,GCF_017499595.1,None
SRR25920759,None,None,SRR25920760,None,None,None,None,None,None,Ps_semilanceata,Funga,eukaryote,basidiomycota,agaricales,60m,None,None
None,None,None,None,None,None,None,None,None,None,De_subviscida,Funga,eukaryote,basidiomycota,agaricales,None,GCA_013368295.1,None

Local Data

If you are providing your own data locally, be sure to have a species folder and *if needed a sub-folder matching your Species ID: Example: Illumina Only data for Psilocybe cubensis B+ with reference sequence of Psilocybe cubensis

/path/to/EGAP/EGAP_Processing/Ps_cubensis/Ps_cubensis_B+/Illumina/f_reads.fastq.gz
/path/to/EGAP/EGAP_Processing/Ps_cubensis/Ps_cubensis_B+/Illumina/r_reads.fastq.gz)
/path/to/EGAP/EGAP_Processing/Ps_cubensis/ref_seq.fasta

If no sub-folder for sub-species is needed then place everything in the main species folder i.e.:

/path/to/EGAP/EGAP_Processing/Ps_semilanceata/Illumina/f_reads.fastq.gz

Illumina-Only (with Reference Sequence) Assembly Command

EGAP --illu_sra SRR13870683 \
     --species_id Ps_cubensis_B+ \
     --organism_kingdom Funga \
     --organism_karyote eukaryote \
     --compleasm_1 agaricales \
     --compleasm_2 basidiomycota \
     --est_size 700m \
     --ref_seq_gca GCF_017499595.1

ONT/Illumina Hybrid Assembly Command

EGAP --ont_sra SRR25920759 \
     --illu_sra SRR25920760 \
     --species_id Ps_semilanceata \
     --organism_kingdom Funga \
     --organism_karyote eukaryote \
     --compleasm_1 agaricales \
     --compleasm_2 basidiomycota \
     --est_size 60m

PacBio-Only (no Reference Sequence) Assembly Command

EGAP --pacbio_sra SRP093873 \
     --species_id Pa_papilionaceus \
     --organism_kingdom Funga \
     --organism_karyote eukaryote \
     --compleasm_1 agaricales \
     --compleasm_2 basidiomycota \
     --est_size 60m

Quality Control Output Review

EGAP generates final assemblies along with:

QUAST metrics (contig count, N50, L50, GC%, coverage)
Compleasm (BUSCO) plots showing Single, Duplicated, Fragmented & Missing scores.
Final assembly classification: AMAZING, GREAT, OK, or POOR

Statistics Thresholds

The current thresholds for each metric classification (subject to change) are:

first_compleasm_c = {"AMAZING": >98.5, "GREAT": >95.0, "OK": >80.0, "POOR": <80.0}
second_compleasm_c = {"AMAZING": >98.5, "GREAT": >95.0, "OK": >80.0, "POOR": <80.0}
contigs_thresholds = {"AMAZING": <100, "GREAT": <1000, "OK": <10000, "POOR": >10000}
n50_thresholds = {"AMAZING": >1000000, "GREAT": >100000, "OK": >1000, "POOR": <1000}
l50_thresholds = {"AMAZING": #, "GREAT": #, "OK": #, "POOR": #} (still determining best metrics)

Compleasm BUSCO Plots

BUSCO outputs are evaluated based on:

Greater than 98.5% Completion (sum of Single and Duplicated genes) for an AMAZING/Great Assembly
Greater than 95.0% Completion for a Good Assembly
Greater than 80% Completion for an OK Assembly
Less than 80% Completion for a POOR Assembly

Additionally, fewer contigs aligning to BUSCO genes is preferable. Contigs with only duplicated genes are excluded from the plot (noted in the x-axis label).

Illumina-Only (with Reference Sequence) Assembly BUSCO Plots

**Ps. cubensis B+ agaricales BUSCO**

**Ps. cubensis B+ basidiomycota BUSCO**

ONT/Illumina Hybrid Assembly BUSCO Plots

**Ps. semilanceata agaricales BUSCO**

**Ps. semilanceata basidiomycota BUSCO**

PacBio-Only (no Reference Sequence) Assembly BUSCO Plots

**Pa. papilionaceus agaricales BUSCO**

**Pa. papilionaceus basidiomycota BUSCO**

Future Improvements

Automated Quality Assessment Reports: Generate comprehensive quality reports post-assembly for easier analysis.
Improved Data Management: Automated removal of excess files upon pipeline completion.
Enhanced Support for Diverse Genomes: Optimize pipeline parameters for non-fungal genomes to improve versatility.
Improved Error Handling: Develop robust error detection and user-friendly feedback.
Integration with Additional Sequencing Platforms:
- Support for ONT input only and ONT input with Reference sequence.

References

This pipeline was modified from two of the following pipelines:

Bollinger IM, Singer H, Jacobs J, Tyler M, Scott K, Pauli CS, Miller DR, Barlow C, Rockefeller A, Slot JC, Angel-Mosti V. High-quality draft genomes of ecologically and geographically diverse Psilocybe species. Microbiol Resour Announc 0:e00250-24; doi: 10.1128/mra.00250-24

Muñoz-Barrera A, Rubio-Rodríguez LA, Jáspez D, Corrales A, Marcelino-Rodriguez I, Lorenzo-Salazar JM, González-Montelongo R, Flores C. Benchmarking of bioinformatics tools for the hybrid de novo assembly of human whole-genome sequencing data.
bioRxiv 2024.05.28.595812; doi: 10.1101/2024.05.28.595812

The example data are published in:

Bollinger IM, Singer H, Jacobs J, Tyler M, Scott K, Pauli CS, Miller DR, Barlow C, Rockefeller A, Slot JC, Angel-Mosti V. High-quality draft genomes of ecologically and geographically diverse Psilocybe species. Microbiol Resour Announc 0:e00250-24; doi: 10.1128/mra.00250-24

McKernan K, Kane L, Helbert Y, Zhang L, Houde N, McLaughlin S. A whole genome atlas of 81 Psilocybe genomes as a resource for psilocybin production. F1000Research 2021, 10:961; doi: 10.12688/f1000research.55301.2

Ruiz‐Dueñas FJ, Barrasa JM, Sánchez‐García M, Camarero S, Miyauchi S, Serrano A, Linde D, Babiker R, Drula E, Ayuso‐Fernández I, Pacheco R, Padilla G, Ferreira P, Barriuso J, Kellner H, Castanera R, Alfaro M, Ramírez L, Pisabarro AG, Riley R, Kuo A, Andreopoulos W, LaButti K, Pangilinan J, Tritt A, Lipzen A, He G, Yan M, Ng V, Grigoriev IV, Cullen D, Martin F, Rosso M, Henrissat B, Hibbett D, Martínez AT. Genomic Analysis Enlightens Agaricales Lifestyle Evolution and Increasing Peroxidase Diversity. Molecular Biology and Evolution. 38(4): 1428-1446 (2020). 10.1093/molbev/msaa301.

Floudas D, Bentzer J, Ahrén D, Johansson T, Persson P, Tunlid A. Uncovering the hidden diversity of litter-decomposition mechanisms in mushroom-forming fungi. ISME J 14, 2046–2059 (2020). 10.1038/s41396-020-0667-6.

Contribution

If you would like to contribute to the EGAP Pipeline, please submit a pull request or open an issue on GitHub. For major changes, please discuss via an issue first.

License

This project is licensed under the BSD 3-Clause License.

Name		Name	Last commit message	Last commit date
Latest commit History 285 Commits
resources		resources
Dockerfile		Dockerfile
EGAP.py		EGAP.py
EGAP_setup.sh		EGAP_setup.sh
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EGAP Pipeline

Overview

Table of Contents

Installation

Pipeline Flow

Command-Line Usage

Parameters:

Example Command:

CSV Generation

CSV Format

Column Descriptions

Notes

Example Data & Instructions

Create the Folder Structure

Example CSV File

Local Data

Illumina-Only (with Reference Sequence) Assembly Command

ONT/Illumina Hybrid Assembly Command

PacBio-Only (no Reference Sequence) Assembly Command

Quality Control Output Review

Statistics Thresholds

Compleasm BUSCO Plots

Illumina-Only (with Reference Sequence) Assembly BUSCO Plots

ONT/Illumina Hybrid Assembly BUSCO Plots

PacBio-Only (no Reference Sequence) Assembly BUSCO Plots

Future Improvements

References

Contribution

License

About

Releases 11

Packages

Languages

License

iPsychonaut/EGAP

Folders and files

Latest commit

History

Repository files navigation

EGAP Pipeline

Overview

Table of Contents

Installation

Pipeline Flow

Command-Line Usage

Parameters:

Example Command:

CSV Generation

CSV Format

Column Descriptions

Notes

Example Data & Instructions

Create the Folder Structure

Example CSV File

Local Data

Illumina-Only (with Reference Sequence) Assembly Command

ONT/Illumina Hybrid Assembly Command

PacBio-Only (no Reference Sequence) Assembly Command

Quality Control Output Review

Statistics Thresholds

Compleasm BUSCO Plots

Illumina-Only (with Reference Sequence) Assembly BUSCO Plots

ONT/Illumina Hybrid Assembly BUSCO Plots

PacBio-Only (no Reference Sequence) Assembly BUSCO Plots

Future Improvements

References

Contribution

License

About

Resources

License

Stars

Watchers

Forks

Releases 11

Packages 0

Languages

Packages