Author | Arthur Morris (ArthurVM), Anna Price (annacprice)
Email | arthurvmorris@gmail.com
License | GNU GPL v3.0
A toolkit for variant-level metagenomic disambiguation of NGS reads.
Installation is quick and easy. Simply run:
git clone git@github.com:ArthurVM/Afanc.git
cd Afanc
pip3 install ./
Afanc is split into 3 sub-modules:
get_dataset Download a dataset of genome assemblies from GenBank.
autodatabase Generate a database from a FASTA directory structure with autodatabase.
screen High-resolution metagenomic screening of short read data using a database
constructed by autodatabase.
These modules enable the user to download a dataset of genome assemblies using species names or accession IDs with get_dataset, perform quality control and construct a Kraken2 database with autodatabase, and screen reads against this database to produce a high-resolution metagenomic report with screen.
get_dataset is a general ease-of-use module. The user provides a line-separated text file of species names (e.g. Escherichia coli) or accessions (e.g. NC_000913.3), and the module downloads the corresponding assemblies and deposits them in a directory structure that autodatabase accepts. The user can also specify how many assemblies of each species to download (not available if accessions are provided).
Given the text file
Mycobacterium tuberculosis
Mycobacterium tuberculosis variant bovis BCG
Mycobacterium avium
Mycobacterium simiae
the directory structure will be
.
├── Mycobacterium_avium
│   ├── assembly_1.fa
│   ├── assembly_2.fa
│   └── assembly_3.fa
├── Mycobacterium_simiae
│   ├── assembly_1.fa
│   ├── assembly_2.fa
│   └── assembly_3.fa
└── Mycobacterium_tuberculosis
    ├── assembly_1.fa
    ├── assembly_2.fa
    ├── assembly_3.fa
    └── Mycobacterium_tuberculosis_variant_bovis_BCG
        ├── assembly_1.fa
        ├── assembly_2.fa
        └── assembly_3.fa
If accessions are provided, they will just be deposited into a single directory.
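The mapping from the species list to the directory tree above can be sketched in Python. This is a minimal illustration, not Afanc's own code: the function name `taxon_to_relpath` is hypothetical, and it assumes that the first two words of each line form the species binomial and that any longer name is a variant nested under its species directory.

```python
from pathlib import PurePosixPath


def taxon_to_relpath(name: str) -> PurePosixPath:
    """Map a taxon name to a relative directory path (hypothetical helper).

    Assumption: the first two words are the species binomial; a name with
    more words is treated as a variant nested under the species directory.
    """
    words = name.split()
    if len(words) < 2:
        raise ValueError(f"expected at least a genus and species: {name!r}")
    species_dir = "_".join(words[:2])
    if len(words) == 2:
        return PurePosixPath(species_dir)
    # Variant directories carry the full name, with spaces replaced by underscores.
    return PurePosixPath(species_dir) / "_".join(words)


names = [
    "Mycobacterium tuberculosis",
    "Mycobacterium tuberculosis variant bovis BCG",
    "Mycobacterium avium",
    "Mycobacterium simiae",
]
for taxon in names:
    print(taxon_to_relpath(taxon))
```

Under these assumptions the variant line resolves to a subdirectory of Mycobacterium_tuberculosis, matching the tree shown above.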
Autodatabase automates the process of constructing a Kraken2 database. It is a Python reimplementation of the Nextflow pipeline (https://github.com/annacprice/autodatabase).
This module takes a directory structure as described above in the get_dataset section. It must contain one directory per species-level taxon; subdirectories within each species directory hold subspecies, variants, strains, or any other taxonomic rank below species (hereafter referred to simply as variants).
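Before building a database it can help to check that an input directory follows this species/variant layout. The sketch below is an illustration only and is not part of Afanc: `index_assemblies` is a hypothetical helper, and the set of accepted FASTA suffixes is an assumption.

```python
import tempfile
from pathlib import Path

# Assumption: which file suffixes count as FASTA assemblies.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}


def index_assemblies(root):
    """Return {species: {"assemblies": [...], "variants": {variant: [...]}}}."""
    index = {}
    for species_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        entry = {"assemblies": [], "variants": {}}
        for child in sorted(species_dir.iterdir()):
            if child.is_file() and child.suffix in FASTA_SUFFIXES:
                # FASTA files directly under the species directory.
                entry["assemblies"].append(child.name)
            elif child.is_dir():
                # Subdirectories are treated as variants of the species.
                entry["variants"][child.name] = sorted(
                    f.name
                    for f in child.iterdir()
                    if f.is_file() and f.suffix in FASTA_SUFFIXES
                )
        index[species_dir.name] = entry
    return index


# Small demonstration on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    variant = Path(tmp) / "Mycobacterium_tuberculosis" / "Mycobacterium_tuberculosis_variant_bovis_BCG"
    variant.mkdir(parents=True)
    (variant.parent / "assembly_1.fa").touch()
    (variant / "assembly_1.fa").touch()
    demo_index = index_assemblies(tmp)

print(demo_index)
```

A species with an empty "assemblies" list and no variants would be an early sign that the input directory is not laid out as expected.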
There are six stages to the workflow of this module:
1) Download the NCBI taxonomy
2) Add the taxonomic ID to the sequence IDs and the filenames
3) Create a mash matrix for each taxon
4) Use the mash matrix to select high quality assemblies for database construction
5) Build the Kraken2 database
6) Create a Krona chart showing the composition of the Kraken2 database
By default, it will use the NCBI taxonomy database from 2020-05-01. If a species or variant is not found within the NCBI database, Afanc will attempt to add it to the database and assign it an NCBI taxonomy ID.
The screen module takes a database produced by the autodatabase module and paired-end read data in .fastq format, and performs metagenomic analysis on the reads. It produces a report in .json format.
There are five stages to the workflow of this module:
1) Check autodatabase structure is correct
2) Run Kraken2 on the input dataset
3) Parse and filter the Kraken2 report to determine the species contained within the dataset, and the most likely variants ("hits")
4) Map reads to hit assemblies
5) Construct report
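Stage 3 can be illustrated with a small parser for the standard six-column Kraken2 report (percentage, clade reads, direct reads, rank code, taxonomy ID, indented name). This is a hedged sketch rather than Afanc's actual filtering logic: the names `parse_report` and `species_hits`, the thresholds `min_percent` and `min_reads`, and the read counts in the sample report are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class ReportRow(NamedTuple):
    percent: float      # % of fragments in the clade rooted at this taxon
    clade_reads: int    # fragments covered by the clade
    direct_reads: int   # fragments assigned directly to this taxon
    rank: str           # rank code, e.g. G, S, S1
    taxid: int          # NCBI taxonomy ID
    name: str           # scientific name, indentation stripped


def parse_report(text: str) -> List[ReportRow]:
    """Parse the standard six-column, tab-separated Kraken2 report."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue  # skip blank lines and other report layouts
        pct, clade, direct, rank, taxid, name = fields
        rows.append(ReportRow(float(pct), int(clade), int(direct), rank, int(taxid), name.strip()))
    return rows


def species_hits(rows, min_percent=1.0, min_reads=100):
    """Keep species-level rows (S, S1, ...) above illustrative thresholds."""
    return [
        r for r in rows
        if r.rank.startswith("S") and r.percent >= min_percent and r.clade_reads >= min_reads
    ]


# Made-up report lines, for illustration only.
SAMPLE = "\n".join([
    " 90.00\t9000\t10\tG\t1763\t  Mycobacterium",
    " 85.00\t8500\t8500\tS\t1773\t    Mycobacterium tuberculosis",
    "  0.50\t50\t50\tS\t1764\t    Mycobacterium avium",
])

hits = species_hits(parse_report(SAMPLE))
print([h.name for h in hits])
```

With these example thresholds the low-abundance species row is dropped and only the dominant species is kept as a hit.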
Afanc modules should generally be run in the order presented above. The get_dataset module is not necessary if you already have genome assemblies in the directory structure outlined previously.
afanc get_dataset species_list.txt -n 5 -o my_assemblies_dir
This will create a directory structure containing up to 5 assemblies of each species/variant downloaded from GenBank (fewer if not enough are available). This can then be fed into the autodatabase module:
afanc autodatabase my_assemblies_dir -o my_assemblies_DB
This will create a directory structure, which constitutes the database for screening reads against.
afanc screen my_assemblies_DB my_reads_1.fq.gz my_reads_2.fq.gz -o my_analysis
Results will be deposited in a directory structure within my_analysis.
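For scripted runs, the three commands can be assembled in Python and passed to `subprocess.run`. The sketch below only builds and prints the command lines; it uses just the arguments and flags shown in the commands above, and the helper name `build_commands` is hypothetical.

```python
import shlex


def build_commands(species_list, assemblies_dir, db_dir, reads_1, reads_2, out_dir, n=5):
    """Return the three afanc invocations, in order, as argument lists."""
    return [
        ["afanc", "get_dataset", species_list, "-n", str(n), "-o", assemblies_dir],
        ["afanc", "autodatabase", assemblies_dir, "-o", db_dir],
        ["afanc", "screen", db_dir, reads_1, reads_2, "-o", out_dir],
    ]


commands = build_commands(
    "species_list.txt",
    "my_assemblies_dir",
    "my_assemblies_DB",
    "my_reads_1.fq.gz",
    "my_reads_2.fq.gz",
    "my_analysis",
)

for cmd in commands:
    # Print a shell-safe rendering; to execute instead, call
    # subprocess.run(cmd, check=True) so a failing stage stops the chain.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing argument lists rather than a single shell string avoids quoting problems with paths that contain spaces.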
Afanc has a number of dependencies which must be satisfied for full functionality. All software must be in PATH.
perl
Python 3.7
Entrez Direct E-utilities
Mash v2.3
Kraken2 v2.1.2
ncbi-blast+
Krona
bowtie2
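A quick way to confirm that these tools are visible on PATH is to look each one up with Python's `shutil.which`. This is a sketch, not part of Afanc: the executable name chosen for each package (for example `esearch` for Entrez Direct and `ktImportTaxonomy` for Krona) is an assumption about which binary to probe, and the check does not verify versions.

```python
import shutil

# Assumption: one representative executable per dependency listed above.
REQUIRED = {
    "perl": "perl",
    "Entrez Direct E-utilities": "esearch",
    "Mash": "mash",
    "Kraken2": "kraken2",
    "ncbi-blast+": "blastn",
    "Krona": "ktImportTaxonomy",
    "bowtie2": "bowtie2",
}


def missing_tools(required=REQUIRED):
    """Return the names of dependencies whose executable is not on PATH."""
    return [name for name, exe in required.items() if shutil.which(exe) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed executables were found on PATH.")
```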
Install instructions for NCBI Datasets can be found at https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/
wget https://github.com/marbl/Mash/releases/download/v2.3/mash-Linux64-v2.3.tar
tar -xf mash-Linux64-v2.3.tar
mv mash-Linux64-v2.3/mash /usr/local/bin
apt-get update
apt-get install ncbi-blast+
https://github.com/DerrickWood/kraken2/archive/refs/tags/v2.1.2.tar.gz
wget https://github.com/DerrickWood/kraken2/archive/refs/tags/v2.1.2.tar.gz
tar -xzf v2.1.2.tar.gz
cd kraken2-2.1.2
./install_kraken2.sh /usr/local/bin
git clone https://github.com/marbl/Krona
mkdir -p Krona/KronaTools/taxonomy
cd Krona/KronaTools
./install.pl
./updateTaxonomy.sh
curl -fsSL https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.3.4.1/bowtie2-2.3.4.1-source.zip -o bowtie2-2.3.4.1-source.zip
unzip bowtie2-2.3.4.1-source.zip
make -C bowtie2-2.3.4.1 prefix=/usr/local install
rm -r bowtie2-2.3.4.1
rm bowtie2-2.3.4.1-source.zip
Single species dataset
❯ zcat Myco_sim_data/M1/M1.1.1_{1,2}.fq.gz | wc -l
643960
❯ time afanc screen -o M1.1.1 -t 6 ./Mycobacteriaciae_DB_3.0 Myco_sim_data/M1/M1.1.1_1.fq.gz Myco_sim_data/M1/M1.1.1_2.fq.gz > M1.1.1.afanc.txt
afanc screen -o M1.1.1 -t 6 ./Mycobacteriaciae_DB_3.0 > M1.1.1.afanc.txt 48.46s user 2.87s system 224% cpu 22.836 total
Highly complex compound (multi-species/variant) dataset
❯ zcat Myco_sim_data/M2/M2.1_{1,2}.fq.gz | wc -l
2341696
❯ time afanc screen -o M2.1 -t 6 ./Mycobacteriaciae_DB_3.0 Myco_sim_data/M2/M2.1_1.fq.gz Myco_sim_data/M2/M2.1_2.fq.gz > M2.1.afanc.txt
afanc screen -o M2.1 -t 6 ./Mycobacteriaciae_DB_3.0 > M2.1.afanc.txt 240.89s user 4.71s system 257% cpu 1:35.22 total
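To put the two timings on a common scale, the line counts can be converted to read counts and divided by wall-clock time. The sketch below assumes standard four-line FASTQ records and uses only the figures printed above; the helper name `fastq_reads` is hypothetical.

```python
def fastq_reads(line_count: int) -> int:
    """Convert a FASTQ line count to a read count (4 lines per record)."""
    if line_count % 4:
        raise ValueError("line count is not a multiple of 4")
    return line_count // 4


# (total lines across both mate files, wall-clock seconds) from the runs above.
runs = {
    "M1.1.1": (643960, 22.836),
    "M2.1": (2341696, 60 + 35.22),  # 1:35.22 total
}

for sample, (lines, seconds) in runs.items():
    reads = fastq_reads(lines)
    pairs = reads // 2  # both mate files were counted together
    print(f"{sample}: {reads} reads ({pairs} pairs), {reads / seconds:.0f} reads/s")
```

These are single runs with 6 threads on one machine, so the rates are indicative only.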