Capybara, a Core-snp Assignment PYthon tool for Acinetobacter baumannii
Capybara identifies hierarchical populations within the epidemic super-lineage (ESL) of Acinetobacter baumannii using a set of core-genome SNPs. For details on the ESL, or to cite Capybara, see DOI: 10.21203/rs.3.rs-4129268/v1.
Capybara was developed and tested in Python 3.9.0, and requires several external tools:
minimap2
mash
samtools
bcftools
You can install these tools with conda using the command below:
conda install -c bioconda samtools bcftools minimap2 mash
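Before running Capybara, it can help to confirm that the four tools are actually visible on your PATH. The snippet below is a minimal sketch and not part of Capybara; the helper name `missing_tools` is hypothetical, and it only checks that each executable can be found, not its version.

```python
import shutil

# The four external tools listed above.
REQUIRED_TOOLS = ("minimap2", "mash", "samtools", "bcftools")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

If anything is reported as missing, activate the conda environment where the tools were installed and try again.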
Then use git to clone Capybara onto your machine and run the bundled example:
git clone git@github.com:Zhou-lab-SUDA/CAPYBARA.git
$ cd /path/to/Capybara/
$ capy.py -i Examples/2.5.6.fna
This generates a report file describing the population assignment of Examples/2.5.6.fna.
A single run on an assembled genome finishes in under 3 minutes on a 4-CPU laptop (more than 10 minutes for short reads).
Usage: capy.py [OPTIONS]
Options:
-i, --query TEXT [Required] Input data, both assembled genome or short reads are acceptable.
-p, --prefix TEXT [Optional] Prefix for output file. Default as Capy.
-t, --threads INTEGER [Optional] Number of process to use. default: 8
-l, --list TEXT [Optional] A file containing list of query files, one per line.
--help Show this message and exit.
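For batch runs, the `-l` option expects a plain text file with one query path per line. The sketch below shows one way to build such a file from a directory of assemblies. The function name `write_query_list` and the accepted file suffixes are assumptions made for illustration; adjust them to match your own data.

```python
from pathlib import Path


def write_query_list(input_dir, list_path, suffixes=(".fna", ".fasta", ".fa")):
    """Write the paths of matching files in input_dir to list_path, one per line."""
    queries = sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix in suffixes  # suffixes are an assumption
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in queries))
    return queries


if __name__ == "__main__":
    examples = Path("Examples")
    if examples.is_dir():
        written = write_query_list(examples, "queries.txt")
        print(f"Wrote {len(written)} query paths to queries.txt")
```

The resulting file can then be passed to the `-l` option.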
Capybara generates a report file in the format below:
query | ESL | Lineage | Variant |
---|---|---|---|
2.5.6.fna | True | 2.5 | 2.5.6 |
IC7.fna | False | - | - |
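To work with the results programmatically, the report can be loaded into a list of records. The sketch below assumes the report is laid out exactly like the pipe-delimited table shown above; if the file on disk uses another delimiter, the splitting step needs to change. The function name `parse_report` is hypothetical.

```python
def parse_report(text):
    """Parse a pipe-delimited report table into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if not any(cells):
            continue  # blank line
        if header is not None and all(c and set(c) <= set("-:") for c in cells) and not rows:
            continue  # the ---|--- separator row directly under the header
        if header is None:
            header = cells
            continue
        rows.append(dict(zip(header, cells)))
    return rows


if __name__ == "__main__":
    sample = (
        "query | ESL | Lineage | Variant |\n"
        "---|---|---|---|\n"
        "2.5.6.fna | True | 2.5 | 2.5.6 |\n"
        "IC7.fna | False | - | - |\n"
    )
    for row in parse_report(sample):
        if row["ESL"] == "True":
            print(row["query"], row["Lineage"], row["Variant"])
```

All values are kept as strings, so ESL membership is tested by comparing against the text `True`.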
A basic Capybara run proceeds as follows:
- ESL identification:
  - All 5,824 representative genomes are pre-sketched. The genetic distance between the query data and the pre-sketched data is evaluated to find the closest genomes.
  - If the query data does not contain any sequence information related to ESL genomes, it is classified as non-ESL. Otherwise, it is analyzed as follows.
- Sequence alignment:
  - The query data is aligned to the ESL reference genome (MDR-TJ:GCF_000187205.2) to generate a BAM file.
- SNP calling:
  - SNPs are called from the BAM file to generate a VCF file.
- Population assignment:
  - A pre-built SNP scheme is used to assign the hierarchical population of the query data.
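The first three stages above map onto standard invocations of mash, minimap2, samtools, and bcftools. The sketch below only builds the command lines and parses `mash dist` output. It is not Capybara's actual implementation: the function names, the minimap2 presets, and the file names in the example are assumptions, and the exact flags and distance cutoff Capybara uses are not stated here.

```python
def mash_dist_cmd(sketch, query):
    # mash dist prints tab-separated rows:
    # reference ID, query ID, distance, p-value, shared hashes
    return ["mash", "dist", sketch, query]


def closest_reference(mash_output):
    """Return (reference_id, distance) for the smallest Mash distance, or None."""
    best = None
    for line in mash_output.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ref, dist = fields[0], float(fields[2])
        if best is None or dist < best[1]:
            best = (ref, dist)
    return best


def align_cmd(reference, query, sam, threads=8, preset="asm5"):
    # Assumed presets: "asm5" for an assembly, "sr" for short reads.
    return ["minimap2", "-a", "-x", preset, "-t", str(threads), "-o", sam, reference, query]


def sort_cmd(sam, bam):
    # Coordinate-sort the alignments and write them as BAM.
    return ["samtools", "sort", "-o", bam, sam]


def pileup_cmd(reference, bam):
    # Writes the pileup to stdout, to be piped into bcftools call.
    return ["bcftools", "mpileup", "-f", reference, bam]


def call_cmd(vcf):
    # Reads the pileup from stdin and writes variant sites only.
    return ["bcftools", "call", "-m", "-v", "-o", vcf]


if __name__ == "__main__":
    ref = "capydb/esl/esl.fna"
    for cmd in (
        align_cmd(ref, "Examples/2.5.6.fna", "query.sam", threads=4),
        sort_cmd("query.sam", "query.bam"),
        pileup_cmd(ref, "query.bam"),
        call_cmd("query.vcf"),
    ):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run`, with the `bcftools mpileup` output piped into `bcftools call`. The final population assignment depends on Capybara's pre-built SNP scheme and is not reproduced in this sketch.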
Workflow chart:
All data required to reproduce the analysis are distributed in this repository under CAPYBARA/capydb/, which includes:
- esl/esl.fna
Reference genome for ESL.
- msh/*.msh
5,824 sketch files generated with mash sketch.