This repository provides the data and examples that were used to develop DeepBGC and to evaluate it against ClusterFinder and antiSMASH.
See for the DeepBGC tool.
Reproduction and storage of data files is managed using DVC (development version 0.22.0).
Each data file has a .dvc history file that contains the command that was used to generate the output, along with MD5 hashes of its dependencies.
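
As a rough illustration of what the recorded hashes are for, the sketch below computes the MD5 digest of a local file so that it can be compared by hand with a value copied from a .dvc file. It assumes the recorded value is a plain MD5 of the file bytes, which may not hold for every file type or DVC version; the function names are hypothetical and are not part of DVC or this repository.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_md5):
    """Compare a file's MD5 with a hash copied by hand from a .dvc file (assumption: raw-byte MD5)."""
    return file_md5(path) == recorded_md5.strip().lower()
```

A mismatch only indicates that the local file differs from the version recorded in the .dvc file; DVC itself remains the authority on whether an output is up to date.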
- Install Python 3, ideally using conda
- Run `pip install -r requirements.txt` to download DVC and other requirements
- Run the AWS config script to generate temporary AWS credentials in `~/.aws/credentials`: `generate-aws-config --account lab --insecure`
- Run `dvc pull data/path/to/file.dvc` to download the required file.
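
Before pulling data, it can help to confirm that the credentials step actually produced a usable file. The following is a minimal sketch that lists the profile names in an AWS shared credentials file, which uses an INI-style layout. The function name is hypothetical, and which profile name the config script writes is not stated here, so the sketch only reports what it finds.

```python
import configparser
from pathlib import Path


def list_aws_profiles(credentials_path=None):
    """Return the profile names found in an AWS shared credentials file.

    Defaults to ~/.aws/credentials; returns an empty list if the file is missing.
    """
    path = Path(credentials_path) if credentials_path else Path.home() / ".aws" / "credentials"
    if not path.is_file():
        return []
    parser = configparser.ConfigParser()
    parser.read(str(path), encoding="utf-8")
    # Each [section] in the file corresponds to one named profile.
    return parser.sections()
```

An empty result from `list_aws_profiles()` suggests the credentials were not generated or have been written somewhere else.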
- bgc_detection/ all the code
- data/ all the data
  - bacteria/ 3k reference bacteria
  - candidates/ novel detected BGC candidates
  - clusterfinder/ ClusterFinder (Cimermancic et al.) datasets
  - evaluation/ Cross-validation, Leave-Class-Out and Bootstrap evaluation
  - features/ Pfam2vec and other protein domain features
  - figures/ Paper figures
  - mibig/ MIBiG BGC database samples
  - models/ Model configurations and trained models
  - pfam/ Pfam repository files
  - training/ Negative and positive training data and t-SNE visualizations
- notebooks/ Jupyter (iPython) notebooks
- Define a JSON config file; see data/models/config for reference.
- Run bgc_detection/ with the given config and the path to the training data. See the DVC files in data/models/trained for reference.
- The trained model is saved as a Python pickle file.
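
The two file types involved in training can be read back with the standard library. This is a minimal sketch only: the function names are hypothetical, the structure of the config is whatever the files in data/models/config define, and the type of the unpickled model object depends on the training code.

```python
import json
import pickle


def load_config(path):
    """Read a JSON model config into plain Python objects (no schema is assumed)."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def load_trained_model(path):
    """Load a pickled model object from disk.

    Only unpickle files you trust: pickle can execute arbitrary code on load.
    The classes the model was built from must be importable when loading.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Because unpickling needs the original class definitions, loading a trained model will most likely require the repository code to be importable in the same Python environment.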
- Prepare a protein FASTA file, e.g. using Prodigal (see data/bacteria/proteins.dvc for reference), or extract it from an annotated GenBank file using bgc_detection/preprocessing/
- Detect protein domains using hmmscan (see data/bacteria/domtbl.dvc for reference)
- Convert the hmmscan domtbl file into a Domain CSV file using bgc_detection/preprocessing/ (see data/bacteria/domains.dvc for reference)
- Predict domain-level BGC probabilities using bgc_detection/ (see data/bacteria/prediction/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k.dvc for reference)
- Threshold and merge domain-level predictions into a BGC candidate CSV file using bgc_detection/candidates/ (see data/bacteria/candidates/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k-fpr2/candidates.csv.dvc for reference)
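
The last step above, thresholding and merging, can be sketched as follows. This is an illustration of the general idea and not the repository's implementation: the record keys (`contig_id`, `start`, `end`, `probability`), the function name and the 0.5 threshold are all assumptions, and the actual script may choose its threshold and merge rules differently.

```python
from itertools import groupby


def merge_candidates(domains, threshold=0.5):
    """Merge runs of consecutive above-threshold domains into candidate regions.

    `domains` is a list of dicts, grouped by contig and ordered by position,
    with hypothetical keys: contig_id, start, end, probability.
    """
    candidates = []
    # Group by contig first so that a candidate never spans two contigs.
    for contig_id, contig_domains in groupby(domains, key=lambda d: d["contig_id"]):
        current = None
        for domain in contig_domains:
            if domain["probability"] >= threshold:
                if current is None:
                    # Start a new candidate at the first above-threshold domain.
                    current = {"contig_id": contig_id, "start": domain["start"],
                               "end": domain["end"], "num_domains": 1}
                else:
                    # Extend the open candidate to cover this domain.
                    current["end"] = max(current["end"], domain["end"])
                    current["num_domains"] += 1
            elif current is not None:
                # A below-threshold domain closes the open candidate.
                candidates.append(current)
                current = None
        if current is not None:
            candidates.append(current)
    return candidates
```

In this sketch a single below-threshold domain splits a region in two; a real pipeline might instead tolerate short gaps or filter out very short candidates.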
See notebooks/LabelledContigBootstrap.ipynb.
See data/evaluation/lco-neg-10k (TODO).
See data/evaluation/cv-10fold-neg-10k (TODO).
See notebooks/CandidateClassification.ipynb and notebooks/CandidateActivityClassification.ipynb.