This repository provides the data and examples that were used to develop DeepBGC and to evaluate it alongside ClusterFinder and antiSMASH.
See https://github.com/Merck/deepbgc for the DeepBGC tool.
Reproduction and storage of data files is managed using DVC (development version 0.22.0).
Each data file has a .dvc history file that contains the command that was used to generate the output, along with MD5 hashes of its dependencies.
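To illustrate how a recorded hash can be checked by hand, here is a minimal Python sketch that computes the MD5 digest of a single file and compares it with a value copied from a .dvc file. It assumes the recorded value is a plain MD5 of the file's bytes, which may not hold for directories or for every DVC version; the function names are hypothetical and are not part of DVC or of this repository.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_md5):
    """Compare a file's MD5 with a recorded hex digest (assumption: plain byte-level MD5)."""
    return file_md5(path) == recorded_md5.strip().lower()
```

If the two values differ, the dependency has probably changed since the output was generated, and the output may need to be reproduced.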
- Install Python 3, ideally using conda.
- Install DVC and the other requirements:
    pip install -r requirements.txt
- Run the AWS config script to generate temporary AWS credentials in ~/.aws/credentials:
    generate-aws-config --account lab --insecure
- Download a required file:
    dvc pull data/path/to/file.dvc
- bgc_detection/ all the code
- data/ all the data
  - bacteria/ 3k reference bacteria
  - candidates/ novel detected BGC candidates
  - clusterfinder/ ClusterFinder (Cimermancic et al.) datasets
  - evaluation/ Cross-validation, Leave-Class-Out and Bootstrap evaluation
  - features/ Pfam2vec and other protein domain features
  - figures/ Paper figures
  - mibig/ MIBiG BGC database samples
  - models/ Model configurations and trained models
  - pfam/ Pfam repository files
  - training/ Negative and positive training data and t-SNE visualizations
- notebooks/ Jupyter (IPython) notebooks
- Define a JSON config file; see data/models/config for reference.
- Run bgc_detection/run_training.py with the config file and the path to the training data. See the DVC files in data/models/trained for reference.
- The trained model is written out as a Python pickle file.
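Since the trained model is a pickle file, it can in principle be loaded back with Python's standard pickle module. The sketch below shows only that generic step; the helper name and the example path are hypothetical, and the interface of the loaded object is not described here.

```python
import pickle


def load_model(path):
    """Load a pickled object from disk. Only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical usage; replace the path with a real model file pulled via DVC:
# model = load_model("path/to/model.pkl")
```

Pickle stores references to classes by module path, so the code that defines the model classes (presumably the bgc_detection package) has to be importable when the file is loaded.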
- Prepare a protein FASTA file, e.g. using Prodigal (see data/bacteria/proteins.dvc for reference), or extract it from an annotated GenBank file using bgc_detection/preprocessing/proteins2fasta.py.
- Detect protein domains using hmmscan (see data/bacteria/domtbl.dvc for reference).
- Convert the hmmscan domtbl file into a Domain CSV file using bgc_detection/preprocessing/domtbl2csv.py (see data/bacteria/domains.dvc for reference).
- Predict domain-level BGC probabilities using bgc_detection/run_prediction.py (see data/bacteria/prediction/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k.dvc for reference).
- Threshold and merge the domain-level predictions into a BGC candidate CSV file using bgc_detection/candidates/threshold_candidates.py (see data/bacteria/candidates/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k-fpr2/candidates.csv.dvc for reference).
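The final step above, thresholding and merging, can be pictured with a small Python sketch. This is not the logic of threshold_candidates.py; it is a generic illustration that assumes the input is a list of per-domain scores in genomic order. The function name, the default threshold, and the max_gap parameter are all hypothetical.

```python
from typing import List, Tuple


def threshold_and_merge(
    scores: List[float], threshold: float = 0.5, max_gap: int = 0
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of runs with score >= threshold.

    Runs separated by at most max_gap below-threshold positions are merged into one.
    """
    # Pass 1: collect maximal runs of consecutive positions at or above the threshold.
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(scores)))

    # Pass 2: merge neighbouring runs whose gap is small enough.
    merged = []
    for region in regions:
        if merged and region[0] - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], region[1])
        else:
            merged.append(region)
    return merged


scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1, 0.1, 0.95]
print(threshold_and_merge(scores))             # [(1, 3), (4, 5), (7, 8)]
print(threshold_and_merge(scores, max_gap=1))  # [(1, 5), (7, 8)]
```

Allowing a small gap keeps a single low-scoring domain from splitting an otherwise contiguous candidate, at the cost of occasionally joining two unrelated regions.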
See notebooks/LabelledContigBootstrap.ipynb.
See data/evaluation/lco-neg-10k (TODO).
See data/evaluation/cv-10fold-neg-10k (TODO).
See notebooks/CandidateClassification.ipynb and notebooks/CandidateActivityClassification.ipynb.