Fusing Sequence Motifs and Pan-Genomic Features: Antimicrobial Resistance Prediction using an Explainable Lightweight 1D CNN - XGBoost Ensemble
Official Scikit-learn/TensorFlow/Keras implementation for the paper: "Fusing Sequence Motifs and Pan-Genomic Features: Antimicrobial Resistance Prediction using an Explainable Lightweight 1D CNN - XGBoost Ensemble"
Summary: The rapid and accurate prediction of Antimicrobial Resistance (AMR) from genomic data is a critical challenge in modern medicine. Standard machine learning models often treat the genome as an unordered "bag of features," ignoring the valuable sequential information encoded in the order of Single Nucleotide Polymorphisms (SNPs). Conversely, state-of-the-art sequence models like Transformers are often too data-hungry for the moderately-sized datasets typical in genomic surveillance. We propose AMR-EnsembleNet, a simple yet powerful ensemble framework that synergistically combines the strengths of these two approaches. Our framework fuses a lightweight, custom-tuned 1D Convolutional Neural Network (CNN), designed to learn predictive sequence motifs, with an XGBoost model adept at capturing complex, non-local feature interactions. When trained and evaluated on a benchmark dataset of 809 E. coli isolates, our ensemble model consistently achieves top-tier performance across four antibiotics with varying class imbalance. For the highly challenging Gentamicin (GEN) dataset, the ensemble yields the best overall performance with a Matthews Correlation Coefficient (MCC) of 0.403, a score driven by the CNN's superior ability to recall rare resistant cases. Our results show that fusing complementary sequence-based and feature-based models provides a robust, accurate, and computationally feasible solution for AMR prediction.
The AMR-EnsembleNet is a simple soft voting ensemble that combines the prediction probabilities of two powerful, complementary models:
- A Sequence-Aware Custom 1D CNN: A deep, lightweight 1D Convolutional Neural Network (CNN) processes the ordered sequence of integer-encoded SNPs. Its hierarchical convolutional layers are designed to learn local patterns and motifs, such as multiple functionally related mutations within a single gene.
- A Feature-Based XGBoost Model: An XGBoost classifier operates on the same SNP data but treats it as an unordered "bag of features." This allows it to learn complex, non-linear interactions between SNPs, regardless of their position on the chromosome.
The final prediction is a soft-voting average of the two models' predicted probabilities, using either equal or tuned weights, which yields a more robust and better-generalized classifier. A minimal sketch of this fusion is shown below.
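To make the fusion concrete, here is a minimal sketch of the pipeline. The layer sizes, hyperparameters, and variable names (`X_train`, `y_train`, `X_test`) are illustrative assumptions, not the tuned configuration from the paper; the SNP matrix is assumed to be integer-encoded with values below `n_alleles`.

```python
# Minimal soft-voting sketch: lightweight 1D CNN + XGBoost on integer-encoded SNPs.
# All hyperparameters below are illustrative, not the paper's tuned values.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from xgboost import XGBClassifier

def build_cnn(seq_len, n_alleles=5):
    """Lightweight 1D CNN over an ordered, integer-encoded SNP sequence."""
    model = models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(input_dim=n_alleles, output_dim=8),  # embed each SNP code
        layers.Conv1D(64, kernel_size=7, activation="relu"),  # local motif detectors
        layers.MaxPooling1D(pool_size=4),
        layers.Conv1D(128, kernel_size=5, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),                 # P(resistant)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# X_train / X_test: (n_isolates, n_snps) integer-encoded SNP matrices; y_train: 0/1 phenotypes.
cnn = build_cnn(seq_len=X_train.shape[1])
cnn.fit(X_train, y_train, epochs=30, batch_size=32, validation_split=0.1, verbose=0)

xgb = XGBClassifier(n_estimators=300, max_depth=6)  # same SNPs, treated as a bag of features
xgb.fit(X_train, y_train)

# Soft voting: average the resistant-class probabilities (w = 0.5 gives the unweighted vote).
w = 0.5
p_cnn = cnn.predict(X_test).ravel()
p_xgb = xgb.predict_proba(X_test)[:, 1]
p_ensemble = w * p_cnn + (1.0 - w) * p_xgb
y_pred = (p_ensemble >= 0.5).astype(int)
```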
A central finding of this research is that no single model is universally superior across all AMR prediction tasks. The best performance is often achieved by an ensemble that leverages the complementary strengths of a sequence-aware deep learning model and a powerful tree-based model, especially on challenging, imbalanced datasets.
| Model | Accuracy | F1 Score (Resist.) | Matthews (MCC) | Precision | Recall | F1 Score (Macro) |
|---|---|---|---|---|---|---|
| Random Forest | 0.9568 | 0.9517 | 0.9127 | 0.9583 | 0.9452 | 0.9563 |
| XGBoost | 0.9630 | 0.9583 | 0.9253 | 0.9718 | 0.9452 | 0.9625 |
| 1D CNN | 0.9568 | 0.9524 | 0.9129 | 0.9459 | 0.9589 | 0.9564 |
| AMR-EnsembleNet (Ensemble) | 0.9630 | 0.9577 | 0.9260 | 0.9855 | 0.9315 | 0.9624 |
Results on the Gentamicin (GEN) dataset:
| Model | Accuracy | F1 Score (Resist.) | Matthews (MCC) | Precision | Recall | F1 Score (Macro) |
|---|---|---|---|---|---|---|
| Random Forest | 0.7469 | 0.4938 | 0.3271 | 0.4651 | 0.5263 | 0.6626 |
| XGBoost | 0.7593 | 0.5301 | 0.3722 | 0.4889 | 0.5789 | 0.6841 |
| 1D CNN | 0.7346 | 0.5567 | 0.3984 | 0.4576 | 0.7105 | 0.6836 |
| AMR-EnsembleNet (Ensemble) | 0.7469 | 0.5591 | 0.4030 | 0.4727 | 0.6842 | 0.6908 |
The 1D CNN's superior recall on the rare resistant class for Gentamicin is a key finding, which the ensemble successfully leverages to achieve the best overall balanced performance (highest MCC and Macro F1-score).
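The metrics in these tables can be reproduced with scikit-learn. The snippet below is a sketch that assumes the resistant class is encoded as 1 and that `y_test` and `y_pred` come from a fitted model such as the ensemble sketched above.

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

# Assumes the resistant class is labelled 1; y_test / y_pred are placeholders
# for the held-out labels and a model's hard predictions.
metrics = {
    "Accuracy":           accuracy_score(y_test, y_pred),
    "F1 Score (Resist.)": f1_score(y_test, y_pred, pos_label=1),
    "Matthews (MCC)":     matthews_corrcoef(y_test, y_pred),
    "Precision":          precision_score(y_test, y_pred, pos_label=1),
    "Recall":             recall_score(y_test, y_pred, pos_label=1),
    "F1 Score (Macro)":   f1_score(y_test, y_pred, average="macro"),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```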
This project is built using TensorFlow/Keras, Scikit-learn, and XGBoost.
1. Clone the repository:
   `git clone https://github.com/Saiful185/AMR-EnsembleNet.git`
   `cd AMR-EnsembleNet`
2. Install dependencies (a virtual environment is recommended):
   `pip install -r requirements.txt`
   Key dependencies include: tensorflow, scikit-learn, shap, xgboost, pandas, numpy, matplotlib.
The experiments are run on the publicly available GieSSen dataset. The specific files used are the SNP matrix (cip_ctx_ctz_gen_multi_data.csv) and the phenotype data (cip_ctx_ctz_gen_pheno.csv). These are included in this repository or can be downloaded from the original ML-iAMR GitHub repository.
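A minimal loading sketch, assuming the two CSVs sit in the working directory; the exact column layout (any isolate-ID column, the per-antibiotic phenotype column names such as "GEN") is an assumption and should be checked against the files:

```python
import pandas as pd

# File names as given above; adjust the paths to your local copies.
snp_df = pd.read_csv("cip_ctx_ctz_gen_multi_data.csv")
pheno_df = pd.read_csv("cip_ctx_ctz_gen_pheno.csv")

# Assumption: the SNP matrix columns are integer-encoded SNPs, and the phenotype
# file has one 0/1 column per antibiotic (e.g. "GEN" for Gentamicin).
X = snp_df.values
y = pheno_df["GEN"].values

print(X.shape, y.shape)
```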
The code is organized into Jupyter/Colab notebooks (.ipynb) for each model and the final ensemble.
- Open a notebook.
- Update the paths in the first few cells to point to your dataset's location.
- Run the cells sequentially to perform data setup, model training, and final evaluation. The training notebooks save the final models, which are then loaded by the ensembling notebook (see the sketch below).
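As a sketch of that hand-off (the file names here are hypothetical; the notebooks define the actual paths), persisting and reloading the two models looks like this:

```python
# In a training notebook: persist the fitted models (hypothetical file names).
cnn.save("cnn_gen.h5")
xgb.save_model("xgb_gen.json")

# In the ensembling notebook: reload them before soft voting.
import tensorflow as tf
from xgboost import XGBClassifier

cnn = tf.keras.models.load_model("cnn_gen.h5")
xgb = XGBClassifier()
xgb.load_model("xgb_gen.json")
```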
The pre-trained model weights for our key experiments are available for download from the v1.0.0 release of this repository.
| Model | Trained For | Description | Download Link |
|---|---|---|---|
| 1D CNN | Ciprofloxacin | Our sequence-aware 1D-CNN model for CIP. | Link |
| XGBoost | Ciprofloxacin | The feature-based baseline for CIP. | Link |
| Random Forest | Ciprofloxacin | The Random Forest baseline for CIP. | Link |
| 1D CNN | Cefotaxime | Our sequence-aware 1D-CNN model for CTX. | Link |
| XGBoost | Cefotaxime | The feature-based baseline for CTX. | Link |
| Random Forest | Cefotaxime | The Random Forest baseline for CTX. | Link |
| 1D CNN | Ceftazidime | Our sequence-aware 1D-CNN model for CTZ. | Link |
| XGBoost | Ceftazidime | The feature-based baseline for CTZ. | Link |
| Random Forest | Ceftazidime | The Random Forest baseline for CTZ. | Link |
| 1D CNN | Gentamicin | Our sequence-aware 1D-CNN model for GEN. | Link |
| XGBoost | Gentamicin | The feature-based baseline for GEN. | Link |
| Random Forest | Gentamicin | The Random Forest baseline for GEN. | Link |
If you find this work useful in your research, please consider citing our paper:
@article{Siddiqui2025AMREnsembleNet,
author = {Siddiqui, Md. Saiful Bari and Tarannum, Nowshin},
title = {Fusing Sequence Motifs and Pan-Genomic Features: Antimicrobial Resistance Prediction using an Explainable Lightweight 1D CNN - XGBoost Ensemble},
elocation-id = {2025.09.27.678993},
year = {2025},
doi = {10.1101/2025.09.27.678993},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/09/27/2025.09.27.678993},
journal = {bioRxiv}
}

This project is licensed under the MIT License. See the LICENSE file for details.
