Fusing Sequence Motifs and Pan-Genomic Features: Antimicrobial Resistance Prediction using an Explainable Lightweight 1D CNN - XGBoost Ensemble
Official Scikit-learn/TensorFlow/Keras implementation for the paper: "Fusing Sequence Motifs and Pan-Genomic Features: Antimicrobial Resistance Prediction using an Explainable Lightweight 1D CNN - XGBoost Ensemble"
Summary: The rapid and accurate prediction of Antimicrobial Resistance (AMR) from genomic data is a critical challenge in modern medicine. Standard machine learning models often treat the genome as an unordered "bag of features," ignoring the valuable sequential information encoded in the order of Single Nucleotide Polymorphisms (SNPs). Conversely, state-of-the-art sequence models like Transformers are often too data-hungry for the moderately-sized datasets typical in genomic surveillance. We propose AMR-EnsembleNet, a simple yet powerful ensemble framework that synergistically combines the strengths of these two approaches. Our framework fuses a lightweight, custom-tuned 1D Convolutional Neural Network (CNN), designed to learn predictive sequence motifs, with an XGBoost model adept at capturing complex, non-local feature interactions. When trained and evaluated on a benchmark dataset of 809 E. coli isolates, our ensemble model consistently achieves top-tier performance across four antibiotics with varying class imbalance. For the highly challenging Gentamicin (GEN) dataset, the ensemble yields the best overall performance with a Matthews Correlation Coefficient (MCC) of 0.403, a score driven by the CNN's superior ability to recall rare resistant cases. Our results show that fusing complementary sequence-based and feature-based models provides a robust, accurate, and computationally feasible solution for AMR prediction.
The AMR-EnsembleNet is a simple soft voting ensemble that combines the prediction probabilities of two powerful, complementary models:
- A Sequence-Aware Custom 1D CNN: A deep, lightweight 1D Convolutional Neural Network (CNN) processes the ordered sequence of integer-encoded SNPs. Its hierarchical convolutional layers are designed to learn local patterns and motifs, such as multiple functionally related mutations within a single gene.
- A Feature-Based XGBoost Model: An XGBoost classifier operates on the same SNP data but treats it as an unordered "bag of features." This allows it to learn complex, non-linear interactions between SNPs, regardless of their position on the chromosome.
The final prediction is a soft-voting average of the two models' predicted probabilities, using either equal or tuned weights, which yields a more robust and better-generalized classifier. A minimal sketch of this fusion is shown below.
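To make the fusion concrete, here is a minimal sketch of the pipeline. The layer sizes, hyperparameters, and variable names (`X_train`, `y_train`, `X_test`) are illustrative assumptions, not the tuned configuration from the paper; the SNP matrix is assumed to be integer-encoded with values below `n_alleles`.

```python
# Minimal soft-voting sketch: lightweight 1D CNN + XGBoost on integer-encoded SNPs.
# All hyperparameters below are illustrative, not the paper's tuned values.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from xgboost import XGBClassifier

def build_cnn(seq_len, n_alleles=5):
    """Lightweight 1D CNN over an ordered, integer-encoded SNP sequence."""
    model = models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(input_dim=n_alleles, output_dim=8),  # embed each SNP code
        layers.Conv1D(64, kernel_size=7, activation="relu"),  # local motif detectors
        layers.MaxPooling1D(pool_size=4),
        layers.Conv1D(128, kernel_size=5, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),                 # P(resistant)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# X_train / X_test: (n_isolates, n_snps) integer-encoded SNP matrices; y_train: 0/1 phenotypes.
cnn = build_cnn(seq_len=X_train.shape[1])
cnn.fit(X_train, y_train, epochs=30, batch_size=32, validation_split=0.1, verbose=0)

xgb = XGBClassifier(n_estimators=300, max_depth=6)  # same SNPs, treated as a bag of features
xgb.fit(X_train, y_train)

# Soft voting: average the resistant-class probabilities (w = 0.5 gives the unweighted vote).
w = 0.5
p_cnn = cnn.predict(X_test).ravel()
p_xgb = xgb.predict_proba(X_test)[:, 1]
p_ensemble = w * p_cnn + (1.0 - w) * p_xgb
y_pred = (p_ensemble >= 0.5).astype(int)
```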
A central finding of this research is that no single model is universally superior across all AMR prediction tasks. The best performance is often achieved by an ensemble that leverages the complementary strengths of a sequence-aware deep learning model and a powerful tree-based model, especially on challenging, imbalanced datasets.
| Model | Accuracy | F1 Score (Resist.) | Matthews (MCC) | Precision | Recall | F1 Score (Macro) |
|---|---|---|---|---|---|---|
| Random Forest | 0.9568 | 0.9517 | 0.9127 | 0.9583 | 0.9452 | 0.9563 |
| XGBoost | 0.9630 | 0.9583 | 0.9253 | 0.9718 | 0.9452 | 0.9625 |
| 1D CNN | 0.9568 | 0.9524 | 0.9129 | 0.9459 | 0.9589 | 0.9564 |
| AMR-EnsembleNet (Ensemble) | 0.9630 | 0.9577 | 0.9260 | 0.9855 | 0.9315 | 0.9624 |
Results on the Gentamicin (GEN) dataset:
| Model | Accuracy | F1 Score (Resist.) | Matthews (MCC) | Precision | Recall | F1 Score (Macro) |
|---|---|---|---|---|---|---|
| Random Forest | 0.7469 | 0.4938 | 0.3271 | 0.4651 | 0.5263 | 0.6626 |
| XGBoost | 0.7593 | 0.5301 | 0.3722 | 0.4889 | 0.5789 | 0.6841 |
| 1D CNN | 0.7346 | 0.5567 | 0.3984 | 0.4576 | 0.7105 | 0.6836 |
| AMR-EnsembleNet (Ensemble) | 0.7469 | 0.5591 | 0.4030 | 0.4727 | 0.6842 | 0.6908 |
The 1D CNN's superior recall on the rare resistant class for Gentamicin is a key finding, which the ensemble successfully leverages to achieve the best overall balanced performance (highest MCC and Macro F1-score).
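The metrics in these tables can be reproduced with scikit-learn. The snippet below is a sketch that assumes the resistant class is encoded as 1 and that `y_test` and `y_pred` come from a fitted model such as the ensemble sketched above.

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

# Assumes the resistant class is labelled 1; y_test / y_pred are placeholders
# for the held-out labels and a model's hard predictions.
metrics = {
    "Accuracy":           accuracy_score(y_test, y_pred),
    "F1 Score (Resist.)": f1_score(y_test, y_pred, pos_label=1),
    "Matthews (MCC)":     matthews_corrcoef(y_test, y_pred),
    "Precision":          precision_score(y_test, y_pred, pos_label=1),
    "Recall":             recall_score(y_test, y_pred, pos_label=1),
    "F1 Score (Macro)":   f1_score(y_test, y_pred, average="macro"),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```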
This project is built using TensorFlow/Keras, Scikit-learn, and XGBoost.
1. Clone the repository:
   `git clone https://github.com/Saiful185/AMR-EnsembleNet.git`
   `cd AMR-EnsembleNet`
2. Install dependencies (a virtual environment is recommended):
   `pip install -r requirements.txt`
   Key dependencies include: tensorflow, scikit-learn, shap, xgboost, pandas, numpy, matplotlib.
The experiments are run on the publicly available GieSSen dataset. The specific files used are the SNP matrix (cip_ctx_ctz_gen_multi_data.csv) and the phenotype data (cip_ctx_ctz_gen_pheno.csv). These are included in this repository or can be downloaded from the original ML-iAMR GitHub repository.
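A minimal loading sketch, assuming the two CSVs sit in the working directory; the exact column layout (any isolate-ID column, the per-antibiotic phenotype column names such as "GEN") is an assumption and should be checked against the files:

```python
import pandas as pd

# File names as given above; adjust the paths to your local copies.
snp_df = pd.read_csv("cip_ctx_ctz_gen_multi_data.csv")
pheno_df = pd.read_csv("cip_ctx_ctz_gen_pheno.csv")

# Assumption: the SNP matrix columns are integer-encoded SNPs, and the phenotype
# file has one 0/1 column per antibiotic (e.g. "GEN" for Gentamicin).
X = snp_df.values
y = pheno_df["GEN"].values

print(X.shape, y.shape)
```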
The code is organized into Jupyter/Colab notebooks (.ipynb) for each model and the final ensemble.
- Open a notebook.
- Update the paths in the first few cells to point to your dataset's location.
- Run the cells sequentially to perform data setup, model training, and final evaluation. The training notebooks save the final models, which are then loaded by the ensembling notebook (see the sketch below).
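As a sketch of that hand-off (the file names here are hypothetical; the notebooks define the actual paths), persisting and reloading the two models looks like this:

```python
# In a training notebook: persist the fitted models (hypothetical file names).
cnn.save("cnn_gen.h5")
xgb.save_model("xgb_gen.json")

# In the ensembling notebook: reload them before soft voting.
import tensorflow as tf
from xgboost import XGBClassifier

cnn = tf.keras.models.load_model("cnn_gen.h5")
xgb = XGBClassifier()
xgb.load_model("xgb_gen.json")
```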
The pre-trained model weights for our key experiments are available for download from the v1.0.0 release of this repository.
| Model | Trained For | Description | Download Link |
|---|---|---|---|
| 1D CNN | Ciprofloxacin | Our sequence-aware 1D-CNN model for CIP. | Link |
| XGBoost | Ciprofloxacin | The feature-based baseline for CIP. | Link |
| Random Forest | Ciprofloxacin | The Random Forest baseline for CIP. | Link |
| 1D CNN | Cefotaxime | Our sequence-aware 1D-CNN model for CTX. | Link |
| XGBoost | Cefotaxime | The feature-based baseline for CTX. | Link |
| Random Forest | Cefotaxime | The Random Forest baseline for CTX. | Link |
| 1D CNN | Ceftazidime | Our sequence-aware 1D-CNN model for CTZ. | Link |
| XGBoost | Ceftazidime | The feature-based baseline for CTZ. | Link |
| Random Forest | Ceftazidime | The Random Forest baseline for CTZ. | Link |
| 1D CNN | Gentamicin | Our sequence-aware 1D-CNN model for GEN. | Link |
| XGBoost | Gentamicin | The feature-based baseline for GEN. | Link |
| Random Forest | Gentamicin | The Random Forest baseline for GEN. | Link |
If you find this work useful in your research, please consider citing our paper:
@article{Siddiqui2025AMREnsembleNet,
author = {Siddiqui, Md. Saiful Bari and Tarannum, Nowshin},
title = {Fusing Sequence Motifs and Pan-Genomic Features: Antimicrobial Resistance Prediction using an Explainable Lightweight 1D CNN - XGBoost Ensemble},
elocation-id = {2025.09.27.678993},
year = {2025},
doi = {10.1101/2025.09.27.678993},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/09/27/2025.09.27.678993},
journal = {bioRxiv}
}

This project is licensed under the MIT License. See the LICENSE file for details.
