Exploring the AMR protein embedding landscape with protein language models
Protein language models learn sequence representations from evolutionary data without any functional supervision. EmbedAMR asks a simple question: do these representations naturally organise antimicrobial resistance proteins by their biological properties?
5,029 AMR proteins from CARD v3.3.0 are embedded with ESM2-650M and Ankh-base, projected to 2D and 3D with UMAP, and evaluated with quantitative metrics. 187 wild-type penicillin-binding proteins (PBPs) from UniProt serve as biological controls.
Try the tool: huggingface.co/spaces/pranavathiyani/EmbedAMR
Coarse mechanism labels do not map cleanly onto embedding geometry. Overall silhouette scores are negative (ESM2: -0.129, Ankh: -0.133), driven by the antibiotic inactivation class, which covers 80% of CARD and spans structurally unrelated enzyme folds. This reflects a mismatch between ARO ontology granularity and the finer-grained structure PLMs capture -- not a model failure.
Drug-class-specific clusters show meaningful coherence. Cephamycin resistance proteins cluster tightly (ESM2: 0.779, Ankh: 0.833), consistent with the structural coherence of AmpC-type enzymes. Glycopeptide resistance proteins also separate well (ESM2: 0.216, Ankh: 0.454).
ESM2 and Ankh partition AMR space differently. The two models agree on 53% of local protein neighbourhoods (K=10, PCA50), indicating model-specific signal worth exploiting for ensemble approaches.
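The 53% neighbourhood-agreement figure can be sketched as a mean Jaccard-style overlap of K-nearest-neighbour sets across the two embedding spaces. A minimal version, assuming both models' PCA-50 coordinates are stored row-aligned in NumPy arrays (function and variable names are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbour_overlap(emb_a, emb_b, k=10):
    """Mean fraction of shared k-nearest neighbours between two embeddings.

    emb_a, emb_b: (n_proteins, n_dims) arrays holding the same proteins in
    the same row order, e.g. PCA-50 projections of ESM2 and Ankh embeddings.
    """
    nn_a = NearestNeighbors(n_neighbors=k + 1).fit(emb_a)
    nn_b = NearestNeighbors(n_neighbors=k + 1).fit(emb_b)
    # drop column 0: each point is its own nearest neighbour
    idx_a = nn_a.kneighbors(emb_a, return_distance=False)[:, 1:]
    idx_b = nn_b.kneighbors(emb_b, return_distance=False)[:, 1:]
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(idx_a, idx_b)]
    return float(np.mean(overlaps))

# sanity check: identical spaces overlap completely
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))
print(neighbour_overlap(x, x.copy()))  # 1.0
```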
Wild-type PBPs reveal an unexpected sub-zone structure. Without any labels, WT PBPs separate into two groups: soluble transpeptidases landing near Class C AmpC peripheral sequences, and membrane biosynthesis proteins co-locating with mecA and BlaZ resistance genes. Class A beta-lactamases -- often cited as PBP descendants -- are 185-fold more distant from wild-type PBPs than from each other, suggesting PLM embeddings reflect deep evolutionary divergence rather than fold-level proximity.
The interactive app lets you paste any protein sequence, embed it with ESM2-650M, and find its 10 nearest neighbours in the precomputed landscape. Results include a confidence tier (high / moderate / low) based on cosine distance, a plain-language interpretation of the predicted mechanism, and an interactive 3D map showing where your sequence lands.
Sequences that do not resemble known AMR proteins -- like hormones or housekeeping enzymes -- are correctly flagged as low confidence rather than silently assigned a wrong class.
Note: predictions are based on similarity to known AMR proteins in CARD. Novel resistance mechanisms or proteins outside this database may not be identified correctly.
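A stripped-down version of such a nearest-neighbour lookup with distance-based tiering might look like the sketch below. The threshold values are illustrative placeholders, not the app's actual cutoffs:

```python
import numpy as np

def query_landscape(query_emb, ref_embs, ref_labels, k=10,
                    high=0.15, moderate=0.30):
    """Rank reference AMR proteins by cosine distance to a query embedding
    and assign a confidence tier from the nearest hit.

    The `high` and `moderate` thresholds are made-up example values.
    """
    q = query_emb / np.linalg.norm(query_emb)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    dist = 1.0 - r @ q                      # cosine distance to every reference
    order = np.argsort(dist)[:k]            # k nearest references
    best = dist[order[0]]
    tier = "high" if best < high else "moderate" if best < moderate else "low"
    return [(ref_labels[i], float(dist[i])) for i in order], tier
```

A query embedding identical to a reference scores cosine distance 0 and lands in the "high" tier; unrelated sequences drift toward distance 1 and fall to "low", which matches the behaviour described above for non-AMR inputs.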
Open EmbedAMR.ipynb in Google Colab with a T4 GPU. The notebook runs
end to end in approximately 2 hours. Embedding steps checkpoint to Google
Drive so disconnects are safe.
| Cell | Description |
|---|---|
| 1 | GPU check |
| 2 | Install dependencies |
| 3 | Drive mount and checkpoint config |
| 4 | Download CARD v3.3.0 |
| 5 | Parse FASTA headers and join ARO metadata |
| 6 | Mechanism labels, balanced evaluation sets, control tagging |
| 7 | Fetch wild-type PBPs from UniProt |
| 8 | Combine CARD and WT PBP datasets |
| 9 | ESM2-650M embeddings (~90 min on T4) |
| 10 | Ankh-base embeddings (~45 min on T4) |
| 11 | Sequence deduplication check |
| 12 | PCA, 2D UMAP, 3D UMAP |
| 13 | Quantitative evaluation |
| 14 | Save app assets |
| 15 | Generate figures and Table 1 |
| 16 | Zip and export |
| Source | Version / filter | Sequences | Purpose |
|---|---|---|---|
| CARD broadstreet | v3.3.0 | 5,029 | AMR reference proteins |
| UniProt Swiss-Prot | KW-0573, bacteria, reviewed | 199 (187 after dedup) | Wild-type PBP controls |
All data is fetched programmatically inside the notebook; no manual downloads are needed. Twelve UniProt sequences were removed after exact and near-identical sequence matching against CARD.
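The dedup step could be sketched as below. The difflib similarity ratio and the `identity` threshold are stand-ins for whatever criterion the notebook actually applies; this only illustrates the exact-then-near-identical two-pass shape:

```python
from difflib import SequenceMatcher

def dedup_controls(controls, card_seqs, identity=0.98):
    """Drop control sequences that exactly match, or are near-identical
    to, any CARD sequence. Threshold and metric are illustrative."""
    card_set = set(card_seqs)
    kept = []
    for seq in controls:
        if seq in card_set:
            continue  # pass 1: exact duplicate
        if any(SequenceMatcher(None, seq, c).ratio() >= identity
               for c in card_seqs):
            continue  # pass 2: near-identical duplicate
        kept.append(seq)
    return kept
```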
| Model | Parameters | Embedding dim | Source |
|---|---|---|---|
| ESM2-650M | 652M | 1,280 | facebook/esm2_t33_650M_UR50D |
| Ankh-base | 453M | 768 | ElnaggarLab/ankh-base |
Mean pooling of the last hidden state, excluding CLS and EOS tokens. Sequences longer than 1,022 aa use sliding window embedding (window: 1,022, stride: 256, length-weighted averaging).
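The long-sequence scheme can be sketched as follows, with `embed_fn` standing in for the actual ESM2/Ankh mean-pooled forward pass (a hedged sketch, not the notebook's code):

```python
import numpy as np

def sliding_window_embed(seq, embed_fn, window=1022, stride=256):
    """Length-weighted sliding-window embedding for long sequences.

    `embed_fn(subseq)` is assumed to return the mean-pooled embedding of a
    subsequence (CLS/EOS excluded); here it is a placeholder for the model.
    """
    if len(seq) <= window:
        return embed_fn(seq)
    chunks, weights = [], []
    for start in range(0, len(seq), stride):
        sub = seq[start:start + window]
        chunks.append(embed_fn(sub))
        weights.append(len(sub))       # length-weighted averaging
        if start + window >= len(seq):
            break                       # last window reached the end
    return np.average(np.stack(chunks), axis=0, weights=np.asarray(weights, float))
```

Weighting by window length keeps a short trailing window from contributing as much as the full-length windows before it.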
Dimensionality reduction: StandardScaler, PCA (50 components, 89.7% variance), UMAP (n_neighbors=30, cosine metric, random_state=42).
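A dependency-light sketch of the first two stages (standardise, then PCA) is below; the UMAP step that follows in the notebook (umap-learn, n_neighbors=30, metric="cosine", random_state=42) is noted in the docstring rather than run, to keep the example scikit-learn-only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def reduce_embeddings(embeddings, n_pcs=50):
    """Standardise then project to the top principal components.

    In the notebook these PCA coordinates feed UMAP (n_neighbors=30,
    metric="cosine", random_state=42) for the 2D/3D maps; that step is
    omitted here so the sketch needs only scikit-learn.
    """
    scaled = StandardScaler().fit_transform(embeddings)
    pca = PCA(n_components=min(n_pcs, scaled.shape[1]))
    pcs = pca.fit_transform(scaled)
    # the notebook reports 89.7% variance retained at 50 components
    print(f"variance explained: {pca.explained_variance_ratio_.sum():.1%}")
    return pcs
```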
| Metric | ESM2 | Ankh |
|---|---|---|
| Silhouette L1 mechanism (UMAP 2D) | -0.129 | -0.133 |
| Silhouette L2 drug class (UMAP 2D) | -0.235 | -0.192 |
| Silhouette cephamycin cluster | 0.779 | 0.833 |
| Silhouette glycopeptide cluster | 0.216 | 0.454 |
| Neighbour overlap K=10 (PCA50) | 0.530 | -- |
| WT PBP to Class A BLA distance ratio | 185x | -- |
| WT PBP to CARD mutant PBP distance | 2.521 | -- |
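One plausible reading of the per-cluster silhouette rows above (cephamycin, glycopeptide) is the mean per-sample silhouette over the members of that one label, computed against the full labelling. A sketch under that assumption, which also shows how a single drug class can score high while the overall mean stays negative:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def cluster_silhouette(coords, labels, target):
    """Mean silhouette of the points carrying `target`, evaluated within
    the full labelling (e.g. one drug class on the 2D UMAP coordinates)."""
    s = silhouette_samples(coords, labels)
    mask = np.asarray(labels) == target
    return float(s[mask].mean())
```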
Positive controls
- TEM allele panel (205 sequences): within-family coherence
- Beta-lactamase classes A/B/C/D: mechanism-level separation
- MFS/RND/ABC efflux pump families: structural family separation
Negative controls
- Wild-type PBPs (187 sequences): sensitivity targets, not resistance genes
- WT PBP vs Class A BLA: shared fold ancestry, functionally opposite roles
EmbedAMR/
├── EmbedAMR.ipynb # reproducible notebook
├── requirements.txt
├── README.md
├── app/
│ ├── app.py # HuggingFace Spaces source
│ └── requirements.txt
└── results/
├── figures/ # interactive HTML panels
│ ├── panel1_esm2_mechanism.html
│ ├── panel2_ankh_mechanism.html
│ ├── panel3_bla_classes_wt_pbp.html
│ ├── panel4_efflux_families.html
│ └── EmbedAMR_3D_BLA_WT_PBP.html
└── table1_quantitative_results.csv
Precomputed embedding files (.npy, .pkl) are not tracked due to size. They are generated by the notebook and saved to Google Drive.
@misc{pranavathiyani2026embedamr,
author = {Pranavathiyani G},
title = {EmbedAMR: Exploring the AMR Protein Embedding Landscape
with Protein Language Models},
year = {2026},
url = {https://github.com/pranavathiyani/EmbedAMR}
}

Pranavathiyani G
Division of Bioinformatics, SASTRA Deemed University, Thanjavur, Tamil Nadu, India
pranavathiyani@scbt.sastra.edu
Code assistance: Claude (Anthropic)