EmbedAMR

Exploring the AMR protein embedding landscape with protein language models

License: MIT


Protein language models learn sequence representations from evolutionary data without any functional supervision. EmbedAMR asks a simple question: do these representations naturally organise antimicrobial resistance proteins by their biological properties?

5,029 AMR proteins from CARD v3.3.0 are embedded using ESM2-650M and Ankh-base, projected to 2D and 3D UMAP, and evaluated with quantitative metrics. 187 wild-type penicillin-binding proteins from UniProt serve as biological controls.

Try the tool: huggingface.co/spaces/pranavathiyani/EmbedAMR


Findings

Coarse mechanism labels do not map cleanly onto embedding geometry. Overall silhouette scores are negative (ESM2: -0.129, Ankh: -0.133), driven by the antibiotic inactivation class, which covers 80% of CARD and spans structurally unrelated enzyme folds. This reflects a mismatch between the granularity of the ARO ontology and the finer-grained structure that PLMs capture, not a model failure.

Drug-class-specific clusters show meaningful coherence. Cephamycin resistance proteins cluster tightly (ESM2: 0.779, Ankh: 0.833), consistent with the structural coherence of AmpC-type enzymes. Glycopeptide resistance proteins also separate well (ESM2: 0.216, Ankh: 0.454).
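For readers unfamiliar with the metric: the silhouette of a point compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as s = (b - a) / max(a, b). Values near 1 mean tight, well-separated clusters, and negative values mean points sit closer to another cluster than to their own. The following is a minimal pure-Python sketch, assuming Euclidean distance, at least two labels, and that a per-cluster score is the mean over that cluster's points. The function names and toy coordinates are illustrative and are not taken from the notebook, which may well use `sklearn.metrics.silhouette_score` instead.

```python
from math import dist
from statistics import mean

def silhouette_samples(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b) with Euclidean distance."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(mean(dist(p, q) for q, l in zip(points, labels) if l == other)
                for other in set(labels) if other != lab)
        scores.append((b - a) / max(a, b))
    return scores

def cluster_silhouette(points, labels, cluster):
    """Mean silhouette over the points of one cluster (an assumed definition)."""
    scores = silhouette_samples(points, labels)
    return mean(s for s, l in zip(scores, labels) if l == cluster)

# Toy data: two tight groups far apart.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_samples(toy_points, ["a", "a", "b", "b"])
# Same points, but labels that cut across the geometry.
bad = silhouette_samples(toy_points, ["a", "b", "a", "b"])
```

With labels that follow the geometry every score is close to 1. With labels that cut across it the scores turn negative, which is the situation described above for broad mechanism classes.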

ESM2 and Ankh partition AMR space differently. The two models agree on 53% of local protein neighbourhoods (K=10, PCA50), which suggests each model carries signal the other lacks and that ensemble approaches may be worth exploring.
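One way to compute such a neighbourhood agreement is to find each protein's K nearest neighbours in both spaces and average the shared fraction. The sketch below assumes that definition, Euclidean distance, and a brute-force search. The function names are hypothetical and the notebook's exact procedure may differ.

```python
from math import dist

def knn_indices(points, k):
    """Index sets of the k nearest neighbours (Euclidean, brute force) per point."""
    result = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        order = sorted(others, key=lambda j: dist(p, points[j]))
        result.append(set(order[:k]))
    return result

def neighbour_overlap(space_a, space_b, k=10):
    """Mean fraction of each point's k neighbours that both spaces share."""
    nn_a = knn_indices(space_a, k)
    nn_b = knn_indices(space_b, k)
    return sum(len(a & b) / k for a, b in zip(nn_a, nn_b)) / len(nn_a)

# Toy example: the same six items placed differently in two 1-D "spaces".
space_a = [(0,), (1,), (2,), (10,), (11,), (12,)]
space_b = [(0,), (10,), (1,), (11,), (2,), (12,)]
same = neighbour_overlap(space_a, space_a, k=2)
mixed = neighbour_overlap(space_a, space_b, k=2)
```

Brute force is quadratic in the number of proteins. For a few thousand sequences that is workable, but an indexed search would be the usual choice at larger scale.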

Wild-type PBPs reveal an unexpected sub-zone structure. Without any labels, WT PBPs separate into two groups: soluble transpeptidases land near Class C AmpC peripheral sequences, while membrane biosynthesis proteins co-locate with mecA and BlaZ resistance genes. Class A beta-lactamases, often cited as PBP descendants, are 185-fold more distant from wild-type PBPs than from each other, suggesting that PLM embeddings reflect deep evolutionary divergence rather than fold-level proximity.
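A fold-distance figure like this can be read as a ratio of two averages: the mean distance between groups divided by the mean distance within a group. The sketch below assumes exactly that definition with Euclidean distance. The function name is hypothetical, and the distance measure and embedding space used for the reported value are not specified here.

```python
from itertools import combinations, product
from math import dist
from statistics import mean

def distance_ratio(group, reference):
    """Mean group-to-reference distance divided by mean within-group distance."""
    within = mean(dist(p, q) for p, q in combinations(group, 2))
    between = mean(dist(p, q) for p, q in product(group, reference))
    return between / within

# Toy example: a compact group and one distant reference point.
ratio = distance_ratio([(0.0,), (1.0,)], [(10.5,)])
```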


Tool

The interactive app lets you paste any protein sequence, embed it with ESM2-650M, and find its 10 nearest neighbours in the precomputed landscape. Results include a confidence tier (high / moderate / low) based on cosine distance, a plain-language interpretation of the predicted mechanism, and an interactive 3D map showing where your sequence lands.

Sequences that do not resemble known AMR proteins, such as hormones or housekeeping enzymes, are flagged as low confidence rather than silently assigned a wrong class.
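The lookup described above can be sketched as a cosine-distance search followed by a threshold rule. Everything specific in this example is an assumption made for illustration: the two thresholds, the use of the mean neighbour distance, and the function names are hypothetical and are not taken from `app.py`.

```python
from math import sqrt

HIGH_MAX = 0.10      # hypothetical threshold, not taken from the app
MODERATE_MAX = 0.30  # hypothetical threshold, not taken from the app

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def nearest_neighbours(query, reference, k=10):
    """Return (index, cosine distance) for the k closest reference vectors."""
    scored = [(i, cosine_distance(query, ref)) for i, ref in enumerate(reference)]
    return sorted(scored, key=lambda pair: pair[1])[:k]

def confidence_tier(neighbours):
    """Tier from the mean distance to the retrieved neighbours (an assumed rule)."""
    mean_d = sum(d for _, d in neighbours) / len(neighbours)
    if mean_d <= HIGH_MAX:
        return "high"
    if mean_d <= MODERATE_MAX:
        return "moderate"
    return "low"

# Toy landscape of three 2-D "embeddings".
landscape = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
close_hits = nearest_neighbours((1.0, 0.0), landscape, k=2)
far_hits = nearest_neighbours((0.0, 1.0), landscape[:2], k=2)
```

A query that matches the landscape closely falls in the high tier, while one that is nearly orthogonal to every reference vector falls in the low tier. This is the behaviour wanted for unrelated proteins.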

Note: predictions are based on similarity to known AMR proteins in CARD. Novel resistance mechanisms or proteins outside this database may not be identified correctly.


Reproduce

Open EmbedAMR.ipynb in Google Colab with a T4 GPU. The notebook runs end to end in approximately 2 hours. Embedding steps checkpoint to Google Drive so disconnects are safe.
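The checkpointing idea can be sketched as follows: save finished embeddings at intervals, and on restart skip anything already saved. This is a generic pattern written for illustration, not the notebook's code. The function name, the JSON file format, and the `every` interval are all assumptions, and `fake_embed` stands in for a real model call.

```python
import json
import os
import tempfile

def embed_with_checkpoints(sequences, embed_fn, path, every=100):
    """Embed {id: sequence} items, saving progress to `path` every `every` items."""
    done = {}
    if os.path.exists(path):
        with open(path) as fh:
            done = json.load(fh)  # resume from an earlier run
    pending = [sid for sid in sequences if sid not in done]
    for n, sid in enumerate(pending, start=1):
        done[sid] = embed_fn(sequences[sid])
        if n % every == 0 or n == len(pending):
            # Write to a temporary file, then swap it in, so an interruption
            # mid-write does not damage the previous checkpoint.
            tmp_path = path + ".tmp"
            with open(tmp_path, "w") as fh:
                json.dump(done, fh)
            os.replace(tmp_path, path)
    return done

calls = []

def fake_embed(seq):
    """Stand-in for a model call: records the call, returns a 1-D 'embedding'."""
    calls.append(seq)
    return [float(len(seq))]

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = os.path.join(tmp_dir, "embeddings.json")
    seqs = {"p1": "MKT", "p2": "MKTAY", "p3": "MK"}
    first = embed_with_checkpoints(seqs, fake_embed, ckpt, every=2)
    second = embed_with_checkpoints(seqs, fake_embed, ckpt, every=2)  # simulated restart
```

The second call finds all three IDs in the checkpoint and performs no new embedding work, which is what makes a disconnect cheap.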

| Cell | Description |
|------|-------------|
| 1 | GPU check |
| 2 | Install dependencies |
| 3 | Drive mount and checkpoint config |
| 4 | Download CARD v3.3.0 |
| 5 | Parse FASTA headers and join ARO metadata |
| 6 | Mechanism labels, balanced evaluation sets, control tagging |
| 7 | Fetch wild-type PBPs from UniProt |
| 8 | Combine CARD and WT PBP datasets |
| 9 | ESM2-650M embeddings (~90 min on T4) |
| 10 | Ankh-base embeddings (~45 min on T4) |
| 11 | Sequence deduplication check |
| 12 | PCA, 2D UMAP, 3D UMAP |
| 13 | Quantitative evaluation |
| 14 | Save app assets |
| 15 | Generate figures and Table 1 |
| 16 | Zip and export |

Data

| Source | Version | Sequences | Purpose |
|--------|---------|-----------|---------|
| CARD | broadstreet v3.3.0 | 5,029 | AMR reference proteins |
| UniProt Swiss-Prot | KW-0573, bacteria, reviewed | 199 (187 after dedup) | Wild-type PBP controls |

All data is fetched programmatically inside the notebook, so no manual downloads are needed. Twelve UniProt sequences were removed after exact and near-identical sequence matching against CARD.


Models

| Model | Parameters | Embedding dim | Source |
|-------|------------|---------------|--------|
| ESM2-650M | 652M | 1,280 | facebook/esm2_t33_650M_UR50D |
| Ankh-base | 453M | 768 | ElnaggarLab/ankh-base |

Each protein embedding is the mean of the last hidden state over residue positions, excluding the CLS and EOS tokens. Sequences longer than 1,022 aa are embedded with a sliding window (window: 1,022, stride: 256), and the window embeddings are combined by length-weighted averaging.
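The pooling and windowing steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the notebook's code: the function names are hypothetical, the final window is assumed to be truncated at the sequence end, and `mean_pool` defaults to a tokenizer that adds one special token at each end (a tokenizer that only appends an end token would use `skip_start=0`).

```python
def mean_pool(token_vectors, skip_start=1, skip_end=1):
    """Average per-token vectors after dropping special tokens at either end."""
    residues = token_vectors[skip_start:len(token_vectors) - skip_end]
    dim = len(residues[0])
    return [sum(v[d] for v in residues) / len(residues) for d in range(dim)]

def window_spans(length, window=1022, stride=256):
    """(start, end) residue spans covering a sequence of `length` residues."""
    if length <= window:
        return [(0, length)]
    spans = []
    start = 0
    while True:
        end = min(start + window, length)  # last window is cut at the sequence end
        spans.append((start, end))
        if end == length:
            break
        start += stride
    return spans

def length_weighted_mean(vectors, lengths):
    """Combine window embeddings, weighting each by its window length."""
    total = sum(lengths)
    dim = len(vectors[0])
    return [sum(v[d] * n for v, n in zip(vectors, lengths)) / total
            for d in range(dim)]

# Toy checks: a 4-token input whose first and last tokens are special tokens,
# the spans for a 1,500-residue sequence, and a 3:1 weighted average.
pooled = mean_pool([[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [9.0, 9.0]])
spans = window_spans(1500)
combined = length_weighted_mean([[1.0, 0.0], [0.0, 1.0]], [3, 1])
```

Weighting by length keeps a short final window from counting as much as a full one.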

Dimensionality reduction: StandardScaler, PCA (50 components, 89.7% variance), UMAP (n_neighbors=30, cosine metric, random_state=42).
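A minimal sketch of this pipeline with scikit-learn and umap-learn is shown below. The parameter values come from the line above. The function name, the choice to run UMAP on the PCA output, and the PCA `random_state` are assumptions made for the example.

```python
PCA_COMPONENTS = 50
UMAP_KWARGS = dict(n_neighbors=30, metric="cosine", random_state=42)

def reduce_embeddings(X, n_dims=2):
    """Scale features, keep 50 principal components, then project with UMAP.

    X is an (n_proteins, embedding_dim) array. Returns (pcs, coords).
    """
    # Imported lazily so the constants above can be read without these packages.
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    import umap  # provided by the umap-learn package

    scaled = StandardScaler().fit_transform(X)
    # random_state for PCA is an assumption; the source only fixes it for UMAP.
    pcs = PCA(n_components=PCA_COMPONENTS, random_state=42).fit_transform(scaled)
    coords = umap.UMAP(n_components=n_dims, **UMAP_KWARGS).fit_transform(pcs)
    return pcs, coords
```

Calling `reduce_embeddings(X, n_dims=3)` would give the 3D projection from the same 50-component input. Note that PCA with 50 components needs at least 50 proteins in `X`.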


Quantitative results

| Metric | ESM2 | Ankh |
|--------|------|------|
| Silhouette L1 mechanism (UMAP 2D) | -0.129 | -0.133 |
| Silhouette L2 drug class (UMAP 2D) | -0.235 | -0.192 |
| Silhouette cephamycin cluster | 0.779 | 0.833 |
| Silhouette glycopeptide cluster | 0.216 | 0.454 |
| Neighbour overlap K=10 (PCA50) | 0.530 | -- |
| WT PBP to Class A BLA distance ratio | 185x | -- |
| WT PBP to CARD mutant PBP distance | 2.521 | -- |

Controls

Positive controls

  • TEM allele panel (205 sequences): within-family coherence
  • Beta-lactamase classes A/B/C/D: mechanism-level separation
  • MFS/RND/ABC efflux pump families: structural family separation

Negative controls

  • Wild-type PBPs (187 sequences): sensitivity targets, not resistance genes
  • WT PBP vs Class A BLA: shared fold ancestry, functionally opposite roles

Structure

EmbedAMR/
├── EmbedAMR.ipynb              # reproducible notebook
├── requirements.txt
├── README.md
├── app/
│   ├── app.py                  # HuggingFace Spaces source
│   └── requirements.txt
└── results/
    ├── figures/                # interactive HTML panels
    │   ├── panel1_esm2_mechanism.html
    │   ├── panel2_ankh_mechanism.html
    │   ├── panel3_bla_classes_wt_pbp.html
    │   ├── panel4_efflux_families.html
    │   └── EmbedAMR_3D_BLA_WT_PBP.html
    └── table1_quantitative_results.csv

Precomputed embedding files (.npy, .pkl) are not tracked due to size. They are generated by the notebook and saved to Google Drive.


Citation

@misc{pranavathiyani2026embedamr,
  author = {Pranavathiyani G},
  title  = {EmbedAMR: Exploring the AMR Protein Embedding Landscape
             with Protein Language Models},
  year   = {2026},
  url    = {https://github.com/pranavathiyani/EmbedAMR}
}

Contact

Pranavathiyani G
Division of Bioinformatics, SASTRA Deemed University, Thanjavur, Tamil Nadu, India
pranavathiyani@scbt.sastra.edu


Code assistance: Claude (Anthropic)
