Exploring the AMR protein embedding landscape with protein language models
Protein language models learn sequence representations from evolutionary data without any functional supervision. EmbedAMR asks a simple question: do these representations naturally organise antimicrobial resistance proteins by their biological properties?
5,029 AMR proteins from CARD v3.3.0 are embedded with ESM2-650M and Ankh-base, projected to 2D and 3D with UMAP, and evaluated with quantitative metrics. 187 wild-type penicillin-binding proteins (PBPs) from UniProt serve as biological controls.
Try the tool: huggingface.co/spaces/pranavathiyani/EmbedAMR
Coarse mechanism labels do not map cleanly onto embedding geometry. Overall silhouette scores are negative (ESM2: -0.129, Ankh: -0.133), driven by the antibiotic inactivation class, which covers 80% of CARD and spans structurally unrelated enzyme folds. This reflects a mismatch between ARO ontology granularity and the finer-grained structure PLMs capture -- not a model failure.
Drug-class-specific clusters show meaningful coherence. Cephamycin resistance proteins cluster tightly (ESM2: 0.779, Ankh: 0.833), consistent with the structural coherence of AmpC-type enzymes. Glycopeptide resistance proteins also separate well (ESM2: 0.216, Ankh: 0.454).
ESM2 and Ankh partition AMR space differently. The two models agree on 53% of local protein neighbourhoods (K=10, PCA50), indicating model-specific signal worth exploiting for ensemble approaches.
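The 53% neighbourhood-agreement figure can be sketched as a mean Jaccard-style overlap of K-nearest-neighbour sets across the two embedding spaces. A minimal version, assuming both models' PCA-50 coordinates are stored row-aligned in NumPy arrays (function and variable names are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbour_overlap(emb_a, emb_b, k=10):
    """Mean fraction of shared k-nearest neighbours between two embeddings.

    emb_a, emb_b: (n_proteins, n_dims) arrays holding the same proteins in
    the same row order, e.g. PCA-50 projections of ESM2 and Ankh embeddings.
    """
    nn_a = NearestNeighbors(n_neighbors=k + 1).fit(emb_a)
    nn_b = NearestNeighbors(n_neighbors=k + 1).fit(emb_b)
    # drop column 0: each point is its own nearest neighbour
    idx_a = nn_a.kneighbors(emb_a, return_distance=False)[:, 1:]
    idx_b = nn_b.kneighbors(emb_b, return_distance=False)[:, 1:]
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(idx_a, idx_b)]
    return float(np.mean(overlaps))

# sanity check: identical spaces overlap completely
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))
print(neighbour_overlap(x, x.copy()))  # 1.0
```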
Wild-type PBPs reveal an unexpected sub-zone structure. Without any labels, WT PBPs separate into two groups: soluble transpeptidases landing near Class C AmpC peripheral sequences, and membrane biosynthesis proteins co-locating with mecA and BlaZ resistance genes. Class A beta-lactamases -- often cited as PBP descendants -- are 185-fold more distant from wild-type PBPs than from each other, suggesting PLM embeddings reflect deep evolutionary divergence rather than fold-level proximity.
The interactive app lets you paste any protein sequence, embed it with ESM2-650M, and find its 10 nearest neighbours in the precomputed landscape. Results include a confidence tier (high / moderate / low) based on cosine distance, a plain-language interpretation of the predicted mechanism, and an interactive 3D map showing where your sequence lands.
Sequences that do not resemble known AMR proteins -- like hormones or housekeeping enzymes -- are correctly flagged as low confidence rather than silently assigned a wrong class.
Note: predictions are based on similarity to known AMR proteins in CARD. Novel resistance mechanisms or proteins outside this database may not be identified correctly.
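A stripped-down version of such a nearest-neighbour lookup with distance-based tiering might look like the sketch below. The threshold values are illustrative placeholders, not the app's actual cutoffs:

```python
import numpy as np

def query_landscape(query_emb, ref_embs, ref_labels, k=10,
                    high=0.15, moderate=0.30):
    """Rank reference AMR proteins by cosine distance to a query embedding
    and assign a confidence tier from the nearest hit.

    The `high` and `moderate` thresholds are made-up example values.
    """
    q = query_emb / np.linalg.norm(query_emb)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    dist = 1.0 - r @ q                      # cosine distance to every reference
    order = np.argsort(dist)[:k]            # k nearest references
    best = dist[order[0]]
    tier = "high" if best < high else "moderate" if best < moderate else "low"
    return [(ref_labels[i], float(dist[i])) for i in order], tier
```

A query embedding identical to a reference scores cosine distance 0 and lands in the "high" tier; unrelated sequences drift toward distance 1 and fall to "low", which matches the behaviour described above for non-AMR inputs.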
Open EmbedAMR.ipynb in Google Colab with a T4 GPU. The notebook runs
end to end in approximately 2 hours. Embedding steps checkpoint to Google
Drive so disconnects are safe.
| Cell | Description |
|---|---|
| 1 | GPU check |
| 2 | Install dependencies |
| 3 | Drive mount and checkpoint config |
| 4 | Download CARD v3.3.0 |
| 5 | Parse FASTA headers and join ARO metadata |
| 6 | Mechanism labels, balanced evaluation sets, control tagging |
| 7 | Fetch wild-type PBPs from UniProt |
| 8 | Combine CARD and WT PBP datasets |
| 9 | ESM2-650M embeddings (~90 min on T4) |
| 10 | Ankh-base embeddings (~45 min on T4) |
| 11 | Sequence deduplication check |
| 12 | PCA, 2D UMAP, 3D UMAP |
| 13 | Quantitative evaluation |
| 14 | Save app assets |
| 15 | Generate figures and Table 1 |
| 16 | Zip and export |
| Source | Version / filter | Sequences | Purpose |
|---|---|---|---|
| CARD broadstreet | v3.3.0 | 5,029 | AMR reference proteins |
| UniProt Swiss-Prot | KW-0573, bacteria, reviewed | 199 (187 after dedup) | Wild-type PBP controls |
All data is fetched programmatically inside the notebook; no manual downloads are needed. Twelve UniProt sequences were removed after exact and near-identical sequence matching against CARD.
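The dedup step could be sketched as below. The difflib similarity ratio and the `identity` threshold are stand-ins for whatever criterion the notebook actually applies; this only illustrates the exact-then-near-identical two-pass shape:

```python
from difflib import SequenceMatcher

def dedup_controls(controls, card_seqs, identity=0.98):
    """Drop control sequences that exactly match, or are near-identical
    to, any CARD sequence. Threshold and metric are illustrative."""
    card_set = set(card_seqs)
    kept = []
    for seq in controls:
        if seq in card_set:
            continue  # pass 1: exact duplicate
        if any(SequenceMatcher(None, seq, c).ratio() >= identity
               for c in card_seqs):
            continue  # pass 2: near-identical duplicate
        kept.append(seq)
    return kept
```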
| Model | Parameters | Embedding dim | Source |
|---|---|---|---|
| ESM2-650M | 652M | 1,280 | facebook/esm2_t33_650M_UR50D |
| Ankh-base | 453M | 768 | ElnaggarLab/ankh-base |
Mean pooling of the last hidden state, excluding CLS and EOS tokens. Sequences longer than 1,022 aa use sliding window embedding (window: 1,022, stride: 256, length-weighted averaging).
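The long-sequence scheme can be sketched as follows, with `embed_fn` standing in for the actual ESM2/Ankh mean-pooled forward pass (a hedged sketch, not the notebook's code):

```python
import numpy as np

def sliding_window_embed(seq, embed_fn, window=1022, stride=256):
    """Length-weighted sliding-window embedding for long sequences.

    `embed_fn(subseq)` is assumed to return the mean-pooled embedding of a
    subsequence (CLS/EOS excluded); here it is a placeholder for the model.
    """
    if len(seq) <= window:
        return embed_fn(seq)
    chunks, weights = [], []
    for start in range(0, len(seq), stride):
        sub = seq[start:start + window]
        chunks.append(embed_fn(sub))
        weights.append(len(sub))       # length-weighted averaging
        if start + window >= len(seq):
            break                       # last window reached the end
    return np.average(np.stack(chunks), axis=0, weights=np.asarray(weights, float))
```

Weighting by window length keeps a short trailing window from contributing as much as the full-length windows before it.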
Dimensionality reduction: StandardScaler, PCA (50 components, 89.7% variance), UMAP (n_neighbors=30, cosine metric, random_state=42).
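A dependency-light sketch of the first two stages (standardise, then PCA) is below; the UMAP step that follows in the notebook (umap-learn, n_neighbors=30, metric="cosine", random_state=42) is noted in the docstring rather than run, to keep the example scikit-learn-only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def reduce_embeddings(embeddings, n_pcs=50):
    """Standardise then project to the top principal components.

    In the notebook these PCA coordinates feed UMAP (n_neighbors=30,
    metric="cosine", random_state=42) for the 2D/3D maps; that step is
    omitted here so the sketch needs only scikit-learn.
    """
    scaled = StandardScaler().fit_transform(embeddings)
    pca = PCA(n_components=min(n_pcs, scaled.shape[1]))
    pcs = pca.fit_transform(scaled)
    # the notebook reports 89.7% variance retained at 50 components
    print(f"variance explained: {pca.explained_variance_ratio_.sum():.1%}")
    return pcs
```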
| Metric | ESM2 | Ankh |
|---|---|---|
| Silhouette L1 mechanism (UMAP 2D) | -0.129 | -0.133 |
| Silhouette L2 drug class (UMAP 2D) | -0.235 | -0.192 |
| Silhouette cephamycin cluster | 0.779 | 0.833 |
| Silhouette glycopeptide cluster | 0.216 | 0.454 |
| Neighbour overlap K=10 (PCA50) | 0.530 | -- |
| WT PBP to Class A BLA distance ratio | 185x | -- |
| WT PBP to CARD mutant PBP distance | 2.521 | -- |
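One plausible reading of the per-cluster silhouette rows above (cephamycin, glycopeptide) is the mean per-sample silhouette over the members of that one label, computed against the full labelling. A sketch under that assumption, which also shows how a single drug class can score high while the overall mean stays negative:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def cluster_silhouette(coords, labels, target):
    """Mean silhouette of the points carrying `target`, evaluated within
    the full labelling (e.g. one drug class on the 2D UMAP coordinates)."""
    s = silhouette_samples(coords, labels)
    mask = np.asarray(labels) == target
    return float(s[mask].mean())
```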
Positive controls
- TEM allele panel (205 sequences): within-family coherence
- Beta-lactamase classes A/B/C/D: mechanism-level separation
- MFS/RND/ABC efflux pump families: structural family separation
Negative controls
- Wild-type PBPs (187 sequences): sensitivity targets, not resistance genes
- WT PBP vs Class A BLA: shared fold ancestry, functionally opposite roles
EmbedAMR/
├── EmbedAMR.ipynb # reproducible notebook
├── requirements.txt
├── README.md
├── app/
│ ├── app.py # HuggingFace Spaces source
│ └── requirements.txt
└── results/
├── figures/ # interactive HTML panels
│ ├── panel1_esm2_mechanism.html
│ ├── panel2_ankh_mechanism.html
│ ├── panel3_bla_classes_wt_pbp.html
│ ├── panel4_efflux_families.html
│ └── EmbedAMR_3D_BLA_WT_PBP.html
└── table1_quantitative_results.csv
Precomputed embedding files (.npy, .pkl) are not tracked due to size. They are generated by the notebook and saved to Google Drive.
@misc{pranavathiyani2026embedamr,
author = {Pranavathiyani G},
title = {EmbedAMR: Exploring the AMR Protein Embedding Landscape
with Protein Language Models},
year = {2026},
url = {https://github.com/pranavathiyani/EmbedAMR}
}

Pranavathiyani G
Division of Bioinformatics, SASTRA Deemed University, Thanjavur, Tamil Nadu, India
pranavathiyani@scbt.sastra.edu
Code assistance: Claude (Anthropic)