This repository contains the code for STAG-LLM, a novel model for predicting TCR-pMHC binding specificity by integrating sequence information from a pre-trained Large Language Model (ESM-2) with structural insights captured by a Graph Neural Network (GNN).
For the easiest inference with our model, run STAG-LLM in our Google Colab notebook. Note that you will first need to model your TCR-pHLA complex using TCRmodel2.
The STAG-LLM model combines sequence embeddings generated by a fine-tuned ESM-2 model with graph representations derived from TCR-pMHC structures. These two modalities are then combined for binding specificity prediction.
IMPORTANT: Before running any code, you must download the data folder, which contains the raw input data for the project. Please download it from data and place it in the root directory of this project.
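As a quick sanity check after downloading, the following minimal sketch reports which expected files are missing from the data folder. The file names are taken from the directory tree in this README; the helper name `missing_data_files` is hypothetical and is not part of the repository.

```python
from pathlib import Path

# File names as listed under data/ in this README's directory tree.
EXPECTED_DATA_FILES = (
    "full_seq_df_new.csv",
    "final_dataset_modeled.csv",
    "top_structures.zip",
)


def missing_data_files(project_root="."):
    """Return the expected data files that are absent under <project_root>/data."""
    data_dir = Path(project_root) / "data"
    return [name for name in EXPECTED_DATA_FILES if not (data_dir / name).is_file()]
```

Calling `missing_data_files()` from the project root returns an empty list when all three files are in place.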
After downloading the data folder, you need to unzip and preprocess the PDB files to convert them into graph representations that can be used by the model.
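If the archive has to be extracted by hand, a minimal sketch using only the Python standard library could look like the one below. The output directory `data/top_structures` and the helper name `extract_structures` are assumptions made for illustration; check pdbs_to_graphs.py for the location it actually reads from, since it may also handle extraction itself.

```python
import zipfile
from pathlib import Path


def extract_structures(zip_path="data/top_structures.zip",
                       out_dir="data/top_structures"):
    """Extract the archive and return the sorted paths of all .pdb files found.

    out_dir is an assumed location, not one confirmed by the repository.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    # Search recursively in case the archive contains subfolders.
    return sorted(out_dir.rglob("*.pdb"))
```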
Run the pdbs_to_graphs.py script:
python pdbs_to_graphs.py
To replicate the experiments from the paper and train the STAG-LLM model from scratch:
python train.py
Training progress and evaluation metrics will be logged in the test directory (or the directory configured in train.py).
Pretrained models are provided in the pretrained_models directory. Please download it from pretrained_models and place it in the root directory of this project. You can use these models to score individual input PDB files or evaluate on a test set.
- Place your input PDB files in a designated directory. (PDB files must contain chains D and E for the TCR and chains A and C for the pMHC. We recommend modeling structures with TCRmodel2.)
- Run the evaluate.py script:
python evaluate.py --model_path path/to/your/pretrained_model.pt --pdb_file path/to/your/input.pdb
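Before scoring, it can help to confirm that a PDB file actually contains the required chains. The sketch below reads chain IDs from column 22 of ATOM/HETATM records (the standard PDB fixed-width layout) and assembles the evaluate.py command shown above. The helper names are hypothetical and not part of the repository; only the `--model_path` and `--pdb_file` flags come from this README.

```python
# Chains required by STAG-LLM per this README: A, C for the pMHC; D, E for the TCR.
REQUIRED_CHAINS = {"A", "C", "D", "E"}


def chains_in_pdb(pdb_path):
    """Collect chain IDs from ATOM/HETATM records (column 22, i.e. index 21)."""
    chains = set()
    with open(pdb_path) as handle:
        for line in handle:
            if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
                chain = line[21].strip()
                if chain:
                    chains.add(chain)
    return chains


def missing_chains(pdb_path):
    """Return the required chain IDs that do not appear in the file, sorted."""
    return sorted(REQUIRED_CHAINS - chains_in_pdb(pdb_path))


def evaluate_command(model_path, pdb_file):
    """Build the argument list for the evaluate.py call documented above."""
    return ["python", "evaluate.py",
            "--model_path", str(model_path),
            "--pdb_file", str(pdb_file)]
```

The list returned by `evaluate_command` can be passed to `subprocess.run` to score each file in a directory once `missing_chains` returns an empty list for it.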
.
├── data/
│ ├── full_seq_df_new.csv
│ ├── final_dataset_modeled.csv
│ └── top_structures.zip (PDB data files)
├── hetero_edge_graphs/ (generated by pdbs_to_graphs.py)
├── pretrained_models/
│ └── ... (pretrained model checkpoints)
├── requirements.txt
├── train.py
├── model.py
├── data_handling.py
├── utils.py
├── pdbs_to_graphs.py
├── evaluate.py
├── README.md
└── STAG_LLM_image.png (image asset)
- For the STAG model, please visit STAG
- For the NetTCR 2.2 model, please visit NetTCR 2.2
- For the TCR-ESM model, please visit TCR-ESM
- For the ERGO-II models (AE and LSTM), please visit ERGO-II
Jared K. Slone, Minying Zhang, Peixin Jiang, Amanda Montoya, Emily Bontekoe, Barbara Nassif Rausseo, Alexandre Reuben, Lydia E. Kavraki, STAG-LLM: Predicting TCR-pHLA binding with protein language models and computationally generated 3D structures, Computational and Structural Biotechnology Journal, Volume 27, 2025, Pages 3885-3896, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2025.09.004. (https://www.sciencedirect.com/science/article/pii/S2001037025003642)
