
STAG-LLM

This repository contains the code for STAG-LLM, a novel model for predicting TCR-pMHC binding specificity by integrating sequence information from a pre-trained Large Language Model (ESM-2) with structural insights captured by a Graph Neural Network (GNN).

Executable Notebook

The easiest way to run inference with our model is our Google Colab notebook. Note that you will first need to model your TCR-pHLA complex using TCRmodel2.

Model Architecture

The STAG-LLM model combines sequence embeddings generated by a fine-tuned ESM-2 model with graph representations derived from TCR-pMHC structures; the two modalities are then fused for binding specificity prediction.

STAG-LLM Architecture
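
For intuition, the sketch below shows one way a sequence branch and a structure branch could be fused. This is a simplified illustration, not the published architecture: the GCN layers, the layer sizes, mean pooling, and concatenation-based fusion are all assumptions here; see model.py for the actual implementation.

# Simplified sketch of sequence + structure fusion (illustrative only).
# Assumptions: ESM-2 embeddings are precomputed per complex, the GNN is a
# two-layer GCN, and fusion is concatenation followed by an MLP head.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class FusionSketch(nn.Module):
    def __init__(self, esm_dim=1280, node_dim=20, hidden=128):
        super().__init__()
        self.conv1 = GCNConv(node_dim, hidden)        # structure branch
        self.conv2 = GCNConv(hidden, hidden)
        self.seq_proj = nn.Linear(esm_dim, hidden)    # sequence branch
        self.classifier = nn.Sequential(              # fused prediction head
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x, edge_index, batch, esm_embedding):
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        graph_repr = global_mean_pool(h, batch)       # one vector per complex
        seq_repr = self.seq_proj(esm_embedding).relu()
        fused = torch.cat([graph_repr, seq_repr], dim=-1)
        return self.classifier(fused)                 # binding logit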

Data Preparation

IMPORTANT: Before running any code, you must download the data folder, which contains the raw input data for the project. Please download it from data and place it in the root directory of this project.

After downloading the data folder, you need to unzip and preprocess the PDB files to convert them into graph representations that can be used by the model. Run the pdbs_to_graphs.py script:

python pdbs_to_graphs.py
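
Conceptually, the conversion treats residues as graph nodes and links residues that are close in 3D space. The snippet below is a minimal sketch of that idea using Biopython and PyTorch Geometric; the Cα representation, the 10 Å cutoff, and the placeholder node features are illustrative assumptions, not necessarily what pdbs_to_graphs.py does.

# Illustrative residue-contact graph from a PDB file (not the exact logic
# of pdbs_to_graphs.py). Nodes are residues; edges connect C-alpha atoms
# within an assumed 10 Angstrom cutoff.
import torch
from Bio.PDB import PDBParser
from torch_geometric.data import Data

def pdb_to_graph(pdb_path, cutoff=10.0):
    structure = PDBParser(QUIET=True).get_structure("complex", pdb_path)
    # Collect C-alpha coordinates for every residue that has one.
    ca_coords = [
        torch.tensor(res["CA"].coord, dtype=torch.float)
        for res in structure.get_residues()
        if "CA" in res
    ]
    coords = torch.stack(ca_coords)
    # Connect residue pairs within the distance cutoff (no self-loops).
    dist = torch.cdist(coords, coords)
    src, dst = torch.where((dist < cutoff) & (dist > 0))
    edge_index = torch.stack([src, dst])
    # Placeholder node features; real features could encode residue identity.
    x = torch.ones(coords.size(0), 1)
    return Data(x=x, edge_index=edge_index, pos=coords)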

Training the Model

To replicate the experiments from the paper and train the STAG-LLM model from scratch:

python train.py

Training progress and evaluation metrics will be logged in the test directory (or the directory configured in train.py).

Using Pretrained Models for Evaluation

Pretrained models are provided in the pretrained_models directory. Please download it from pretrained_models and place it in the root directory of this project. You can use these models to score individual input PDB files or evaluate on a test set.

  1. Place your input PDB files in a designated directory. (PDB files must contain chains D and E for the TCR and chains A and C for the pMHC; we recommend modeling structures with TCRmodel2. A minimal chain check is sketched after these steps.)

  2. Run the evaluate.py script:

    python evaluate.py --model_path path/to/your/pretrained_model.pt --pdb_file path/to/your/input.pdb
    
    
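Because the model expects this fixed chain layout, it can help to verify chain IDs before scoring. Below is a minimal sanity check using Biopython; the check_chains helper is our own illustration, not part of this repository.

# Minimal sanity check that an input PDB has the expected chain layout:
# chains D and E for the TCR, chains A and C for the pMHC.
from Bio.PDB import PDBParser

def check_chains(pdb_path, required=("A", "C", "D", "E")):
    structure = PDBParser(QUIET=True).get_structure("complex", pdb_path)
    chain_ids = {chain.id for chain in structure.get_chains()}
    missing = set(required) - chain_ids
    if missing:
        raise ValueError(f"{pdb_path} is missing chains: {sorted(missing)}")
    print(f"{pdb_path}: found chains {sorted(chain_ids)}")

check_chains("path/to/your/input.pdb")
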

Project Structure

.
├── data/
│   ├── full_seq_df_new.csv
│   ├── final_dataset_modeled.csv
│   └── top_structures.zip (pdb data files)
├── hetero_edge_graphs/ (generated by pdbs_to_graphs.py)
├── pretrained_models/
│   └── ... (pretrained model checkpoints)
├── requirements.txt
├── train.py
├── model.py
├── data_handling.py
├── utils.py
├── pdbs_to_graphs.py
├── evaluate.py
├── README.md
└── STAG_LLM_image.png (image asset)

Comparison to Existing Models

We compared our approach to five models from the literature:

  • For the STAG model, please visit STAG
  • For the NetTCR 2.2 model, please visit NetTCR 2.2
  • For the TCR-ESM model, please visit TCR-ESM
  • For the ERGO II models (AE and LSTM), please visit ERGO-II

Citation

Jared K. Slone, Minying Zhang, Peixin Jiang, Amanda Montoya, Emily Bontekoe, Barbara Nassif Rausseo, Alexandre Reuben, Lydia E. Kavraki, STAG-LLM: Predicting TCR-pHLA binding with protein language models and computationally generated 3D structures, Computational and Structural Biotechnology Journal, Volume 27, 2025, Pages 3885-3896, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2025.09.004. (https://www.sciencedirect.com/science/article/pii/S2001037025003642)
