This project was developed at the 2023 Carnegie Mellon & DNAnexus Hackathon, held October 19-21, 2023. Scientists from across the country worked together, in person and remotely, to advance applications in data management and graph extraction for large models in the biomedical space.
Our team came together from multiple locations:
- Emerson Huitt @ Raleigh, NC
- Glenda Tan @ Pittsburgh, PA
- Minh Tran @ Pittsburgh, PA
- Mingyue (Phoebe) Li @ Pittsburgh, PA
- David Enoma @ Canada
- Van Truong @ Philadelphia, PA
Large language models (LLMs) can answer queries quickly and efficiently, making them potentially useful for recommending drug treatments for various diseases, where time is of the essence. However, because their training data may be factually incorrect or outdated, LLMs are often unreliable at recommending drug treatments.
Knowledge graphs store factually correct relationships between data points, which enables them to answer complex queries. Hence, we believe that fine-tuning LLMs with knowledge graphs can make them more reliable at recommending drug treatments.
Our project aims to improve the ability of LLMs to generate drug treatment recommendations. We constructed a knowledge graph catered to therapeutics and then fine-tuned an LLM on this information network.
```mermaid
graph LR;
A["Small-scale experiments
following the LLM & KG paper [1]"]
B["Generate a graph of a few
hundred relationships for a
small-scale experiment"]
D["Load raw graph data (e.g. DrugBank)"]
C["Encode entire graph in
triple form as text
LLM context"]
X["Use KG-based context for
fact-based grounding"]
E{"Graph fits in
context window?"}
F["Generate query for
relevant triples"]
G["Extract triples from graph"]
H{"Small enough for
context window?"}
I["Hierarchically refine
query prompt with triples"]
A-->D
B-->D
D-->E
E-->|Yes|C
C-->X
E-->|No|F
F-->G
G-->H
H-->|Yes|X
H-->|No|I
I-->F
```
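To make the branch in the flowchart concrete, here is a minimal sketch of the context-window check, assuming a Hugging Face tokenizer and a hypothetical list of `(drug, relationship, target)` triples; the model ID and window size are illustrative, not a fixed part of our pipeline.

```python
# Sketch: decide whether the serialized graph fits in the LLM context window.
# The model ID and window size below are assumptions for illustration.
from transformers import AutoTokenizer

CONTEXT_WINDOW = 4096  # LLaMA 2's context length
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Hypothetical triples; in the project these come from the knowledge graph.
triples = [("D07OAC", "inhibitor", "S5A2_HUMAN")]
graph_as_text = "\n".join(f"({d}, {r}, {t})" for d, r, t in triples)

if len(tokenizer.encode(graph_as_text)) <= CONTEXT_WINDOW:
    context = graph_as_text  # ground the LLM on the entire graph
else:
    # Too large: fall back to querying for relevant triples and
    # hierarchically refining the prompt, as in the flowchart.
    context = None
```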
We compiled 38,617 drug-relationship-target triples from the Therapeutic Target Database (TTD; https://db.idrblab.net/ttd/) [4] and stored them in `drug_relationship_target.csv`.
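Loading and inspecting these triples might look like the sketch below; the column layout of `drug_relationship_target.csv` is an assumption, since the schema is not reproduced here.

```python
# Sketch: load the TTD-derived triples.
# Assumes one row per (drug, relationship, target) triple.
import pandas as pd

df = pd.read_csv("drug_relationship_target.csv")       # 38,617 rows
triples = list(df.itertuples(index=False, name=None))  # list of 3-tuples
print(len(triples), triples[:3])
```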
Below is a knowledge graph generated from 20 sample triples.
We wrote a preprocessing algorithm that converts each triple into a prompt-response pair for fine-tuning LLaMA 2. For instance, a sample prompt would be "[INST] Tell me more about the drug with ID D07OAC. [/INST]", and its corresponding response would be "Drug D07OAC is an inhibitor to target protein S5A2_HUMAN."
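A minimal sketch of that conversion is shown below. The templates mirror the example above, while the function name and the article handling ("an inhibitor") are illustrative simplifications.

```python
# Sketch of the triple-to-prompt conversion. The templates mirror the sample
# pair in the text; the function name and phrasing are illustrative.
def triple_to_example(drug_id: str, relationship: str, target: str) -> dict:
    prompt = f"[INST] Tell me more about the drug with ID {drug_id}. [/INST]"
    response = f"Drug {drug_id} is an {relationship} to target protein {target}."
    return {"text": f"{prompt} {response}"}

# The sample pair from the text:
print(triple_to_example("D07OAC", "inhibitor", "S5A2_HUMAN"))
```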
Traditional fine-tuning approaches retrain large numbers of the LLM's full-precision weights, which is computationally expensive. To overcome this, we leveraged QLoRA [2], an efficient parameter tuning method that combines low-rank adaptation (LoRA) [3] with double quantization to reduce training and inference costs. A LLaMA 2 7B model was fine-tuned on our preprocessed data for 3 epochs. Training was done on an NVIDIA A100 and took approximately 3 hours.
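The sketch below shows how such a QLoRA setup might look with the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries; the hyperparameters and target modules are illustrative assumptions, not our exact configuration.

```python
# Sketch: QLoRA setup (4-bit NF4 quantization with double quantization,
# plus LoRA adapters). Hyperparameters here are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,  # double quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trained
```

Because only the small adapter matrices are updated while the base weights stay frozen in 4-bit form, a 3-epoch run on a single A100 becomes feasible.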
Our fine-tuned model, LLaMA2Glenda, is deployed on Hugging Face: https://huggingface.co/tminh/llama-2-7b-glendalorian
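A hedged usage example follows, assuming the published checkpoint is a merged model (if it is a bare LoRA adapter, `peft`'s `AutoPeftModelForCausalLM` would be used instead):

```python
# Sketch: query the deployed model. Assumes the Hub checkpoint is a merged
# model rather than a bare LoRA adapter.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tminh/llama-2-7b-glendalorian"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "[INST] Tell me more about the drug with ID D07OAC. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```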
Finally, our fine-tuned model was benchmarked against ChatGPT and the original LLaMA 2 model using the prompt "What can the drug with ID D0Y6UB do?". ChatGPT and the original LLaMA 2 model were unable to answer this prompt, whereas our fine-tuned model suggested possible target proteins for the drug and the corresponding binding relationships.
ChatGPT's response:
Our fine-tuned LLaMA 2's response:
The landscape of well-curated biomedical knowledge graphs is vast. In the future, we plan to fine-tune LLMs on additional biomedical KGs, enabling them to answer a wider range of treatment-related queries. We also plan to fine-tune more LLMs on protein-protein interactions and drug-drug relationships.
| Knowledge Source | Open Source? | Date Released | URL |
|---|---|---|---|
| Petagraph | Yes | 2023 | Link / Paper |
| PrimeKG | Yes | 2023 | Link / Paper |
| RTX-KG2 | Yes | 2022 | Link / Paper |
| Drug Repurposing Knowledge Graph | Yes | 2021 | Link / Paper |
| ROBOKOP | Yes | 2019 | Link / Paper |
| Reactome Graph | Yes | 2018 | Link / Paper |
| Hetionet | Yes | 2017 | Link / Paper |
| DrugBank | No | 2008 | Link / Paper |
1. Pan, S., Luo, L., Wang, Y., Chen, C., et al. (2023). Unifying Large Language Models and Knowledge Graphs: A Roadmap. arXiv. https://doi.org/10.48550/arXiv.2306.08302
2. Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv. https://arxiv.org/abs/2305.14314
3. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv. https://arxiv.org/abs/2106.09685
4. Zhou, Y., Zhang, Y. T., Zhao, D. H., Yu, X. Y., Shen, X. Y., Zhou, Y., Wang, S. S., Qiu, Y. Q., Chen, Y. Z., & Zhu, F. (2023). TTD: Therapeutic Target Database describing target druggability information. Nucleic Acids Research. https://doi.org/10.1093/nar/gkad751. PMID: 37713619
5. Stear, B. J., Ahooyi, T. M., Vasisht, S., Simmons, A., Beigel, K., Callahan, T. J., Silverstein, J. C., & Taylor, D. M. (2023). Petagraph: A large-scale unifying knowledge graph framework for integrating biomolecular and biomedical data. bioRxiv. https://www.biorxiv.org/content/10.1101/2023.02.11.528088v1
6. Tian, Y., Song, H., Wang, Z., Wang, H., Hu, Z., Wang, F., Chawla, N. V., & Xu, P. (2023). Graph Neural Prompting with Large Language Models. arXiv:2309.15427. https://doi.org/10.48550/arXiv.2309.15427