Skip to content

bgzdaniel/PubMedTempGraph

Repository files navigation

PubMedTempGraph

Contributors:

Remark: Joint work is committed with both names in the commit message

Overview

This project utilizes a combination of Langchain, ChromaDB, and llama.cpp to build a retrieval augmented generation system for medical question answering. The project is structured into modular pipelines that can be run end-2-end to first obtain the data, preprocess and embed the data, and later perform queries to interact with the retrieved information similar to a Q&A chatbot. Said chatbot contains two answer modes, a research mode and an overview mode. In the research mode, the user's question is answered using the latest research as context. In the overview mode a timeline overview of research relevant to the user's query is generated.

Technologies Used

  • Langchain: Langchain is a framework for developing applications powered by language models, including information retrievers, text generation pipelines and other wrappers to facilitate a seamless integration of LLM-related open-source software.

  • ChromaDB: Chroma DB is an open-source vector storage system designed for efficiently storing and retrieving vector embeddings.

  • llama.cpp: llama.cpp implements Meta's LLaMa2 architecture in efficient C/C++ to enable a fast local runtime.

Installation & set-up

  1. Prerequisites:

    • Ensure you have Python installed on your system. Your Python version should match 3.10.
    • Ensure to have conda installed on your system.
    • Create a folder where you want to store the project.
  2. Create a Conda Environment:

    • Create a conda environment
    • Activate the environment
    conda create --name your_project_env python=3.10
    conda activate your_project_env
  3. Clone the Repository into your working directory:

    git clone git@github.com:bgzdaniel/PubMedTempGraph.git

    When using Mac set pgk_config path:

    export PKG_CONFIG_PATH="/opt/homebrew/opt/openblas/lib/pkgconfig"

    then switch to the working directory of the project:

    cd PubMedTempGraph
  4. Install Dependencies:

    pip install -r requirements.txt
  5. Llama.cpp GPU installation: (When using CPU only, skip this step.)

    This part might be slightly tricky, depending on which system the installation is done. We do NOT recommend installation on Windows. It has been tested, but requires multiple components which need to be downloaded. Please contact Daniel Bogacz for details.

    Linux:

    CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

    MacOS:

    CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

If anything goes wrong in this step, please contact Daniel Bogacz for Linux installation issues and Kenneth Styppa for MacOS installation issues. Also refer to the installation guide provided here and also here

  1. Download chroma store and model files and place them into the right location:
    • Download the model file. Insert the model file at data/mistral-7b-instruct-v0.2.Q6_K.
    • For a complete setup, a vector store is necessary. One can create his own store with the step Embedding of abstracts and Loading embeddings to the vector database ChromaDB in the section More Usage Possibilities. For a minimal setup go to this google drive link and download the ChromaDB store (folder called chroma_store_abstracts). Insert the ChromaDB store at data/chroma_store.

Usage

Using the Q&A system

  1. Navigate to the main folder PubMedTempGraph in your terminal:

    cd PubMedTempGraph
  2. Activate the Q&A System:

    python -m chat
  3. Interact with the system:

    • Ask your question
    Please enter your question: [your_question]
    • Ask another question (and so on)

Note: Running the system for the first time might take some additional seconds because the model and the vector database has to be initialized.

Trouble-shooting:

  • If you encounter an issue during your usage install pyspellchecker separately and try again:
    pip install pyspellchecker
  • When encountering issues in the Llama.cpp installation, make sure you have NVIDIA Toolkit installed. Check with:
    nvcc --version
    Something similar to the following should appear:
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2023 NVIDIA Corporation
    Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
    Cuda compilation tools, release 12.1, V12.1.66
    Build cuda_12.1.r12.1/compiler.32415258_0
    Also make sure that CMake is installed on your system.

More Usage Possibilities:

  1. Embedding of abstracts:

    • For embedding abstracts files like studies_{year}.csv are required. Place them in data/studies. See here for the data. For example for the file studies_2019.csv the following command needs to be executed:
    python -m abstract2chroma.abstract2vec --year "2019"
  2. Loading embeddings to the vector database ChromaDB:

    • For loading abstract-based embeddings to the vector database, the files of type embeddings_{year}.csv are required.
    • Place it in data/embeddings.
    • Create them with the step Embedding of abstracts.
    • For the 'research' mode: Multiple embedding files need to be stored in one chroma store. For example for the files embeddings_2019.csv and embeddings_2020.csv the following command needs to be executed:
    python -m abstract2chroma.vec2chroma --years "2019,2020" --create_new
    • For the 'overview' mode: Multiple embedding files are needed to create multiple chroma stores for efficiency reasons. For example for the files embeddings_2019.csv and embeddings_2020.csv the following command needs to be executed:
    python -m abstract2chroma.vec2chroma_year --years "2019,2020" --create_new
    • The argument --create_new instructs the script to delete an existing database and create a new one. If a database already exists at data/chroma_store, then leave the argument parameter out of the execution command.
  3. Running Validation and Evaluation:

    • For the evaluation, BleuRT is required. First clone bleuRT:
    git clone https://github.com/google-research/bleurt.git

    Go in into the subfolder 'bleurt':

    cd bleurt

    Specifically for MacOS: Because tensorflow is differently named under MacOS, the install requirements have to be changed. Go to bleurt/setup.py and change in the list variable install_requires the entry tensorflow to tensorflow-macos. It should look like the following:

    install_requires = [
     "pandas", "numpy", "scipy", "tensorflow-macos", "tf-slim>=1.1", "sentencepiece"
    ]

    Save the file.

    Install bleuRT with the following:

    pip install . 

    The used BleuRT model can be found here. Place it under data/BLEURT-20-D12.

    For evaluation the steps Embedding of abstracts or paragraphs and Loading embeddings to the vector database ChromaDB have to be completed. Run the evaluation with the following command:

    python -m eval

    Acknowledgements

    We thank Prof. Herweg for his engaging course in "Event Processing" and the freedom to pursue this project.

    Disclaimer

    We confirm that we did this project on our own and only with the tools and means mentioned in the technical report submitted with this repository.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published