research-paper-digital-twin

RAG-powered chatbot with FAISS knowledge base and chat history logging

This repository provides tools and scripts for creating and interacting with a retrieval-augmented generation (RAG) pipeline using LangChain. It includes components for processing papers into a searchable knowledge base, maintaining dialogue history for multi-turn conversations, and logging user sessions.

Directory Structure

1. kb/

  • Purpose: Contains scripts for processing research papers and creating a FAISS (Facebook AI Similarity Search) index for RAG.
  • Usage:
    • Place the papers you want to use for RAG in the pdfs/ directory.
    • If you would like to create your RAG knowledge base from GPT-summarized versions of the papers (recommended!):
      • Ensure your OpenAI API key is a system variable.
      • Run paper_processing.py, which will generate a summary .txt file under a new paper_processing_output/ directory.
    • If you would like to create your RAG knowledge base from the plain PDF documents, move on to the next step.
    • Run vector_store.py with the from_pdf variable set according to your decision. You may alter the metadata stored with each chunk around line 70; currently, the script fetches each paper's title from the web and stores it as metadata, which is shown to the user when interacting with the model via main_chain/chat_history.py.
    • The output FAISS index will also be stored in this directory.
  • Key Scripts:
    • paper_processing.py: Queries GPT via API calls to summarize all papers in pdfs/ and stores the summaries under paper_processing_output/.
    • vector_store.py: Chunks the input files (PDF documents or text summaries) via recursive splitting and stores the chunks alongside their embeddings in an index. You may specify the name of the index, but be sure to give the correct name to the scripts in main_chain/. Indexes are stored in this directory, kb/.
    • get_titles.py: Finds paper titles on the web using their SSRN IDs. This is useful for adding additional metadata to the RAG knowledge base. The web-based approach is prone to rate limiting (response code 429), so calling this function over many papers can be very time-consuming. Consider compiling the necessary metadata locally and modifying the knowledge-base creation in vector_store.py to save time.
  • Output:
    • .txt summaries from paper_processing.py saved to paper_processing_output/.
    • faiss_index/ directory holds knowledge base generated by vector_store.py.
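To illustrate the recursive splitting that vector_store.py performs (the actual script uses a LangChain splitter), here is a minimal pure-Python sketch of the idea. The chunk size, separator order, and function name are illustrative assumptions, not the script's real parameters:

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters,
    preferring to break at the coarsest separator available."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep = separators[0]
    rest = separators[1:] if len(separators) > 1 else separators
    if sep == "":
        # No natural boundary left: hard-split at chunk_size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        elif len(piece) > chunk_size:
            # Piece itself is too big: recurse with finer separators.
            if current:
                chunks.append(current)
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be embedded and stored in the FAISS index together with its metadata.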

2. main_chain/

  • Purpose: Handles multi-turn dialogue functionality and logging user interactions with the chatbot.
  • Usage:
    • In chat_history.py, specify the llm instances used for chat-history summarization and input-relevance assessment as well as for model responses. llm should be a stronger model (recommended: Llama 70B) than llm2 (recommended: Llama 8B). If you have enough compute to generate responses quickly, both llm and llm2 may point to the same (strong) instance.
    • Specify the path to the knowledge base created using vector_store.py.
    • Running the script will prompt the user to select the level of restrictiveness of model output.
      • 0: Model is unrestricted and will only refuse to answer questions that relate to privacy concerns, criminal behaviour, or hateful speech.
      • 1: Queries to the model are passed through a 'discriminator' that decides whether the question would be relevant to a professor taking questions from an audience. If not, the main model will politely refuse to answer.
      • 2: Model will only respond to queries that produce a hit in its knowledge base.
    • Model output will be logged to text files in directories specified by the selected restrictiveness.
    • Debug printing is ON: the console logs the list of papers the model used to synthesize its response. If restrictiveness level 1 is selected, the DISCRIMINATOR output will be displayed. If restrictiveness level 2 is selected, 'NO HITS' will be printed if the model finds nothing in the knowledge base.
    • The variable containing the model output (to be read aloud by a speech model) is response in the call_model function.
  • Key Scripts:
    • chat_history.py: Supports multi-turn dialogue by maintaining the chat history context for interactions with the RAG-powered LLM. LLM prompts are found here, to be altered to user specification. Debug print statements may be altered or deleted.
    • chat_logging.py: Functionality for logging user sessions to text files for later reference.
  • Output:
    • The logged text files are saved in directories within main_chain/.

3. vanilla_kb/

  • Purpose: A primitive implementation of a knowledge base for testing and exploration.
  • Contents:
    • Manual implementation of knowledge base functionality by explicitly chunking passages, calculating their embeddings, and storing them in a Python dictionary.
    • Includes dummy passages for experimentation and prototyping.
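The dictionary-based approach can be sketched in a few lines. A toy bag-of-words embedding stands in here for the real embedding model, and all names are illustrative rather than taken from the actual vanilla_kb/ code:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lower-cased word counts (the real code would
    call an embedding model instead)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_kb(passages):
    """Store each passage alongside its embedding in a plain dict."""
    return {i: {"text": p, "embedding": embed(p)} for i, p in enumerate(passages)}

def retrieve(kb, query, k=1):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(kb.values(), key=lambda e: cosine(q, e["embedding"]), reverse=True)
    return [e["text"] for e in ranked[:k]]
```

A FAISS index plays the same role as this dictionary, but with approximate nearest-neighbour search that scales to many more chunks.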

Setup

Prerequisites

  • Python 3.8 or later
  • Required Python packages are listed in requirements.txt.

Installation

  1. Clone the repository:
git clone https://github.com/soljt/research-paper-digital-twin.git
cd research-paper-digital-twin
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

Usage

Running Scripts

All scripts should be executed from the root directory of the repository to ensure file paths are correctly resolved. For example:

python main_chain/chat_history.py

Workflow

  1. Create Knowledge Base:
  • Place papers in kb/pdfs/
  • Use the provided scripts in kb/ to create the knowledge base (see kb/ above)
  2. Interact with the digital twin:
  • Use the scripts in main_chain/ to interact via the console
