This repository provides tools and scripts for creating and interacting with a retrieval-augmented generation (RAG) pipeline using LangChain. It includes components for processing papers into a searchable knowledge base, maintaining dialogue history for multi-turn conversations, and logging user sessions.

## kb/

- Purpose: Contains scripts for processing research papers and creating a FAISS (Facebook AI Similarity Search) index for RAG.
- Usage:
  - Place the papers you want to use for RAG in the `pdfs/` directory.
  - If you would like to create your RAG knowledge base from GPT-summarized versions of the papers (recommended!):
    - Ensure your OpenAI API key is set as an environment variable.
    - Run `paper_processing.py`, which will generate a summary `.txt` file under a new `paper_processing_output/` directory.
  - If you would like to create your RAG knowledge base from the plain PDF documents, move on to the next step.
  - Run `vector_store.py` with the `from_pdf` variable set according to your decision (a minimal sketch of this step appears after this list). You may alter the metadata to be stored with each chunk around line 70; currently the script fetches each paper's title from the web and stores it as metadata, to be provided to the user when interacting with the model via `main_chain/chat_history.py`.
  - The output FAISS index will also be stored in this directory.
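For orientation, the core of the indexing step looks roughly like the sketch below, assuming LangChain's community loaders and an OpenAI embedding model. The `from_pdf` flag's semantics, the paths, and the chunk sizes are placeholders; defer to `vector_store.py` for the real configuration and metadata handling.

```python
# Minimal sketch of building a FAISS index with LangChain (assumed setup;
# see vector_store.py for the actual configuration and metadata handling).
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

from_pdf = True  # mirrors the script's from_pdf flag (assumed semantics)

# Load either the raw PDFs or the GPT summaries from paper_processing.py.
src_dir = Path("kb/pdfs") if from_pdf else Path("kb/paper_processing_output")
docs = []
for path in sorted(src_dir.iterdir()):
    loader = PyPDFLoader(str(path)) if from_pdf else TextLoader(str(path))
    docs.extend(loader.load())

# Recursive splitting keeps chunks near a target size; sizes here are guesses.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed the chunks and persist the index for main_chain/chat_history.py.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())
index.save_local("kb/faiss_index")
```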
- Key Scripts:
  - `paper_processing.py`: Queries GPT via API calls to summarize all papers in `pdfs/` and stores the summaries under `paper_processing_output/`.
  - `vector_store.py`: Chunks the input files (PDF documents or text summaries) via recursive splitting and stores the chunks alongside their embeddings in an index. You may specify the name of the index, but be sure to give the correct name to the scripts in `main_chain/`. Indexes are stored in this directory, `kb/`.
  - `get_titles.py`: Finds paper titles on the web using their SSRN IDs, which is useful for adding additional metadata to the RAG knowledge base. The web-based approach is prone to rate limiting (response code 429), so calling this function over many papers can take a long time (one way to cope with this is sketched after this list). Consider compiling the necessary metadata locally and modifying the creation of the knowledge base in `vector_store.py` to save time.
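A rate-limit-tolerant lookup might look like the following. The URL pattern and `<title>`-tag parsing are assumptions; see `get_titles.py` for the real implementation.

```python
# Sketch of a title lookup by SSRN ID with backoff on 429 responses
# (assumed URL pattern and parsing; see get_titles.py for the real code).
import re
import time
from typing import Optional

import requests

def fetch_title(ssrn_id: str, max_retries: int = 5) -> Optional[str]:
    url = f"https://papers.ssrn.com/sol3/papers.cfm?abstract_id={ssrn_id}"
    for attempt in range(max_retries):
        resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        if resp.status_code == 429:
            time.sleep(2 ** attempt)  # exponential backoff on rate limiting
            continue
        resp.raise_for_status()
        match = re.search(r"<title>(.*?)</title>", resp.text, re.DOTALL)
        return match.group(1).strip() if match else None
    return None
```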
- Output:
  - `.txt` summaries from `paper_processing.py`, saved to `paper_processing_output/`.
  - The `faiss_index/` directory holds the knowledge base generated by `vector_store.py` (a load-and-query sketch follows).
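To sanity-check the resulting index, it can be loaded back and queried. This sketch assumes a recent LangChain version (hence the `allow_dangerous_deserialization` flag) and the default index location; the query string is purely illustrative.

```python
# Sketch: load the saved index and run a similarity search (assumed path).
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

index = FAISS.load_local(
    "kb/faiss_index",
    OpenAIEmbeddings(),
    allow_dangerous_deserialization=True,  # required by recent LangChain versions
)
for doc in index.similarity_search("What is this paper's main finding?", k=3):
    print(doc.metadata.get("title"), "->", doc.page_content[:80])
```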
## main_chain/

- Purpose: Handles multi-turn dialogue functionality and logging user interactions with the chatbot.
- Usage:
  - In `chat_history.py`, specify the LLM instances to use for chat-history summarization/input relevance assessment as well as model responses. `llm` should be a stronger model (recommended: Llama 70B) than `llm2` (recommended: Llama 8B). If you have enough compute to quickly generate responses, both `llm` and `llm2` may use the same (strong) instance.
  - Specify the path to the knowledge base created using `vector_store.py`.
  - Running the script will prompt the user to select the level of restrictiveness of model output (a schematic of this flow appears after this list):
    - 0: The model is unrestricted and will only refuse to answer questions that relate to privacy concerns, criminal behaviour, or hateful speech.
    - 1: Queries to the model are passed through a 'discriminator' that decides whether the question would be relevant to a professor taking questions from an audience. If not, the main model will politely refuse to answer.
    - 2: The model will only respond to queries that produce a hit in its knowledge base.
  - Model output will be logged to text files in directories determined by the selected restrictiveness.
  - Debug printing is ON; the console logs the list of papers the model used to synthesize its response. If restrictiveness level 1 is selected, the DISCRIMINATOR output will be displayed. If restrictiveness level 2 is selected, 'NO HITS' will be printed if the model finds nothing in the knowledge base.
  - The variable containing the model output (to be read aloud by a speech model) is `response` in the `call_model` function.
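For a rough picture of how the pieces fit together at restrictiveness level 1, here is a schematic with `ChatOpenAI` standing in for the recommended Llama instances; the prompts are invented, and the real ones (along with the actual `call_model`) live in `chat_history.py`.

```python
# Schematic of the two-model flow at restrictiveness level 1. ChatOpenAI
# stands in for the recommended Llama instances; prompts are invented here
# and differ from the real ones in chat_history.py.
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")        # stand-in for the strong model (e.g. Llama 70B)
llm2 = ChatOpenAI(model="gpt-4o-mini")  # stand-in for the weaker model (e.g. Llama 8B)

history = [SystemMessage(content="You are a professor answering audience questions.")]

def call_model(user_input: str) -> str:
    # Discriminator step: llm2 decides whether the question is in scope.
    verdict = llm2.invoke(
        "Answer YES or NO: would this question be relevant to a professor "
        "taking questions from an audience?\n\n" + user_input
    ).content
    if verdict.strip().upper().startswith("NO"):
        return "I'm sorry, I'd rather stick to questions about my research."
    history.append(HumanMessage(content=user_input))
    response = llm.invoke(history).content  # `response` feeds the speech model
    history.append(AIMessage(content=response))
    return response

print(call_model("What does your research say about voter turnout?"))
```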
- Key Scripts:
  - `chat_history.py`: Supports multi-turn dialogue by maintaining the chat history context for interactions with the RAG-powered LLM. The LLM prompts are found here and can be altered to user specification. Debug print statements may be altered or deleted.
  - `chat_logging.py`: Functionality for logging user sessions to text files for later reference (a toy version follows this list).
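The logging itself amounts to little more than appending timestamped lines to a session file. Below is a toy version; the directory layout and the `log_exchange` helper are invented for illustration and are not the script's real API.

```python
# Toy session logger: appends each exchange to a timestamped file
# (invented layout; see chat_logging.py for the real behaviour).
from datetime import datetime
from pathlib import Path

def log_exchange(log_dir: str, user_input: str, response: str) -> None:
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    log_file = directory / f"session_{datetime.now():%Y%m%d}.txt"
    with log_file.open("a", encoding="utf-8") as f:
        f.write(f"[{datetime.now():%H:%M:%S}] USER: {user_input}\n")
        f.write(f"[{datetime.now():%H:%M:%S}] MODEL: {response}\n\n")

log_exchange("main_chain/logs_level_1", "Hello!", "Hello, how can I help?")
```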
- Output:
  - The logged text files are saved in directories within `main_chain/`.
## Primitive knowledge base

- Purpose: A primitive implementation of a knowledge base for testing and exploration.
- Contents:
  - Manual implementation of knowledge base functionality by explicitly chunking passages, calculating their embeddings, and storing them in a Python dictionary (a toy reconstruction follows this list).
  - Includes dummy passages for experimentation and prototyping.
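The idea fits in a few lines. The sketch below uses a deliberately crude bag-of-words stand-in for the embedding model and invented dummy passages; the directory's own code stores real embeddings.

```python
# Toy dictionary knowledge base with a crude bag-of-words "embedding"
# (stand-in for a real embedding model) and cosine-similarity lookup.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Dummy passages, chunked by hand and stored with their embeddings in a dict.
kb = {p: embed(p) for p in [
    "FAISS indexes embeddings for fast similarity search.",
    "Recursive splitting chunks documents into overlapping pieces.",
]}

query = embed("How are documents chunked?")
best = max(kb, key=lambda p: cosine(kb[p], query))
print(best)  # retrieves the passage about recursive splitting
```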
## Requirements

- Python 3.8 or later
- Required Python packages are listed in `requirements.txt`.
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/soljt/research-paper-digital-twin.git
  cd research-paper-digital-twin
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Usage

All scripts should be executed from the root directory of the repository to ensure file paths are correctly resolved. For example:

```bash
python main_chain/chat_history.py
```
- Create Knowledge Base:
  - Place papers in `kb/pdfs/`.
  - Use the provided scripts in `kb/` to create the knowledge base (see `kb/` above).
- Interact with the digital twin:
  - Use the scripts in `main_chain/` to interact via the console.