This repository provides tools and scripts for creating and interacting with a retrieval-augmented generation (RAG) pipeline using LangChain. It includes components for processing papers into a searchable knowledge base, maintaining dialogue history for multi-turn conversations, and logging user sessions.

## kb/

- Purpose: Contains scripts for processing research papers and creating a FAISS (Facebook AI Similarity Search) index for RAG.
- Usage:
  - Place the papers you want to use for RAG in the `pdfs/` directory.
  - If you would like to create your RAG knowledge base from GPT-summarized versions of the papers (recommended!):
    - Ensure your OpenAI API key is set as an environment variable.
    - Run `paper_processing.py`, which will generate a summary `.txt` file under a new `paper_processing_output/` directory.
  - If you would like to create your RAG knowledge base from the plain PDF documents, move on to the next step.
  - Run `vector_store.py` with the `from_pdf` variable set according to your decision (a minimal sketch of this step appears after this list). You may alter the metadata to be stored with each chunk around line 70; currently the script fetches each paper's title from the web and stores it as metadata, to be provided to the user when interacting with the model via `main_chain/chat_history.py`.
  - The output FAISS index will also be stored in this directory.
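For orientation, the core of the indexing step looks roughly like the sketch below, assuming LangChain's community loaders and an OpenAI embedding model. The `from_pdf` flag's semantics, the paths, and the chunk sizes are placeholders; defer to `vector_store.py` for the real configuration and metadata handling.

```python
# Minimal sketch of building a FAISS index with LangChain (assumed setup;
# see vector_store.py for the actual configuration and metadata handling).
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

from_pdf = True  # mirrors the script's from_pdf flag (assumed semantics)

# Load either the raw PDFs or the GPT summaries from paper_processing.py.
src_dir = Path("kb/pdfs") if from_pdf else Path("kb/paper_processing_output")
docs = []
for path in sorted(src_dir.iterdir()):
    loader = PyPDFLoader(str(path)) if from_pdf else TextLoader(str(path))
    docs.extend(loader.load())

# Recursive splitting keeps chunks near a target size; sizes here are guesses.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed the chunks and persist the index for main_chain/chat_history.py.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())
index.save_local("kb/faiss_index")
```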
- Key Scripts:
  - `paper_processing.py`: Queries GPT via API calls to summarize all papers in `pdfs/` and stores the summaries under `paper_processing_output/`.
  - `vector_store.py`: Chunks the input files (PDF documents or text summaries) via recursive splitting and stores the chunks alongside their embeddings in an index. You may specify the name of the index, but be sure to give the correct name to the scripts in `main_chain/`. Indexes are stored in this directory, `kb/`.
  - `get_titles.py`: Finds paper titles on the web using their SSRN IDs, which is useful for adding additional metadata to the RAG knowledge base. The web-based approach is prone to rate limiting (response code 429), so calling this function over many papers can take a long time (one way to cope with this is sketched after this list). Consider compiling the necessary metadata locally and modifying the creation of the knowledge base in `vector_store.py` to save time.
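A rate-limit-tolerant lookup might look like the following. The URL pattern and `<title>`-tag parsing are assumptions; see `get_titles.py` for the real implementation.

```python
# Sketch of a title lookup by SSRN ID with backoff on 429 responses
# (assumed URL pattern and parsing; see get_titles.py for the real code).
import re
import time
from typing import Optional

import requests

def fetch_title(ssrn_id: str, max_retries: int = 5) -> Optional[str]:
    url = f"https://papers.ssrn.com/sol3/papers.cfm?abstract_id={ssrn_id}"
    for attempt in range(max_retries):
        resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        if resp.status_code == 429:
            time.sleep(2 ** attempt)  # exponential backoff on rate limiting
            continue
        resp.raise_for_status()
        match = re.search(r"<title>(.*?)</title>", resp.text, re.DOTALL)
        return match.group(1).strip() if match else None
    return None
```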
- Output:
  - `.txt` summaries from `paper_processing.py`, saved to `paper_processing_output/`.
  - The `faiss_index/` directory holds the knowledge base generated by `vector_store.py` (a load-and-query sketch follows).
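To sanity-check the resulting index, it can be loaded back and queried. This sketch assumes a recent LangChain version (hence the `allow_dangerous_deserialization` flag) and the default index location; the query string is purely illustrative.

```python
# Sketch: load the saved index and run a similarity search (assumed path).
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

index = FAISS.load_local(
    "kb/faiss_index",
    OpenAIEmbeddings(),
    allow_dangerous_deserialization=True,  # required by recent LangChain versions
)
for doc in index.similarity_search("What is this paper's main finding?", k=3):
    print(doc.metadata.get("title"), "->", doc.page_content[:80])
```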
## main_chain/

- Purpose: Handles multi-turn dialogue functionality and logging user interactions with the chatbot.
- Usage:
  - In `chat_history.py`, specify the LLM instances to use for chat-history summarization/input relevance assessment as well as model responses. `llm` should be a stronger model (recommended: Llama 70B) than `llm2` (recommended: Llama 8B). If you have enough compute to quickly generate responses, both `llm` and `llm2` may use the same (strong) instance.
  - Specify the path to the knowledge base created using `vector_store.py`.
  - Running the script will prompt the user to select the level of restrictiveness of model output (a schematic of this flow appears after this list):
    - 0: The model is unrestricted and will only refuse to answer questions that relate to privacy concerns, criminal behaviour, or hateful speech.
    - 1: Queries to the model are passed through a 'discriminator' that decides whether the question would be relevant to a professor taking questions from an audience. If not, the main model will politely refuse to answer.
    - 2: The model will only respond to queries that produce a hit in its knowledge base.
  - Model output will be logged to text files in directories determined by the selected restrictiveness.
  - Debug printing is ON; the console logs the list of papers the model used to synthesize its response. If restrictiveness level 1 is selected, the DISCRIMINATOR output will be displayed. If restrictiveness level 2 is selected, 'NO HITS' will be printed if the model finds nothing in the knowledge base.
  - The variable containing the model output (to be read aloud by a speech model) is `response` in the `call_model` function.
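For a rough picture of how the pieces fit together at restrictiveness level 1, here is a schematic with `ChatOpenAI` standing in for the recommended Llama instances; the prompts are invented, and the real ones (along with the actual `call_model`) live in `chat_history.py`.

```python
# Schematic of the two-model flow at restrictiveness level 1. ChatOpenAI
# stands in for the recommended Llama instances; prompts are invented here
# and differ from the real ones in chat_history.py.
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")        # stand-in for the strong model (e.g. Llama 70B)
llm2 = ChatOpenAI(model="gpt-4o-mini")  # stand-in for the weaker model (e.g. Llama 8B)

history = [SystemMessage(content="You are a professor answering audience questions.")]

def call_model(user_input: str) -> str:
    # Discriminator step: llm2 decides whether the question is in scope.
    verdict = llm2.invoke(
        "Answer YES or NO: would this question be relevant to a professor "
        "taking questions from an audience?\n\n" + user_input
    ).content
    if verdict.strip().upper().startswith("NO"):
        return "I'm sorry, I'd rather stick to questions about my research."
    history.append(HumanMessage(content=user_input))
    response = llm.invoke(history).content  # `response` feeds the speech model
    history.append(AIMessage(content=response))
    return response

print(call_model("What does your research say about voter turnout?"))
```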
- Key Scripts:
  - `chat_history.py`: Supports multi-turn dialogue by maintaining the chat history context for interactions with the RAG-powered LLM. The LLM prompts are found here and can be altered to user specification. Debug print statements may be altered or deleted.
  - `chat_logging.py`: Functionality for logging user sessions to text files for later reference (a toy version follows this list).
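The logging itself amounts to little more than appending timestamped lines to a session file. Below is a toy version; the directory layout and the `log_exchange` helper are invented for illustration and are not the script's real API.

```python
# Toy session logger: appends each exchange to a timestamped file
# (invented layout; see chat_logging.py for the real behaviour).
from datetime import datetime
from pathlib import Path

def log_exchange(log_dir: str, user_input: str, response: str) -> None:
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    log_file = directory / f"session_{datetime.now():%Y%m%d}.txt"
    with log_file.open("a", encoding="utf-8") as f:
        f.write(f"[{datetime.now():%H:%M:%S}] USER: {user_input}\n")
        f.write(f"[{datetime.now():%H:%M:%S}] MODEL: {response}\n\n")

log_exchange("main_chain/logs_level_1", "Hello!", "Hello, how can I help?")
```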
- Output:
  - The logged text files are saved in directories within `main_chain/`.
## Primitive knowledge base

- Purpose: A primitive implementation of a knowledge base for testing and exploration.
- Contents:
  - Manual implementation of knowledge base functionality by explicitly chunking passages, calculating their embeddings, and storing them in a Python dictionary (a toy reconstruction follows this list).
  - Includes dummy passages for experimentation and prototyping.
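The idea fits in a few lines. The sketch below uses a deliberately crude bag-of-words stand-in for the embedding model and invented dummy passages; the directory's own code stores real embeddings.

```python
# Toy dictionary knowledge base with a crude bag-of-words "embedding"
# (stand-in for a real embedding model) and cosine-similarity lookup.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Dummy passages, chunked by hand and stored with their embeddings in a dict.
kb = {p: embed(p) for p in [
    "FAISS indexes embeddings for fast similarity search.",
    "Recursive splitting chunks documents into overlapping pieces.",
]}

query = embed("How are documents chunked?")
best = max(kb, key=lambda p: cosine(kb[p], query))
print(best)  # retrieves the passage about recursive splitting
```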
## Requirements

- Python 3.8 or later
- Required Python packages are listed in `requirements.txt`.
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/soljt/research-paper-digital-twin.git
  cd research-paper-digital-twin
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Usage

All scripts should be executed from the root directory of the repository to ensure file paths are correctly resolved. For example:

```bash
python main_chain/chat_history.py
```
- Create Knowledge Base:
  - Place papers in `kb/pdfs/`.
  - Use the provided scripts in `kb/` to create the knowledge base (see `kb/` above).
- Interact with the digital twin:
  - Use the scripts in `main_chain/` to interact via the console.