Skip to content

RAG Chatbot deployed to GCP with gpt-3.5 and Mistral-7B language models and gte-base-en-v1.5 with FAISS for vector search.

Notifications You must be signed in to change notification settings

poludmik/rag-poc-deployment

Repository files navigation

Financial Report Q&A Chatbot (RAG)

🎯 Project Goals

The primary objective of this project is to develop a generative Q&A chatbot capable of answering questions related to financial reports of different companies. The chatbot leverages language models and is grounded in a knowledge base constructed from PDF files of financial reports. The app enables users to upload new PDFs, ask questions, and receive answers produced by LLMs along with the relevant documents.

✨ Features

  • Knowledge Base: Uses PDF files from financial reports to build a robust knowledge base (new PDFs can be uploaded through the UI).
  • Grounding Techniques: Ensures minimal hallucination by grounding responses in retrieved documents with FAISS indexes and gte-base-en-v1.5 embeddings.
  • Cloud Deployment: Utilizes cloud-based SaaS offerings for scalable and robust performance.
  • Dual Model Backend: Integrates both GPT-3.5-Turbo from OpenAI and open-source Mistral-7B-Instruct-v0.2-AWQ for model inference.

🔧 Technologies Used

  • Backend: FastAPI for serving the backend API.
  • Frontend: Streamlit for an interactive user interface.
  • Cloud Services: Google Cloud Platform (GCP) for continuous deployment and storage.
  • Language Models: GPT-3.5-Turbo and Mistral-7B-Instruct-v0.2-AWQ for inference and gte-base-en-v1.5 for embeddings.

📁 Project Structure

  • backend.py: Main backend server handling API requests.
  • gpu_inference_server.py: Server running the Mistral LLM for GPU-based inference.
  • frontend.py: Streamlit-based frontend application.
  • utils.py: Utility functions for embeddings and OpenAI API calls.
  • cloudbuild.yaml: Configuration for building and deploying Docker images on GCP automatically.
  • dockerfiles/: Dockerfiles for backend and frontend containers.
  • scripts/: Example scripts for managing GCP instances and starting the LLM server.

Requirements and technologies used

Requirements are separate for backend, frontend and llm inference server to minimize the container memory usages. They are stored in the requirements folder. The main technologies that were used are:

  • Docker
  • Google Cloud Platform (storage, build/run, compute engine, secret manager, artifact registry, etc.)
  • Python 3.10
  • Streamlit
  • FastAPI
  • Huggingface transformers
  • OpenAI API

💻 Backend Server

The backend server is built using FastAPI and handles the main logic for processing questions, retrieving documents, and calling the appropriate LLM for inference.

Key Functions

  • get_answer: Retrieves documents from the index on GC Bucket, combines them, and generates an answer using the selected LLM.
  • combine_docs: Combines retrieved documents based on different strategies for improved context. Either the best vector similarity, top-k similarities docs, or Small-to-Big doc expansion. For the "1" and "small-to-big" ways of combining, the get_answer function will also return the page number of the document, which is also going to be displayed in the frontend.
  • get_index: Retrieves the index for a given filename from the Google Cloud Storage bucket.

Supported endpoints

  • answer: POST request to /answer/ with a JSON payload containing the filename, question, and model to use for inference.
@app.post("/answer/")
async def answer(request: QuestionRequest):
    try:
        answer, retrieved_docs = get_answer(request.filename, request.question, request.model)
        return {"filename": request.filename, "question": request.question, "answer": answer, "combined_docs": retrieved_docs}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
  • list_pdfs: GET request to /list_pdfs/ to list all uploaded PDF files that are currently in the GC bucket.
  • create_and_upload: POST request to /create_and_upload/ to upload a new PDF file for indexing and querying.

⚡ GPU Inference Server

The GPU inference server is used to run the Mistral-7B-Instruct-v0.2-AWQ model for generating responses. It utilizes vllm for efficient model management and inference. Since the model is large and requires a GPU for inference, a separate server is used to offload the computation from the main backend server. This also allows for better resource management and scalability (e.g. when multiple LLM servers could be running).

Starting the Server

The following simple command is used to start the GPU inference server inside the Google Cloud Compute Engine instance. The server is started in the background and continues to run even when the SSH session is closed.

nohup uvicorn --host 0.0.0.0 --port 8000 gpu_inference_server:app &

Supported endpoints

  • answer: POST request to /answer/ with a JSON payload containing the question to ask the Mistral model.

🎨 Frontend

The frontend is built with Streamlit and provides an interactive interface for users to upload and select PDFs, ask questions, and view answers.

User Options

  • Select Existing PDF: Choose from already uploaded PDFs.
  • Upload New PDF: Upload a new PDF file for indexing and querying.
  • Select LLM Model: Choose between GPT-3.5-Turbo and Mistral-7B-Instruct-v0.2-AWQ for inference.
  • Ask Questions: Input questions related to the selected PDF and get responses.
if question := st.chat_input("What is up?"):
    with st.chat_message("user"):
        st.write(question)
    with st.spinner('Please wait...'):
        response = requests.post(answer_url, json.dumps({
            'question': question,
            'filename': st.session_state.current_file_name,
            'model': "mistral" if on else "gpt"
        }))
        st.write(response.json()["answer"])

☁️ Cloud Build Configuration

The cloudbuild.yaml file automates the process of building Docker images and deploying them to Google Cloud Run. The backend and frontend are deployed as separate services. The configuration file specifies the steps to build the Docker images, tag them with the appropriate version, and deploy them to Cloud Run. The Cloud Build trigger for this build is set to automatically deploy the latest changes to the main branch.

Demo Github Actions Workflow

A dummy test is set in the .github/workflows folder. The workflow is triggered on push or pull request to the main branch. The workflow runs a test that always passes. This is just a placeholder for the actual tests that could be implemented in the future.

Space for improvements

This is a PoC project and there are many areas that could be improved. Some of them are:

  • Table parsing for financial reports. This could be used to extract more structured information from the reports. Packages like camelot or tabula could be used for this.
  • Reranking models for the retrieved documents. This could be used to improve the quality of the retrieved documents.
  • More advanced document preprocessing. Techniques like filtering could be used to improve the quality of the retrieved documents.
  • Other vector storage techniques, vector databases.

About

RAG Chatbot deployed to GCP with gpt-3.5 and Mistral-7B language models and gte-base-en-v1.5 with FAISS for vector search.

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published