localGPT-Vision is an end-to-end vision-based Retrieval-Augmented Generation (RAG) system. It allows users to upload and index documents (PDFs and images), ask questions about the content, and receive responses along with relevant document snippets. The retrieval is performed using the Colqwen or ColPali models, and the retrieved pages are passed to a Vision Language Model (VLM) for generating responses. Currently, the code supports these VLMs:
- Qwen2-VL-7B-Instruct
- LLAMA-3.2-11B-Vision
- Pixtral-12B-2409
- Molmo-7B-O-0924
- Google Gemini
- OpenAI GPT-4o
The project is built on top of the Byaldi library.
- Features
- Architecture
- Prerequisites
- Installation
- Usage
- Project Structure
- System Workflow
- Contributing
- End-to-End Vision-Based RAG: Combines visual document retrieval with language models for comprehensive answers.
- Document Upload and Indexing: Upload PDFs and images, which are then indexed using ColPali for retrieval.
- Chat Interface: Engage in a conversational interface to ask questions about the uploaded documents.
- Session Management: Create, rename, switch between, and delete chat sessions.
- Model Selection: Choose between different Vision Language Models (Qwen2-VL-7B-Instruct, Google Gemini, OpenAI GPT-4 etc).
- Persistent Indexes: Indexes are saved on disk and loaded upon application restart.
localGPT-Vision is built as an end-to-end vision-based RAG system. T he architecture comprises two main components:
-
Visual Document Retrieval with Colqwen and ColPali:
- Colqwen and ColPali are Vision Encoders designed for efficient document retrieval solely using the image representation of document pages.
- It embeds page images directly, leveraging visual cues like layout, fonts, figures, and tables without relying on OCR or text extraction.
- During indexing, document pages are converted into image embeddings and stored.
- During querying, the user query is matched against these embeddings to retrieve the most relevant document pages.
-
Response Generation with Vision Language Models:
- The retrieved document images are passed to a Vision Language Model (VLM).
- Supported models include Qwen2-VL-7B-Instruct, LLAMA3.2, Pixtral, Molmo, Google Gemini, and OpenAI GPT-4.
- These models generate responses by understanding both the visual and textual content of the documents.
- NOTE: The quality of the responses is highly dependent on the VLM used and the resolution of the document images.
This architecture eliminates the need for complex text extraction pipelines and provides a more holistic understanding of documents by considering their visual elements. You don't need any chunking strategies or selection of embeddings model or retrieval strategy used in traditional RAG systems.
- Anaconda or Miniconda installed on your system
- Python 3.10 or higher
- Git (optional, for cloning the repository)
Follow these steps to set up and run the application on your local machine.
-
Clone the Repository
git clone https://github.com/PromtEngineer/localGPT-Vision.git cd localGPT-Vision
-
Create a Conda Environment
conda create -n localgpt-vision python=3.10 conda activate localgpt-vision
3a. Install Dependencies
pip install -r requirements.txt
3b. Install Transformers from HuggingFace - Dev version
pip uninstall transformers
pip install git+https://github.com/huggingface/transformers
-
Set Environment Variables Set your API keys for Google Gemini and OpenAI GPT-4:
export GENAI_API_KEY='your_genai_api_key' export OPENAI_API_KEY='your_openai_api_key' export GROQ_API_KEY='your_groq_api_key'
On Windows Command Prompt:
set GENAI_API_KEY=your_genai_api_key set OPENAI_API_KEY=your_openai_api_key set GROQ_API_KEY='your_groq_api_key'
-
Run the Application
python app.py
-
Access the Application Open your web browser and navigate to:
http://localhost:5000/
- Click on "New Chat" to start a new session.
- Under "Upload and Index Documents", click "Choose Files" and select your PDF or image files.
- Click "Upload and Index". The documents will be indexed using ColPali and ready for querying.
- In the "Enter your question here" textbox, type your query related to the uploaded documents.
- Click "Send". The system will retrieve relevant document pages and generate a response using the selected Vision Language Model.
- Rename Session: Click "Edit Name", enter a new name, and click "Save Name".
- Switch Sessions: Click on a session name in the sidebar to switch to that session.
- Delete Session: Click "Delete" next to a session to remove it permanently.
- Click on "Settings" in the navigation bar.
- Select the desired language model and image dimensions.
- Click "Save Settings".
localGPT-Vision/
├── app.py
├── logger.py
├── models/
│ ├── indexer.py
│ ├── retriever.py
│ ├── responder.py
│ ├── model_loader.py
│ └── converters.py
├── sessions/
├── templates/
│ ├── base.html
│ ├── chat.html
│ ├── settings.html
│ └── index.html
├── static/
│ ├── css/
│ │ └── style.css
│ ├── js/
│ │ └── script.js
│ └── images/
├── uploaded_documents/
├── byaldi_indices/
├── requirements.txt
├── .gitignore
└── README.md
app.py
: Main Flask application.logger.py
: Configures application logging.models/
: Contains modules for indexing, retrieving, and responding.templates/
: HTML templates for rendering views.static/
: Static files like CSS and JavaScript.sessions/
: Stores session data.uploaded_documents/
: Stores uploaded documents..byaldi/
: Stores the indexes created by Byaldi.requirements.txt
: Python dependencies..gitignore
: Files and directories to be ignored by Git.README.md
: Project documentation.
- User Interaction: The user interacts with the web interface to upload documents and ask questions.
- Document Indexing with ColPali:
- Uploaded documents are converted to PDFs if necessary.
- Documents are indexed using ColPali, which creates embeddings based on the visual content of the document pages.
- The indexes are stored in the byaldi_indices/ directory.
- Session Management:
- Each chat session has a unique ID and stores its own index and chat history.
- Sessions are saved on disk and loaded upon application restart.
- Query Processing:
- User queries are sent to the backend.
- The query is embedded and matched against the visual embeddings of document pages to retrieve relevant pages.
- Response Generation with Vision Language Models:
- The retrieved document images and the user query are passed to the selected Vision Language Model (Qwen, Gemini, or GPT-4).
- The VLM generates a response by understanding both the visual and textual content of the documents.
- Display Results:
- The response and relevant document snippets are displayed in the chat interface.
graph TD
A[User] -->|Uploads Documents| B(Flask App)
B -->|Saves Files| C[uploaded_documents/]
B -->|Converts and Indexes with ColPali| D[Indexing Module]
D -->|Creates Visual Embeddings| E[byaldi_indices/]
A -->|Asks Question| B
B -->|Embeds Query and Retrieves Pages| F[Retrieval Module]
F -->|Retrieves Relevant Pages| E
F -->|Passes Pages to| G[Vision Language Model]
G -->|Generates Response| B
B -->|Displays Response| A
B -->|Saves Session Data| H[sessions/]
subgraph Backend
B
D
F
G
end
subgraph Storage
C
E
H
end
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch for your feature:
git checkout -b feature-name
. - Commit your changes:
git commit -am 'Add new feature'
. - Push to the branch:
git push origin feature-name
. - Submit a pull request.