RAG Document Parser

Project Description

RAG Document Parser is a modular system for parsing, chunking, and searching documents using Retrieval-Augmented Generation (RAG) techniques. It supports PDF ingestion, metadata extraction, chunking, and semantic search via Pinecone vector database, and provides a Flask web interface for user queries and feedback.

Demo

Grafana Dashboard

MLFlow Dashboard

Installation

Install dependencies with:

pip install -r requirements.txt

Environment Variables

Required variables:

PINECONE_API_KEY: Your Pinecone API key for vector storage and semantic search
GEMINI_API_KEY: Your Google Gemini API key for AI-powered answer generation

Create a .env file in the project root:

PINECONE_API_KEY=your_pinecone_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here

Running the Web Interface

Start Flask:

FLASK_APP=src/web/app.py flask run

Navigate to http://localhost:5000.

Web Interface Features

Semantic Search: Search documents using dense and sparse vector embeddings (Pinecone).
Model Selection: Choose dense/sparse embedding models and reranker model via dropdowns.
Search Results: View retrieved document chunks with scores and metadata.
Reranking: Rerank results with different models, including Pinecone's managed rerankers.
User Feedback: Provide relevance feedback per result; feedback is logged for evaluation.
AI Answer Generation: Use Google Gemini for context-based Q&A.

Usage Example

from rag_document_parser import DocumentParser

parser = DocumentParser()
result = parser.parse("path/to/document.pdf")
print(result)

(Note: Actual usage follows ingestion and search pipeline in src/scripts/ingest_documents.py and the Flask app in src/web/app.py.)

Extending the Project

Add new embedding or reranker models via config/options.
Integrate with additional vector stores or classical indices (e.g., Elasticsearch).
Add more advanced feedback evaluation or continuous learning.

Architecture Decisions

See docs/adr/README.md for Architecture Decision Records.

Evaluation and Experiment Tracking

User feedback is logged and aggregated for precision/recall metrics.
MLflow is used for experiment tracking and metric visualization.
Model selection and evaluation are reflected in retrieval quality.

License

MIT License (see LICENSE).

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
dashboards/grafana		dashboards/grafana
docs/adr		docs/adr
src		src
.gitignore		.gitignore
README.md		README.md
dev-requirements.txt		dev-requirements.txt
otel-config.yaml		otel-config.yaml
prometheus.yml		prometheus.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RAG Document Parser

Project Description

Demo

Grafana Dashboard

MLFlow Dashboard

Installation

Environment Variables

Running the Web Interface

Web Interface Features

Usage Example

Extending the Project

Architecture Decisions

Evaluation and Experiment Tracking

License

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

jaganraajan/rag-document-parser

Folders and files

Latest commit

History

Repository files navigation

RAG Document Parser

Project Description

Demo

Grafana Dashboard

MLFlow Dashboard

Installation

Environment Variables

Running the Web Interface

Web Interface Features

Usage Example

Extending the Project

Architecture Decisions

Evaluation and Experiment Tracking

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages