The data is scraped from: https://www.valmikiramayan.net/
The script extracts shlokas and their corresponding English translations for all Kandas (Books) available on the site:
- Bala Kanda
- Ayodhya Kanda
- Aranya Kanda
- Kishkindha Kanda
- Sundara Kanda
- Yuddha Kanda
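The scraping script itself is not reproduced here; below is a minimal, illustrative sketch of fetching and parsing one page with `requests` and `BeautifulSoup`. The sarga URL pattern and the assumption that verses and translations sit in plain `<p>` tags are guesses about the site's layout, not verified selectors.

```python
# Illustrative scraping sketch only; the actual script, URL pattern, and
# HTML structure of valmikiramayan.net are assumptions here.
import requests
from bs4 import BeautifulSoup

BASE = "https://www.valmikiramayan.net/"                 # site root
page_url = BASE + "utf8/baala/sarga1/bala_1_frame.htm"   # hypothetical sarga page

resp = requests.get(page_url, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# Collect paragraph text; real pages interleave Sanskrit shlokas and English
# translations, so downstream code would need to pair them by position or class.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
for text in paragraphs[:5]:
    print(text)
```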
- Sentence Embeddings: `sentence-transformers` (specifically `all-MiniLM-L6-v2`) is used to convert text (verses and statements) into dense vector representations. This allows for semantic similarity comparisons, crucial for retrieving relevant context.
- Dimensionality Reduction: Principal Component Analysis (PCA) from `scikit-learn` is applied to the high-dimensional sentence embeddings. This can help with visualizing embeddings and improving the efficiency of similarity searches; in this notebook, the similarity search for context retrieval is performed on the PCA-transformed embeddings (see the retrieval sketch after this list).
- Large Language Models (LLMs) & Hugging Face Pipeline:
  - The project utilizes `google/flan-t5-large` from the Hugging Face Hub as the LLM for statement verification.
  - The `transformers` library's `pipeline` (specifically for `text2text-generation`) provides a high-level API to easily perform inference with this LLM.
  - `langchain-huggingface`'s `HuggingFacePipeline` is used to integrate this Hugging Face pipeline into a Langchain workflow, simplifying the process of sending prompts (which include the statement and retrieved context) to the LLM and getting back its TRUE/FALSE/NONE determination.
  - `bitsandbytes` is configured for 8-bit quantization to load the LLM more efficiently, reducing memory footprint while aiming to maintain performance (see the verification sketch after this list).
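As a concrete illustration of the embedding-and-retrieval flow described above, here is a minimal sketch. The file name, column name, and PCA component count are assumptions, not the notebook's exact values.

```python
# Minimal retrieval sketch: encode verses, reduce with PCA, and fetch the
# most similar verses for a statement. Names marked "assumed" are illustrative.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("ramayana_verses.csv")                      # assumed dataset path
verses = df["English_translation"].astype(str).tolist()      # assumed column name

# 1. Encode verses into dense vectors with all-MiniLM-L6-v2 (384 dimensions)
model = SentenceTransformer("all-MiniLM-L6-v2")
verse_embeddings = model.encode(verses, show_progress_bar=True)

# 2. Reduce dimensionality with PCA; the search runs in this reduced space
pca = PCA(n_components=50)                                   # component count is an assumption
verse_reduced = pca.fit_transform(verse_embeddings)

# 3. Retrieve the top-k most similar verses for a statement
def retrieve_context(statement: str, k: int = 5) -> list[str]:
    query = pca.transform(model.encode([statement]))
    scores = cosine_similarity(query, verse_reduced)[0]
    top_idx = np.argsort(scores)[::-1][:k]
    return [verses[i] for i in top_idx]

print(retrieve_context("Hanuman leaps across the ocean to Lanka."))
```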
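And a minimal sketch of the verification side: loading `google/flan-t5-large` in 8-bit via `bitsandbytes`, wrapping it in a `text2text-generation` pipeline, and exposing it to Langchain through `HuggingFacePipeline`. The prompt wording and example context below are assumptions, not the project's exact template.

```python
# Minimal verification sketch: 8-bit flan-t5-large behind a Langchain wrapper.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain_huggingface import HuggingFacePipeline

model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    device_map="auto",                                           # place on GPU if available
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),   # 8-bit via bitsandbytes
)

text2text = pipeline("text2text-generation", model=model, tokenizer=tokenizer, max_new_tokens=16)
llm = HuggingFacePipeline(pipeline=text2text)

# The context would normally come from the similarity search sketched above.
context = "Hanuman crossed the ocean and reached the city of Lanka."
prompt = (
    f"Context:\n{context}\n\n"
    "Statement: Hanuman leaps across the ocean to Lanka.\n"
    "Based only on the context, answer TRUE, FALSE, or NONE."
)
print(llm.invoke(prompt))  # the model is prompted to answer TRUE, FALSE, or NONE
```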
- Python: Version 3.9 to 3.12. You can download it from python.org.
  (The notebook `version_2.ipynb` was developed with Python 3.11.)
- CUDA: If you intend to use a GPU (recommended for LLM inference), ensure you have NVIDIA drivers and the CUDA Toolkit 11.8 installed. The PyTorch installation is configured for this version (a quick sanity check is shown below).
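After PyTorch has been installed (see the setup steps below), you can run a quick check that the CUDA build sees the GPU; this is only a sanity check, not a required step.

```python
# Quick sanity check that the CUDA-enabled PyTorch build detects the GPU
import torch

print(torch.__version__)           # e.g. a +cu118 build
print(torch.version.cuda)          # CUDA version PyTorch was built against
print(torch.cuda.is_available())   # True if a usable GPU is detected
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```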
If you haven't already, clone the project repository to your local machine.
```bash
git clone https://github.com/parth1609/Ramayana.git
```

```bash
# Create a virtual environment (e.g., named 'venv')
python -m venv venv

# Activate the virtual environment
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
# source venv/bin/activate
```

This project uses `uv` for faster package installation, but `pip` can also be used.
```bash
# Install uv (if not already installed)
pip install uv

# Install core dependencies using uv
uv pip install pandas numpy matplotlib plotly nbformat nltk transformers scikit-learn spacy sentence-transformers accelerate bitsandbytes langchain-huggingface ipython
```

Install PyTorch with CUDA 11.8 support (see https://pytorch.org/get-started/locally/ for other configurations):

```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

The notebook requires specific NLTK data. You can download it by running the following Python code (or these steps will be attempted automatically when you run the notebook):
```python
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
```

- Test Suite: Includes example and batch tests for pipeline evaluation.
- Upload Any CSV Dataset: Upload your own CSV file with Ramayana verses or any text data
- Flexible Column Selection: Choose any column from your CSV as the text source
- Dynamic Processing: Embeddings and models adapt to your uploaded data
- Local AI Models: No API keys required - everything runs locally
- Single Statement Verification: Get TRUE/FALSE/NONE answers with retrieved context
The frontend Streamlit app lives in `app.py` with a modular backend under the `ramayana/` package (a hypothetical composition sketch follows the list):

- `ramayana/constants.py` - defaults and labels
- `ramayana/data.py` - dataset loading and cleaning
- `ramayana/embeddings.py` - sentence embeddings and PCA
- `ramayana/retrieval.py` - similarity search to fetch contexts
- `ramayana/prompts.py` - prompt templates and rendering
- `ramayana/llm.py` - LLM device/quantization and pipeline loading
- `ramayana/verification.py` - label parsing and verification
- `ramayana/types.py` - typed containers for results
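To show how these pieces might fit together, here is a hypothetical composition sketch; every function name below (`load_dataset`, `embed_texts`, `top_k`, etc.) is an illustrative placeholder, not the package's actual API.

```python
# Hypothetical wiring of the ramayana/ backend; all function names are
# placeholders chosen for illustration, not the package's real interfaces.
from ramayana import data, embeddings, retrieval, prompts, llm, verification

df = data.load_dataset("verses.csv")                          # load and clean the CSV
vectors = embeddings.embed_texts(df["English_translation"])   # sentence embeddings + PCA
pipe = llm.load_pipeline("google/flan-t5-large")              # device/quantization handled here

statement = "Hanuman leaps across the ocean to Lanka."
contexts = retrieval.top_k(statement, vectors, df, k=5)       # similarity search for context
prompt = prompts.render(statement, contexts)                  # fill the prompt template
label = verification.parse_label(pipe(prompt))                # map output to TRUE/FALSE/NONE
print(label)
```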
- Activate your virtual environment
- Install dependencies: `pip install -r requirements.txt`
- Start the app: `streamlit run app.py`
- Upload your CSV: Use the file uploader in the sidebar to upload any CSV file
- Select text column: Choose which column contains your text data (e.g., "English_translation", "Verse", "Text", etc.)
- Configure models: Optionally change the Sentence-Transformer or LLM model
- Verify statements: Enter statements in the main area and get AI-powered verification (a simplified sketch of this flow follows below)
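For orientation, here is a pared-down sketch of this upload, column-select, and verify flow in Streamlit. It is a simplified stand-in, not the project's actual `app.py`, and `verify_statement` is a hypothetical placeholder for the retrieval-plus-LLM backend.

```python
# Simplified sketch of the upload -> select column -> verify flow.
import pandas as pd
import streamlit as st

def verify_statement(statement: str, texts: list[str]) -> str:
    # Placeholder: the real backend retrieves context and queries the LLM.
    return "NONE"

st.sidebar.header("Dataset")
uploaded = st.sidebar.file_uploader("Upload a CSV file", type="csv")

if uploaded is not None:
    df = pd.read_csv(uploaded)
    column = st.sidebar.selectbox("Text column", df.columns)   # flexible column selection
    statement = st.text_area("Statement to verify")
    if st.button("Verify") and statement:
        label = verify_statement(statement, df[column].astype(str).tolist())
        st.write(f"Result: {label}")
```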
- Format: CSV file with at least one text column
- Content: Any text data (Ramayana verses, religious texts, stories, etc.)
- Flexibility: No fixed column names required - you choose the column
- Size: Works with datasets of any size (larger datasets take more time for initial processing)
- Embeddings: Auto-generated and cached per dataset using a content hash (see the caching sketch below)
- GPU Support: Automatic CUDA detection with 8-bit quantization when available
- CPU Fallback: Works entirely on CPU if no GPU available
- Caching: Smart caching system - new uploads trigger complete reprocessing
- No API Keys: All models run locally (Sentence-Transformers + Hugging Face models)
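As an illustration of the content-hash caching idea, here is a minimal sketch; the cache directory, hash choice (MD5 of the raw CSV bytes), and file naming are assumptions, not the app's exact implementation.

```python
# Sketch of content-hash caching for embeddings: hash the uploaded CSV bytes
# and reuse saved embeddings when the same file is uploaded again.
import hashlib
from pathlib import Path

import numpy as np

CACHE_DIR = Path(".embedding_cache")  # assumed cache location
CACHE_DIR.mkdir(exist_ok=True)

def get_or_compute_embeddings(csv_bytes: bytes, texts: list[str], encoder) -> np.ndarray:
    digest = hashlib.md5(csv_bytes).hexdigest()       # content hash of the upload
    cache_file = CACHE_DIR / f"{digest}.npy"
    if cache_file.exists():
        return np.load(cache_file)                    # cache hit: skip re-encoding
    vectors = np.asarray(encoder.encode(texts))       # cache miss: compute embeddings
    np.save(cache_file, vectors)
    return vectors
```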