
Ramayana-Data Text Analysis

1. Data Scraping & Preparation (scrape_data.ipynb)

Source

The data is scraped from: https://www.valmikiramayan.net/

Data Scraped

The script extracts shlokas and their corresponding English translations for all Kandas (Books) available on the site (an illustrative scraping sketch follows the list):

  1. Bala Kanda
  2. Ayodhya Kanda
  3. Aranya Kanda
  4. Kishkindha Kanda
  5. Sundara Kanda
  6. Yuddha Kanda
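
For reference, below is a minimal scraping sketch using requests and BeautifulSoup. The URL and CSS selectors are illustrative placeholders only; the real logic lives in scrape_data.ipynb, and the actual page structure on valmikiramayan.net may differ:

```python
# Hypothetical scraping sketch (see scrape_data.ipynb for the notebook's real logic).
# The selectors below are placeholders, not the site's actual markup.
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.valmikiramayan.net/"

def scrape_page(url: str) -> list[dict]:
    """Fetch one page and pair up shlokas with their English translations."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Placeholder selectors: adapt them to the site's real element classes.
    for verse, translation in zip(soup.select(".shloka"), soup.select(".translation")):
        rows.append({
            "Shloka": verse.get_text(" ", strip=True),
            "English_translation": translation.get_text(" ", strip=True),
        })
    return rows

# In practice, loop over every sarga URL of every Kanda, then persist the rows.
pd.DataFrame(scrape_page(BASE_URL)).to_csv("ramayana_shlokas.csv", index=False)
```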

Dataset

Ramayana Shloka Dataset

Core NLP Techniques Used

  1. Sentence Embeddings: The sentence-transformers library (specifically the all-MiniLM-L6-v2 model) is used to convert text (verses and statements) into dense vector representations. This allows for semantic similarity comparisons, which is crucial for retrieving relevant context.
  2. Dimensionality Reduction: Principal Component Analysis (PCA) from scikit-learn is applied to the high-dimensional sentence embeddings. This helps with visualizing the embeddings and can speed up similarity searches; in this notebook, the similarity search for context retrieval is performed on the PCA-transformed embeddings (see the retrieval sketch after this list).
  3. Large Language Models (LLMs) & Hugging Face Pipeline:
    • The project utilizes google/flan-t5-large from the Hugging Face Hub as the LLM for statement verification.
    • The transformers library's pipeline (specifically for text2text-generation) provides a high-level API to easily perform inference with this LLM.
    • langchain-huggingface's HuggingFacePipeline is used to integrate this Hugging Face pipeline into a LangChain workflow, simplifying the process of sending prompts (which include the statement and the retrieved context) to the LLM and getting back its TRUE/FALSE/NONE determination.
    • bitsandbytes is configured for 8-bit quantization to load the LLM more efficiently, reducing memory footprint while aiming to maintain performance.
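
To make the first two techniques concrete, here is a small retrieval sketch: encode the verses with all-MiniLM-L6-v2, reduce the embeddings with PCA, and rank verses against a statement by cosine similarity. The file name, column name, component count, and top-k value are assumptions, not the notebook's exact parameters:

```python
# Minimal retrieval sketch: sentence embeddings -> PCA -> cosine-similarity search.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("ramayana_shlokas.csv")                  # scraped dataset (assumed name)
texts = df["English_translation"].fillna("").tolist()

model = SentenceTransformer("all-MiniLM-L6-v2")           # sentence embedding model
embeddings = model.encode(texts, show_progress_bar=True)  # shape: (n_verses, 384)

pca = PCA(n_components=100)                               # dimensionality reduction
reduced = pca.fit_transform(embeddings)

def retrieve(statement: str, k: int = 3) -> pd.DataFrame:
    """Return the k verses most similar to the statement, compared in PCA space."""
    query = pca.transform(model.encode([statement]))
    scores = cosine_similarity(query, reduced)[0]
    top = np.argsort(scores)[::-1][:k]
    return df.iloc[top].assign(similarity=scores[top])

print(retrieve("Hanuman crossed the ocean to reach Lanka"))
```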

Prerequisites

  1. Python: Version 3.9 to 3.12. You can download it from python.org. (The notebook version_2.ipynb was developed with Python 3.11).
  2. CUDA: If you intend to use a GPU (recommended for LLM inference), ensure you have the NVIDIA drivers and CUDA Toolkit 11.8 installed; the PyTorch installation below is configured for this version. (A quick environment check follows this list.)
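
A quick way to confirm the environment matches these prerequisites is the small check below (it assumes PyTorch is already installed, which happens in step 3 of the setup):

```python
# Quick sanity check for the prerequisites above.
import sys
import torch

print("Python:", sys.version.split()[0])                  # expect 3.9 - 3.12
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compiled CUDA version:", torch.version.cuda)   # expect 11.8 for the cu118 wheels
```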

Setup Instructions

1. Clone the Repository (Optional)

If you haven't already, clone the project repository to your local machine.

git clone https://github.com/parth1609/Ramayana.git

2. Create and Activate Virtual Environment

# Create a virtual environment (e.g., named 'venv')
python -m venv venv

# Activate the virtual environment
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
# source venv/bin/activate

3. Install Python Packages

This project uses uv for faster package installation, but pip can also be used.

# Install uv (if not already installed)
pip install uv

# Install core dependencies using uv
uv pip install pandas numpy matplotlib plotly nbformat nltk transformers scikit-learn spacy sentence-transformers accelerate bitsandbytes langchain-huggingface ipython

# Install PyTorch with CUDA 11.8 support
# (see https://pytorch.org/get-started/locally/ for other configurations)
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

4. Download NLTK Resources

The notebook requires specific NLTK resources. You can download them by running the following Python code (the notebook will also attempt these downloads automatically when run):

import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
The notebook also includes a test suite with example and batch tests for pipeline evaluation.

2. Streamlit App (Frontend) and Modular Backend

Features

  • Upload Any CSV Dataset: Upload your own CSV file with Ramayana verses or any text data
  • Flexible Column Selection: Choose any column from your CSV as the text source
  • Dynamic Processing: Embeddings and models adapt to your uploaded data
  • Local AI Models: No API keys required - everything runs locally
  • Single Statement Verification: Get TRUE/FALSE/NONE answers with retrieved context

App Structure

The frontend Streamlit app lives in app.py with a modular backend under the ramayana/ package (a hypothetical wiring sketch follows the list):

  • ramayana/constants.py — defaults and labels
  • ramayana/data.py — dataset loading and cleaning
  • ramayana/embeddings.py — sentence embeddings and PCA
  • ramayana/retrieval.py — similarity search to fetch contexts
  • ramayana/prompts.py — prompt templates and rendering
  • ramayana/llm.py — LLM device/quantization and pipeline loading
  • ramayana/verification.py — label parsing and verification
  • ramayana/types.py — typed containers for results
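
The function names below are hypothetical placeholders (check each module for its real API), but the data flow — load, embed, retrieve, prompt, verify — follows the module descriptions above:

```python
# Hypothetical wiring of the modular backend inside a Streamlit app.
# Every ramayana.* function name here is a placeholder for illustration only.
import streamlit as st

from ramayana import data, embeddings, llm, prompts, retrieval, verification

uploaded = st.sidebar.file_uploader("Upload a CSV dataset", type="csv")
if uploaded is not None:
    df = data.load_dataset(uploaded)                        # placeholder: load and clean CSV
    column = st.sidebar.selectbox("Text column", df.columns)

    embs, pca = embeddings.build_embeddings(df[column])     # placeholder: embeddings + PCA
    pipe = llm.load_pipeline()                              # placeholder: quantized LLM pipeline

    statement = st.text_input("Statement to verify")
    if statement:
        contexts = retrieval.top_k(statement, embs, pca, df[column])  # placeholder: similarity search
        prompt = prompts.render(statement, contexts)                  # placeholder: prompt template
        label = verification.parse_label(pipe(prompt))                # placeholder: TRUE/FALSE/NONE
        st.write(f"Verdict: {label}")
        st.write(contexts)
```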

How to Use

  1. Activate your virtual environment
  2. Install dependencies:
    pip install -r requirements.txt
  3. Start the app:
    streamlit run app.py
  4. Upload your CSV: Use the file uploader in the sidebar to upload any CSV file
  5. Select text column: Choose which column contains your text data (e.g., "English_translation", "Verse", "Text", etc.)
  6. Configure models: Optionally change the Sentence-Transformer or LLM model
  7. Verify statements: Enter statements in the main area and get AI-powered verification

Dataset Requirements

  • Format: CSV file with at least one text column (see the example below)
  • Content: Any text data (Ramayana verses, religious texts, stories, etc.)
  • Flexibility: No fixed column names required - you choose the column
  • Size: Works with datasets of any size (larger datasets take more time for initial processing)
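
For example, a CSV as small as the following would work; the column names here are only examples, since you pick the text column yourself in the sidebar:

```csv
Kanda,English_translation
Bala Kanda,"<verse translation text>"
Sundara Kanda,"<verse translation text>"
```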

Technical Notes

  • Embeddings: Auto-generated and cached per dataset using a content hash
  • GPU Support: Automatic CUDA detection with 8-bit quantization when available (see the sketch after this list)
  • CPU Fallback: Works entirely on CPU if no GPU available
  • Caching: Smart caching system - new uploads trigger complete reprocessing
  • No API Keys: All models run locally (Sentence-Transformers + Hugging Face models)
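
As a rough illustration of the caching and quantization notes above (a sketch only — see ramayana/embeddings.py and ramayana/llm.py for the app's actual behavior), embeddings can be keyed by a hash of the text content, and the LLM loaded in 8-bit when a GPU is present:

```python
# Sketch of content-hash caching and 8-bit LLM loading (illustrative, not the app's code).
import hashlib
import os

import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig, pipeline

def cached_embeddings(texts: list[str], cache_dir: str = ".cache") -> np.ndarray:
    """Embed texts, reusing a cached file keyed by a hash of the content."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256("\n".join(texts).encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, f"{key}.npy")
    if os.path.exists(path):                 # same content -> same hash -> reuse cache
        return np.load(path)
    embs = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
    np.save(path, embs)                      # new upload -> new hash -> full reprocessing
    return embs

def load_llm(model_name: str = "google/flan-t5-large"):
    """Load the LLM in 8-bit on GPU when available, otherwise fall back to CPU."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if torch.cuda.is_available():
        model = AutoModelForSeq2SeqLM.from_pretrained(
            model_name,
            device_map="auto",
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        )
    else:
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return pipeline("text2text-generation", model=model, tokenizer=tokenizer)
```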
