arnav-kr/aadhaar-stats

Aadhaar Data Analysis

This repository contains scripts and data for analysing Aadhaar enrolment, biometric, and demographic data.

Setup

Prerequisites

  • Python 3.8+
  • UV (recommended)
  • Typst (for report compilation)

Tip

To install UV, follow the instructions in the official documentation. To install Typst, see typst.app.

Clone the repository

git clone https://github.com/arnav-kr/aadhaar-stats.git
cd aadhaar-stats

Sync Environment & Install Dependencies

uv sync

Usage

Run Full Pipeline

The main script runs all analysis stages and compiles the final report:

uv run main.py

Options

uv run main.py --skip-analysis  # only compile report
uv run main.py --skip-report    # only run analysis scripts
uv run main.py --assistant      # launch AI assistant
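
For orientation, here is a minimal sketch of how flags like these could be parsed with argparse. It is not the actual contents of main.py: the function names build_parser and plan are hypothetical, and it assumes that --assistant launches the assistant instead of running the pipeline.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags listed above (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--skip-analysis", action="store_true",
                        help="only compile the report")
    parser.add_argument("--skip-report", action="store_true",
                        help="only run the analysis scripts")
    parser.add_argument("--assistant", action="store_true",
                        help="launch the AI assistant")
    return parser


def plan(argv):
    """Return the list of steps that would run for the given arguments."""
    args = build_parser().parse_args(argv)
    # Assumption: the assistant replaces the pipeline rather than following it.
    if args.assistant:
        return ["assistant"]
    steps = []
    if not args.skip_analysis:
        steps.append("analysis")
    if not args.skip_report:
        steps.append("report")
    return steps


print(plan(["--skip-report"]))
```

Returning the planned steps as a list keeps the flag logic easy to check without running any analysis.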

AI Assistant

The project includes an AI-powered assistant for exploring the analysis data interactively.

Setup

  1. Get an API key from Google AI Studio.
  2. Create a .env file in the project root containing the key and the model name:

    AI_API_KEY=your-api-key-here
    AI_MODEL=gemini-3-flash-preview
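
To check that the .env file is picked up correctly, a standard-library sketch like the one below may help. It is an illustration only: parse_env and load_settings are hypothetical names, and the assistant itself may read the file differently (for example through a dotenv library).

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env_path: str = ".env") -> dict:
    """Read AI_API_KEY and AI_MODEL from the given .env file."""
    path = Path(env_path)
    values = parse_env(path.read_text(encoding="utf-8")) if path.exists() else {}
    api_key = values.get("AI_API_KEY", "")
    if not api_key or api_key == "your-api-key-here":
        raise RuntimeError("AI_API_KEY is missing or still set to the placeholder")
    # No default model is assumed here; an empty string means "not configured".
    return {"api_key": api_key, "model": values.get("AI_MODEL", "")}
```

Calling load_settings() from the project root raises an error if the key is absent or was left as the placeholder value.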

Usage

uv run main.py --assistant

The assistant can answer questions about:

  • Enrolment statistics and trends
  • State and district comparisons
  • Migration patterns
  • Data quality metrics
  • Anomaly detection results
  • And more...

Run Individual Scripts

uv run scripts/preprocess.py
uv run scripts/univariate.py
# etc.
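
To run several stages in one go, a small wrapper along these lines could be used. This is a sketch, not part of the repository: the script names come from the Project Structure listing, but the order shown is an assumption (only preprocess.py is presumed to come first), and build_commands and run_stages are hypothetical helpers.

```python
import subprocess

# Script names from the scripts/ directory; the order is an assumption.
STAGES = [
    "preprocess",
    "univariate",
    "bivariate",
    "trivariate",
    "data_quality",
    "advanced",
    "spatial",
]


def build_commands(stages=STAGES):
    """Return one `uv run scripts/<name>.py` command per stage."""
    return [["uv", "run", f"scripts/{name}.py"] for name in stages]


def run_stages(stages=STAGES):
    """Run each stage in turn and stop at the first failure."""
    for command in build_commands(stages):
        subprocess.run(command, check=True)
```

For example, run_stages(["preprocess", "univariate"]) would run only the first two scripts.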

Project Structure

├── main.py                 # Main pipeline script
├── assistant/              # AI-powered data exploration assistant
│   ├── __init__.py
│   ├── chat.py             # Chat interface using Gemini
│   └── data_provider.py    # Local data context provider
├── data/
│   ├── raw/                # Raw Aadhaar CSV files
│   │   ├── enrolment/      # New enrolment records
│   │   ├── demographic/    # Demographic update records
│   │   └── biometric/      # Biometric update records
│   ├── processed/          # Cleaned and normalized data
│   ├── intermediate/       # Intermediate processing artifacts
│   └── maps/               # Geographic boundary files (shapefiles, geojson)
├── scripts/
│   ├── preprocess.py       # Data cleaning and normalization
│   ├── univariate.py       # Single-variable analysis
│   ├── bivariate.py        # Two-variable relationship analysis
│   ├── trivariate.py       # Three-variable interaction analysis
│   ├── data_quality.py     # Data quality assessment
│   ├── advanced.py         # Advanced insights and forecasting
│   ├── spatial.py          # Geographic visualizations
│   └── utils/              # Shared utilities and constants
├── plots/
│   ├── univariate/         # Single-variable plots
│   ├── bivariate/          # Two-variable plots
│   ├── trivariate/         # Three-variable plots
│   ├── data_quality/       # Data quality visualizations
│   └── advanced/           # Advanced analysis plots
├── analysis/               # JSON outputs from analysis scripts
├── descriptions/           # YAML descriptions for plots and analysis
└── report/
    ├── main.typ            # Typst source document
    └── report.pdf          # Compiled PDF report (generated)

Scripts

Script           Description
preprocess.py    Loads raw CSVs, normalizes state/district names, validates pincodes, parses dates
univariate.py    State-wise distribution, age groups, temporal trends, activity patterns
bivariate.py     Correlation analysis, state-age relationships, migration patterns
trivariate.py    State-time-enrolment clustering, age-time dynamics, anomaly detection
data_quality.py  Spelling variations, naming inconsistencies, data entry issues
advanced.py      Demand forecasting, migration corridors, fraud indicators, resource allocation
spatial.py       Geographic map visualizations using shapefiles
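
As an illustration of the cleaning steps listed for preprocess.py, the sketch below shows one way to validate pincodes, normalize names, and parse dates. It is not the repository's implementation: the helper names are hypothetical, and the date formats and the title-case normalization are assumptions. The pincode rule (six digits, first digit not zero) follows the standard Indian PIN format.

```python
import re
from datetime import date, datetime
from typing import Optional

# Indian PIN codes have six digits and never start with 0.
PINCODE_RE = re.compile(r"[1-9][0-9]{5}")


def is_valid_pincode(value) -> bool:
    """Check that the value is a well-formed six-digit PIN code."""
    return PINCODE_RE.fullmatch(str(value).strip()) is not None


def normalize_name(name: str) -> str:
    """Collapse whitespace, drop punctuation, and title-case a place name."""
    cleaned = re.sub(r"[^A-Za-z ]+", " ", name.replace("&", " and "))
    return " ".join(cleaned.split()).title()


def parse_date(value: str,
               formats=("%d-%m-%Y", "%Y-%m-%d", "%d/%m/%Y")) -> Optional[date]:
    """Try each assumed format in turn; return None if none of them match."""
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning None for unparseable dates, rather than raising, lets a cleaning pass count bad rows instead of stopping at the first one.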

Output

  • 67 plots across 5 analysis categories
  • JSON analysis files with computed statistics
  • 71-page PDF report with findings and recommendations
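
The JSON outputs can also be inspected directly. The snippet below is a minimal sketch that assumes the files sit directly inside analysis/ with a .json extension; the load_analysis helper is hypothetical, and the keys inside each file depend on the script that wrote it.

```python
import json
from pathlib import Path


def load_analysis(directory: str = "analysis") -> dict:
    """Load every *.json file in the directory, keyed by file name without extension."""
    results = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.stem] = json.load(handle)
    return results
```

Calling load_analysis() from the project root after a pipeline run returns a dictionary with one entry per analysis file.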

License

This project is licensed under the AGPL-3.0 License. See the LICENSE file for details.
