AZR-inspired Energy Forecasting & Anomaly Detection

License: MIT | Python 3.11+

A machine learning system that adapts the propose→solve→verify self-play paradigm from Absolute Zero Reasoner (AZR) (arXiv:2505.03335) to time series forecasting and anomaly detection in energy consumption data.

Project Vision

This Final Year Project explores how self-play reinforcement learning can enhance time series forecasting by training models to propose challenging scenarios, solve them accurately, and verify solutions through realistic constraints. We focus on household energy consumption prediction with validation against real distribution network feeders.

Key Innovation: Unlike traditional supervised learning on historical data, our approach generates synthetic scenarios that stress-test model capabilities while maintaining physical plausibility through verifiable reward signals.
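
To make the propose→solve→verify loop concrete, here is a minimal, self-contained sketch. The class names (ScenarioProposer, ForecastSolver, ConstraintVerifier), the spike-injection proposer, the persistence solver, and the reward shaping are all illustrative assumptions and do not reflect the actual fyp package API.

# Minimal, illustrative sketch of the propose -> solve -> verify cycle.
# All names here (ScenarioProposer, ForecastSolver, ConstraintVerifier)
# are hypothetical and do NOT mirror the fyp package API.
import numpy as np

rng = np.random.default_rng(0)

class ScenarioProposer:
    """Perturbs a historical window to create a harder, but plausible, scenario."""
    def propose(self, window: np.ndarray) -> np.ndarray:
        spike = rng.uniform(1.2, 2.0)            # inject a demand spike
        idx = rng.integers(0, len(window))       # spike onset
        scenario = window.copy()
        scenario[idx:] *= spike
        return scenario

class ForecastSolver:
    """Stands in for the forecasting model (e.g. PatchTST / N-BEATS)."""
    def solve(self, scenario: np.ndarray, horizon: int = 4) -> np.ndarray:
        # Naive persistence forecast as a placeholder for a trained model.
        return np.repeat(scenario[-1], horizon)

class ConstraintVerifier:
    """Rejects scenarios/forecasts that violate simple physical limits."""
    def __init__(self, max_kw: float = 10.0):
        self.max_kw = max_kw
    def reward(self, scenario: np.ndarray, forecast: np.ndarray) -> float:
        if scenario.max() > self.max_kw or (forecast < 0).any():
            return -1.0                           # physically implausible
        return 1.0 / (1.0 + abs(forecast.mean() - scenario[-4:].mean()))

window = rng.uniform(0.1, 2.0, size=48)           # 24 h of 30-min readings
proposer, solver, verifier = ScenarioProposer(), ForecastSolver(), ConstraintVerifier()
for step in range(3):
    scenario = proposer.propose(window)
    forecast = solver.solve(scenario)
    print(f"step {step}: reward = {verifier.reward(scenario, forecast):.3f}")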

Data Flow Architecture

graph TB
    subgraph "Raw Data Sources"
        A[UK-DALE<br/>Household Energy]
        B[London Smart Meters<br/>LCL Dataset]
        C[SSEN LV Feeder<br/>Distribution Network]
    end

    subgraph "Processing Pipeline"
        D[Data Harmonization<br/>30-min resolution]
        E[Feature Engineering<br/>Weather, Calendar, Lags]
    end

    subgraph "Self-Play Training"
        F[Proposer<br/>Scenario Generation]
        G[Solver<br/>TS Forecasting Model]
        H[Verifier<br/>Constraint Validation]
    end

    subgraph "Validation & Evaluation"
        I[Pseudo-Feeder<br/>Aggregation]
        J[Distributional<br/>Comparison]
        K[Anomaly Case<br/>Studies]
    end

    A --> D
    B --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> F
    E --> I
    C --> J
    I --> J
    J --> K
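To ground the "Data Harmonization" and "Feature Engineering" stages above, the snippet below resamples a synthetic meter series to 30-minute resolution and derives calendar and lag features. The column names and lag choices are assumptions for illustration, not the pipeline's real schema.

# Illustrative sketch of the harmonisation + feature-engineering stages.
# Column names and lag choices are assumptions, not the real pipeline schema.
import numpy as np
import pandas as pd

# Fake 1-minute meter readings covering two days.
idx = pd.date_range("2024-01-01", periods=2 * 24 * 60, freq="1min")
raw = pd.DataFrame({"kw": np.random.default_rng(0).uniform(0.05, 3.0, len(idx))}, index=idx)

# Harmonise to the common 30-minute resolution used across datasets.
half_hourly = raw["kw"].resample("30min").mean().to_frame("kw")

# Calendar features.
half_hourly["hour"] = half_hourly.index.hour
half_hourly["dayofweek"] = half_hourly.index.dayofweek
half_hourly["is_weekend"] = (half_hourly["dayofweek"] >= 5).astype(int)

# Lag features: previous half-hour and the same slot one day earlier (48 steps).
half_hourly["lag_1"] = half_hourly["kw"].shift(1)
half_hourly["lag_48"] = half_hourly["kw"].shift(48)

features = half_hourly.dropna()
print(features.head())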

Unique Selling Points

  • Real-World Grid Validation: Actual SSEN distribution network data (100K consumption readings from operational feeders)
  • Latest Architectures: PatchTST and N-BEATS variants with uncertainty quantification
  • Verifiable Rewards: Physics-based constraints ensure realistic scenario generation
  • Multi-Scale Validation: Household-level accuracy with distribution-feeder-level realism checks
  • Production MLOps: DVC data versioning, MLflow experiment tracking, comprehensive CI/CD
  • Uncertainty Quantification: Quantile regression heads and Monte Carlo dropout (a pinball-loss sketch follows this list)
  • Open Science: Reproducible experiments with clear data governance
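
To make the quantile-regression item concrete, here is the standard pinball (quantile) loss in plain NumPy; how it is attached to the model heads in this project is an implementation detail and is not shown here.

# Standard pinball (quantile) loss, the training objective behind
# quantile-regression forecast heads. Pure NumPy; model wiring not shown.
import numpy as np

def pinball_loss(y_true: np.ndarray, y_pred: np.ndarray, q: float) -> float:
    """Average pinball loss for quantile level q (0 < q < 1)."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

y_true = np.array([1.2, 0.8, 1.5, 2.0])   # observed kW
y_p10  = np.array([0.9, 0.6, 1.1, 1.4])   # 10th-percentile forecast
y_p90  = np.array([1.8, 1.3, 2.2, 2.9])   # 90th-percentile forecast

print("P10 loss:", pinball_loss(y_true, y_p10, 0.1))
print("P90 loss:", pinball_loss(y_true, y_p90, 0.9))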

Project Status

Current Phase: Data Ingestion & Exploration
Next Milestone: Self-Play Prototype Implementation

Completed
  • Data infrastructure (DVC)
  • Datasets acquired (15.2 GB)
  • Ingestion pipeline built
  • Baseline models implemented
  • Testing framework

In Progress
  • Full dataset ingestion
  • Exploratory analysis
  • Anomaly strategy defined
  • SSEN constraint extraction

Upcoming
  • Self-play architecture
  • Proposer/Verifier agents
  • Model training
  • Evaluation & writing

Datasets Overview

| Dataset                   | Size    | Records                         | Households/Feeders                   | Purpose                  |
| ------------------------- | ------- | ------------------------------- | ------------------------------------ | ------------------------ |
| LCL (London Smart Meters) | 8.54 GB | ~167M readings                  | 5,567 households                     | Training & validation    |
| UK-DALE                   | 6.33 GB | ~114M readings                  | 5 houses                             | Appliance-level analysis |
| SSEN (LV Feeder Data)     | 37 MB   | 100K metadata + 100K consumption | 100K feeders (28 with time-series)  | Real-world validation    |
| Total                     | ~15 GB  | ~281M readings                  | 5,572+ entities                      |                          |

All datasets tracked with DVC. SSEN provides actual operational grid data for validating pseudo-feeder realism. See data/README_raw.md for access instructions.

Quick Start

Prerequisites

  • Python 3.11+
  • Poetry for dependency management
  • Git with LFS support

Installation

# Clone the repository
git clone https://github.com/vatsalmehta/FYP-Predictive_Anomaly_Detection.git
cd FYP-Predictive_Anomaly_Detection

# Install dependencies
poetry install

# Activate virtual environment
poetry shell

# Install pre-commit hooks
pre-commit install

# Pull data if remote configured (optional)
# dvc pull

# Run smoke tests
pytest tests/

# Verify pipeline (placeholder stages)
dvc repro

Data Onboarding

This project uses DVC (Data Version Control) to manage large datasets while keeping Git repositories lightweight.

For Quick Testing/CI

# Use built-in synthetic samples (already available)
ls data/samples/
# → lcl_sample.csv, ukdale_sample.csv, ssen_sample.csv

For Full Development

# 1. Download datasets (see docs/download_links.md for sources)
#    Place in: data/raw/ukdale/, data/raw/lcl/, data/raw/ssen/

# 2. Track with DVC
dvc add data/raw/ukdale
dvc add data/raw/lcl
dvc add data/raw/ssen

# 3. Commit pointers (not data!) to Git
git add data/raw/*.dvc dvc.lock
git commit -m "DVC: track raw datasets via pointers"

# 4. Optional: Set up remote storage for team sharing
dvc remote add -d myremote s3://my-bucket/fyp-data/
dvc push

Dataset Locations:

  • data/raw/ukdale/ → UK-DALE household consumption
  • data/raw/lcl/ → London Smart Meters data
  • data/raw/ssen/ → SSEN distribution feeder data
  • data/samples/ → Tiny synthetic samples for demos/CI

Resources:

Data Ingestion

# Quick test with samples (no downloads needed)
python -m fyp.ingestion.cli lcl --use-samples
python -m fyp.ingestion.cli ukdale --use-samples
python -m fyp.ingestion.cli ssen --use-samples

# Full ingestion (requires raw data)
python -m fyp.ingestion.cli lcl
python -m fyp.ingestion.cli ukdale --downsample-30min
python -m fyp.ingestion.cli ssen  # Uses CKAN API

Baseline Models

# Quick forecasting baselines on samples
python -m fyp.runner forecast --dataset lcl --use-samples

# Anomaly detection baselines
python -m fyp.runner anomaly --dataset ukdale --use-samples

# Full evaluation with custom horizon
python -m fyp.runner forecast --dataset ssen --horizon 96

# Modern neural models with uncertainty quantification
python -m fyp.runner forecast --dataset lcl --model-type patchtst --use-samples
python -m fyp.runner anomaly --dataset ukdale --model-type autoencoder --use-samples

# Note: Use canonical import path fyp.anomaly.autoencoder
# (old path fyp.models.autoencoder still works but deprecated)
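
The baselines above produce forecasts and anomaly scores through the fyp package. Purely as an illustration of one common scoring idea (not necessarily what the package implements), the snippet below flags readings whose forecast residual is an outlier under a robust z-score.

# Illustration of residual-based anomaly scoring (not the fyp implementation):
# flag readings whose forecast error is an outlier under a robust z-score.
import numpy as np

rng = np.random.default_rng(1)
actual = rng.normal(1.0, 0.1, 200)
actual[120] += 2.5                       # injected anomaly
forecast = np.full_like(actual, 1.0)     # stand-in for a model forecast

residual = actual - forecast
mad = np.median(np.abs(residual - np.median(residual)))
robust_z = 0.6745 * (residual - np.median(residual)) / mad

anomalies = np.flatnonzero(np.abs(robust_z) > 3.5)
print("Flagged indices:", anomalies)     # should include index 120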

Running Locally

# Check code quality
pre-commit run --all-files

# Run full test suite
pytest tests/ -v

# Check pipeline status
dvc status

# View experiment tracking (when available)
mlflow ui

Project Structure

├── .github/           # GitHub workflows and issue templates
├── docs/              # Comprehensive documentation
├── notebooks/         # Jupyter notebooks for exploration
├── src/fyp/          # Main package source code
├── tests/            # Test suite
├── data/             # Data directories (DVC tracked)
│   ├── raw/          # Original datasets (gitignored)
│   ├── processed/    # Cleaned and transformed data
│   └── derived/      # Model outputs and artifacts
└── dvc.yaml          # DVC pipeline definition

Known Issues & Limitations

Data Limitations

  1. No Ground-Truth Anomaly Labels: Datasets lack labeled anomalies. We address this through:

    • Physics-based constraints from SSEN
    • Self-play learning without labels
    • Synthetic test set for quantitative evaluation
  2. SSEN Time-Series Data: We currently hold feeder metadata only; obtaining time-series consumption requires one of:

    • Research partnership agreement, OR
    • API access (pending), OR
    • Pseudo-feeder generation from LCL aggregations (our approach; see the sketch after this list)
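
The sketch below illustrates the pseudo-feeder idea under simple assumptions (synthetic household profiles and an assumed group size of 40 households per feeder); it is not the project's implementation.

# Illustrative pseudo-feeder sketch with assumed sizes, not the fyp code:
# sum a random group of households into a synthetic feeder, then compare its
# load distribution against a reference feeder profile.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n_households, n_steps = 200, 48 * 7             # one week of 30-min readings
households = rng.gamma(shape=2.0, scale=0.3, size=(n_households, n_steps))

# Aggregate an assumed group of 40 households into one pseudo-feeder.
members = rng.choice(n_households, size=40, replace=False)
pseudo_feeder = households[members].sum(axis=0)

# Stand-in for a real SSEN feeder profile.
reference_feeder = rng.gamma(shape=2.0, scale=0.3, size=(40, n_steps)).sum(axis=0)

print("Wasserstein distance:", wasserstein_distance(pseudo_feeder, reference_feeder))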

Technical Constraints

  1. Large Dataset Processing: The LCL CSV (8.5 GB) requires the following (see the conversion sketch after this list):

    • Chunked reading for memory efficiency
    • Parquet conversion for fast queries
    • Current implementation tested on 16GB+ RAM
  2. HDF5 Dependencies: UK-DALE ships as HDF5 files, so ingestion requires the h5py library and careful HDF5 handling
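
A minimal sketch of the chunked CSV-to-Parquet conversion in item 1, assuming hypothetical file paths and column names (the real ingestion CLI handles this, along with the UK-DALE HDF5 files, internally):

# Sketch of chunked CSV -> Parquet conversion (item 1 above). Paths and column
# names are hypothetical; the real ingestion CLI does this internally.
# to_parquet requires pyarrow (or fastparquet).
import pandas as pd

csv_path = "data/raw/lcl/lcl.csv"          # hypothetical location
reader = pd.read_csv(
    csv_path,
    parse_dates=["DateTime"],              # assumed timestamp column
    chunksize=1_000_000,                   # ~1M rows at a time bounds memory use
)
for i, chunk in enumerate(reader):
    chunk.to_parquet(f"data/processed/lcl_part_{i:04d}.parquet", index=False)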

Scope Decisions

  1. Focus on Novelty Over SOTA: This project prioritizes:
    • Novel self-play approach to unsupervised anomaly detection
    • Physics-informed verification using real network constraints
    • Demonstrating feasibility of label-free learning
    • Explicitly out of scope: matching state-of-the-art forecasting accuracy

These are deliberate scope decisions, not defects. See docs/anomaly_strategy.md for our approach.

Ethics & Privacy

  • No PII Joins: Personally identifiable information is never linked across datasets
  • SSEN Validation Only: Distribution network data used solely for external validation
  • Anonymized Analysis: All household-level analysis maintains user anonymity
  • Data Minimization: Only essential features extracted for modeling purposes
  • Transparent Methods: All processing steps documented and reproducible

Documentation

Contributing

We welcome contributions! Please see our Contributing Guide for details on:

  • Development workflow and branch management
  • Code style and testing requirements
  • Experiment tracking best practices

Please read our Code of Conduct before participating.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this work in your research, please cite:

@software{fyp_energy_forecasting,
  title = {AZR-inspired Energy Forecasting & Anomaly Detection},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/vatsalmehta2001/FYP-Predictive_Anomaly_Detection}
}

See CITATION.cff for complete citation metadata.

Related Work
