A machine learning system that adapts the propose→solve→verify self-play paradigm from Absolute Zero Reasoner (AZR) (arXiv:2505.03335) to time series forecasting and anomaly detection in energy consumption data.
This Final Year Project explores how self-play reinforcement learning can enhance time series forecasting by training models to propose challenging scenarios, solve them accurately, and verify solutions through realistic constraints. We focus on household energy consumption prediction with validation against real distribution network feeders.
Key Innovation: Unlike traditional supervised learning on historical data, our approach generates synthetic scenarios that stress-test model capabilities while maintaining physical plausibility through verifiable reward signals.
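To make the adapted paradigm concrete, the snippet below sketches one propose→solve→verify iteration on a half-hourly load window. All names, the perturbation scheme, and the reward terms are illustrative assumptions for exposition, not the project's actual implementation.

```python
# Minimal sketch of a propose→solve→verify iteration (hypothetical interfaces).
# The proposer perturbs a real window into a synthetic scenario, the solver
# forecasts it, and the verifier scores accuracy plus physical plausibility.
from dataclasses import dataclass
import numpy as np

@dataclass
class Scenario:
    history: np.ndarray   # past consumption, kWh per 30-min slot
    target: np.ndarray    # future consumption to be forecast

def propose(base_window: np.ndarray, horizon: int, rng: np.random.Generator) -> Scenario:
    """Generate a challenging but plausible scenario from a real window."""
    scale = rng.uniform(0.8, 1.5)                        # demand shift
    noise = rng.normal(0.0, 0.05, size=base_window.shape)
    perturbed = np.clip(base_window * scale + noise, 0.0, None)
    return Scenario(history=perturbed[:-horizon], target=perturbed[-horizon:])

def solve(scenario: Scenario, horizon: int) -> np.ndarray:
    """Stand-in solver: seasonal-naive forecast (repeat the last day)."""
    period = 48                                          # 30-min slots per day
    last_day = scenario.history[-period:]
    reps = int(np.ceil(horizon / period))
    return np.tile(last_day, reps)[:horizon]

def verify(scenario: Scenario, forecast: np.ndarray, max_kwh: float = 10.0) -> float:
    """Reward = accuracy term + physical-plausibility check."""
    accuracy = -np.mean(np.abs(forecast - scenario.target))     # negative MAE
    plausible = float(np.all((forecast >= 0) & (forecast <= max_kwh)))
    return accuracy + plausible

rng = np.random.default_rng(0)
base = rng.gamma(2.0, 0.25, size=48 * 8)                 # ~8 days of synthetic load
scenario = propose(base, horizon=48, rng=rng)
reward = verify(scenario, solve(scenario, horizon=48))
print(f"self-play reward: {reward:.3f}")
```

In the actual system the solver would be a learned forecaster (e.g. PatchTST) and the verifier would apply network-informed constraints; the seasonal-naive stand-in just keeps the sketch runnable.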
```mermaid
graph TB
subgraph "Raw Data Sources"
A[UK-DALE<br/>Household Energy]
B[London Smart Meters<br/>LCL Dataset]
C[SSEN LV Feeder<br/>Distribution Network]
end
subgraph "Processing Pipeline"
D[Data Harmonization<br/>30-min resolution]
E[Feature Engineering<br/>Weather, Calendar, Lags]
end
subgraph "Self-Play Training"
F[Proposer<br/>Scenario Generation]
G[Solver<br/>TS Forecasting Model]
H[Verifier<br/>Constraint Validation]
end
subgraph "Validation & Evaluation"
I[Pseudo-Feeder<br/>Aggregation]
J[Distributional<br/>Comparison]
K[Anomaly Case<br/>Studies]
end
A --> D
B --> D
D --> E
E --> F
F --> G
G --> H
H --> F
E --> I
C --> J
I --> J
J --> K
```
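
As a rough illustration of the Feature Engineering stage in the diagram above, here is a minimal pandas sketch that harmonizes a series to 30-minute resolution and adds calendar and lag features. The column names ('timestamp', 'kwh') are hypothetical placeholders, and weather joins are omitted.

```python
# Minimal calendar + lag feature engineering on a 30-minute consumption series.
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values("timestamp").set_index("timestamp")
    df = df.resample("30min").sum()                  # harmonize to 30-min resolution
    df["hour"] = df.index.hour
    df["dayofweek"] = df.index.dayofweek
    df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)
    for lag in (1, 48, 48 * 7):                      # previous slot, same slot yesterday / last week
        df[f"kwh_lag_{lag}"] = df["kwh"].shift(lag)
    return df.dropna()

# Tiny usage example on synthetic data
idx = pd.date_range("2024-01-01", periods=48 * 14, freq="30min")
demo = pd.DataFrame({"timestamp": idx,
                     "kwh": np.random.default_rng(0).gamma(2.0, 0.25, len(idx))})
print(add_features(demo).head())
```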
- Real-World Grid Validation: Actual SSEN distribution network data (100K consumption readings from operational feeders)
- Latest Architectures: PatchTST and N-BEATS variants with uncertainty quantification
- Verifiable Rewards: Physics-based constraints ensure realistic scenario generation
- Multi-Scale Validation: Household-level accuracy with distribution-feeder-level realism checks
- Production MLOps: DVC data versioning, MLflow experiment tracking, comprehensive CI/CD
- Uncertainty Quantification: Quantile regression heads and Monte Carlo dropout (see the sketch after this list)
- Open Science: Reproducible experiments with clear data governance
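A minimal sketch of the quantile-regression side of the uncertainty-quantification bullet: the pinball loss that a quantile head would minimize, written here in PyTorch (an assumption suggested by the neural baselines, not stated in this README). Tensor shapes and quantile choices are illustrative.

```python
# Pinball (quantile) loss. For target y and prediction y_hat at quantile q:
#   L_q(y, y_hat) = max(q * (y - y_hat), (q - 1) * (y - y_hat))
# Averaging over several quantiles yields calibrated prediction intervals.
import torch

def pinball_loss(y_true: torch.Tensor, y_pred: torch.Tensor, quantiles: list[float]) -> torch.Tensor:
    """y_pred has one trailing column per quantile; y_true is broadcast against it."""
    losses = []
    for i, q in enumerate(quantiles):
        err = y_true - y_pred[..., i]
        losses.append(torch.maximum(q * err, (q - 1) * err).mean())
    return torch.stack(losses).mean()

quantiles = [0.1, 0.5, 0.9]
y_true = torch.randn(32, 48)                   # batch of 48-step targets
y_pred = torch.randn(32, 48, len(quantiles))   # one prediction per quantile
print(pinball_loss(y_true, y_pred, quantiles))
```

Monte Carlo dropout, the other half of that bullet, would instead keep dropout active at inference and average several stochastic forward passes.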
Current Phase: Data Ingestion & Exploration
Next Milestone: Self-Play Prototype Implementation
| Dataset | Size | Records | Households/Feeders | Purpose |
|---|---|---|---|---|
| LCL (London Smart Meters) | 8.54 GB | ~167M readings | 5,567 households | Training & validation |
| UK-DALE | 6.33 GB | ~114M readings | 5 houses | Appliance-level analysis |
| SSEN (LV Feeder Data) | 37 MB | 100K metadata + 100K consumption | 100K feeders (28 with time-series) | Real-world validation |
| Total | ~15 GB | ~281M readings | 5,572+ entities | — |
All datasets tracked with DVC. SSEN provides actual operational grid data for validating pseudo-feeder realism. See data/README_raw.md for access instructions.
- Python 3.11+
- Poetry for dependency management
- Git with LFS support

```bash
# Clone the repository
git clone https://github.com/vatsalmehta/FYP-Predictive_Anomaly_Detection.git
cd FYP-Predictive_Anomaly_Detection
# Install dependencies
poetry install
# Activate virtual environment
poetry shell
# Install pre-commit hooks
pre-commit install
# Pull data if remote configured (optional)
# dvc pull
# Run smoke tests
pytest tests/
# Verify pipeline (placeholder stages)
dvc repro
```

This project uses DVC (Data Version Control) to manage large datasets while keeping Git repositories lightweight.

```bash
# Use built-in synthetic samples (already available)
ls data/samples/
# → lcl_sample.csv, ukdale_sample.csv, ssen_sample.csv
```

```bash
# 1. Download datasets (see docs/download_links.md for sources)
# Place in: data/raw/ukdale/, data/raw/lcl/, data/raw/ssen/
# 2. Track with DVC
dvc add data/raw/ukdale
dvc add data/raw/lcl
dvc add data/raw/ssen
# 3. Commit pointers (not data!) to Git
git add data/raw/*.dvc dvc.lock
git commit -m "DVC: track raw datasets via pointers"
# 4. Optional: Set up remote storage for team sharing
dvc remote add -d myremote s3://my-bucket/fyp-data/
dvc push
```

Dataset Locations:
- data/raw/ukdale/ → UK-DALE household consumption
- data/raw/lcl/ → London Smart Meters data
- data/raw/ssen/ → SSEN distribution feeder data
- data/samples/ → Tiny synthetic samples for demos/CI
Resources:
- Dataset download links & setup
- Complete DVC workflow guide
- Ingestion specifications
- Baseline models documentation

```bash
# Quick test with samples (no downloads needed)
python -m fyp.ingestion.cli lcl --use-samples
python -m fyp.ingestion.cli ukdale --use-samples
python -m fyp.ingestion.cli ssen --use-samples
# Full ingestion (requires raw data)
python -m fyp.ingestion.cli lcl
python -m fyp.ingestion.cli ukdale --downsample-30min
python -m fyp.ingestion.cli ssen  # Uses CKAN API
```

```bash
# Quick forecasting baselines on samples
python -m fyp.runner forecast --dataset lcl --use-samples
# Anomaly detection baselines
python -m fyp.runner anomaly --dataset ukdale --use-samples
# Full evaluation with custom horizon
python -m fyp.runner forecast --dataset ssen --horizon 96
# Modern neural models with uncertainty quantification
python -m fyp.runner forecast --dataset lcl --model-type patchtst --use-samples
python -m fyp.runner anomaly --dataset ukdale --model-type autoencoder --use-samples
# Note: Use canonical import path fyp.anomaly.autoencoder
# (old path fyp.models.autoencoder still works but is deprecated)
```

```bash
# Check code quality
pre-commit run --all-files
# Run full test suite
pytest tests/ -v
# Check pipeline status
dvc status
# View experiment tracking (when available)
mlflow ui
```

```text
├── .github/        # GitHub workflows and issue templates
├── docs/           # Comprehensive documentation
├── notebooks/      # Jupyter notebooks for exploration
├── src/fyp/        # Main package source code
├── tests/          # Test suite
├── data/           # Data directories (DVC tracked)
│   ├── raw/        # Original datasets (gitignored)
│   ├── processed/  # Cleaned and transformed data
│   └── derived/    # Model outputs and artifacts
└── dvc.yaml        # DVC pipeline definition
```
- No Ground-Truth Anomaly Labels: The datasets lack labeled anomalies. We address this through:
- Physics-based constraints from SSEN
- Self-play learning without labels
- Synthetic test set for quantitative evaluation (see the anomaly-injection sketch below)
- SSEN Time-Series Data: We currently have feeder metadata only. Time-series consumption requires:
- Research partnership agreement, OR
- API access (pending), OR
- Pseudo-feeder generation from LCL aggregations (our approach)
- Large Dataset Processing: The LCL CSV (8.5 GB) requires:
- Chunked reading for memory efficiency
- Parquet conversion for fast queries (see the Parquet-conversion sketch below)
- Current implementation tested on 16GB+ RAM
- HDF5 Dependencies: UK-DALE requires the h5py library and proper HDF5 handling
- Focus on Novelty Over SOTA: This project prioritizes:
- Novel self-play approach to unsupervised anomaly detection
- Physics-informed verification using real network constraints
- Demonstrating feasibility of label-free learning
- NOT achieving state-of-the-art forecasting accuracy
These are documented features, not bugs. See docs/anomaly_strategy.md for our approach.
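
Anomaly-injection sketch (referenced from the ground-truth-labels limitation above): with no labeled anomalies, a synthetic test set can be built by corrupting clean series with known faults and scoring detectors with ordinary precision/recall. The anomaly types and rates here are illustrative, not the documented recipe in docs/anomaly_strategy.md.

```python
# Inject labelled synthetic anomalies (spikes and dropouts) into a clean
# half-hourly consumption series.
import numpy as np

def inject_anomalies(series: np.ndarray, rate: float = 0.01, seed: int = 0):
    rng = np.random.default_rng(seed)
    corrupted = series.copy()
    labels = np.zeros(len(series), dtype=bool)
    idx = rng.choice(len(series), size=max(1, int(rate * len(series))), replace=False)
    for i in idx:
        if rng.random() < 0.5:
            corrupted[i] *= rng.uniform(3.0, 6.0)    # spike (e.g. faulty meter burst)
        else:
            corrupted[i] = 0.0                       # dropout (missing/zero reading)
        labels[i] = True
    return corrupted, labels

clean = np.random.default_rng(1).gamma(2.0, 0.25, size=48 * 30)   # ~30 days
corrupted, labels = inject_anomalies(clean)
print(f"{labels.sum()} anomalies injected into {len(clean)} readings")
```

Parquet-conversion sketch (referenced from the large-dataset-processing limitation above): chunked CSV reading keeps peak memory bounded while writing a single Parquet file. Paths, chunk size, and the assumption of consistent dtypes across chunks are illustrative.

```python
# Chunked CSV → Parquet conversion so the 8.5 GB LCL file never has to fit
# in memory at once.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
for chunk in pd.read_csv("data/raw/lcl/consumption.csv", chunksize=1_000_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter("data/processed/lcl.parquet", table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()
```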
- No PII Joins: Personal identifiable information is never linked across datasets
- SSEN Validation Only: Distribution network data used solely for external validation
- Anonymized Analysis: All household-level analysis maintains user anonymity
- Data Minimization: Only essential features extracted for modeling purposes
- Transparent Methods: All processing steps documented and reproducible
- Datasets: UK-DALE, London Smart Meters, and SSEN LV Feeder details
- Data Governance: DVC setup, provenance, and retention policies
- Self-Play Design: Propose→solve→verify architecture for time series
- Experiments: MLflow organization and naming conventions
- Feeder Evaluation: Validation methodology against real networks (see the sketch below)
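
To make the feeder-evaluation idea concrete, here is a minimal sketch of aggregating sampled LCL households into a pseudo-feeder and scoring its realism against a real SSEN feeder profile with a Wasserstein distance. The column names and the choice of distance are illustrative assumptions, not the documented methodology.

```python
# Pseudo-feeder construction: sum household-level LCL load over a sampled group
# of meters, then compare its distribution against a real SSEN feeder profile.
import pandas as pd
from scipy.stats import wasserstein_distance

def make_pseudo_feeder(lcl: pd.DataFrame, n_households: int = 50, seed: int = 0) -> pd.Series:
    meters = lcl["meter_id"].drop_duplicates().sample(n_households, random_state=seed)
    subset = lcl[lcl["meter_id"].isin(meters)]
    return subset.groupby("timestamp")["kwh"].sum()   # aggregate load per 30-min slot

def feeder_realism(pseudo: pd.Series, real: pd.Series) -> float:
    """Smaller distance = pseudo-feeder load distribution looks more like the real feeder."""
    return wasserstein_distance(pseudo.to_numpy(), real.to_numpy())
```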
We welcome contributions! Please see our Contributing Guide for details on:
- Development workflow and branch management
- Code style and testing requirements
- Experiment tracking best practices
Please read our Code of Conduct before participating.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this work in your research, please cite:
```bibtex
@software{fyp_energy_forecasting,
  title  = {AZR-inspired Energy Forecasting & Anomaly Detection},
  author = {Your Name},
  year   = {2025},
  url    = {https://github.com/vatsalmehta2001/FYP-Predictive_Anomaly_Detection}
}
```

See CITATION.cff for complete citation metadata.
- Absolute Zero Reasoner (AZR) - Propose→solve→verify paradigm we adapt
- PatchTST - Patch-based transformer for time series
- N-BEATS - Neural basis expansion analysis for forecasting
- UK-DALE - UK Domestic Appliance-Level Electricity dataset