
feat: modular pipeline — registries, hook-based trainer, and uncertainty quantification#8

Open
samrat-rm wants to merge 47 commits into Orion-AI-Lab:main from samrat-rm:refactor/modularize-pipeline

Conversation


@samrat-rm samrat-rm commented Mar 19, 2026

Summary

This PR started as a structural refactor but grew into a full pipeline overhaul, adding new capabilities alongside the modularization.

Original goal: Restructure the codebase into a clean, extensible pipeline with no functional changes.

What actually landed:

Structural Refactor

  • Moved models, data loaders, loss functions, metrics, and evaluation into separate modules
  • Added BaseModel abstract class with decorator-based MODEL_REGISTRY, loss_registry, and PipelineConfig dataclass with from_json()
  • Added pipeline/runner.py as a clean entry point

Hook-Based Training Architecture

Introduced a TrainingHook extension system — the core loop never changes, features plug in as hooks (on_batch_end, on_epoch_end):

  • UncertaintyHook — tracks mean softmax entropy per epoch during training
  • EntropyRegHook — adds multi-scale entropy regularization per batch
  • AttributionHook — placeholder registered for explainability (GradCAM)

Uncertainty Quantification

Mean softmax entropy is tracked in two places intentionally — training (UncertaintyHook) and validation (validate_all), reported alongside Pixel Accuracy and mIoU. Decreasing validation uncertainty over epochs signals the model gaining confidence; sustained high values indicate the model is struggling.

Entropy Regularization

Without EntropyRegHook, models on imbalanced cloud datasets collapse toward the dominant class (clear sky), inflating accuracy while missing thin clouds. The entropy penalty keeps the model from defaulting to the easy answer.

Explainability (AttributionHook)

Keeps attribution logic decoupled from the training loop. Planned methods: GradCAM++ for spatial activation maps and DeepSHAP for per-feature (band, DEM, weather) contribution scores.


What Changed

1. Project Structure

Reorganized the project into dedicated modules with clear responsibilities:

.
├── configs/          # Pipeline and training configurations
├── data/             # Dataset loaders and augmentation
├── evaluation/       # Metrics and validation logic
├── model_builder/    # Model registry and abstract base class
├── models/           # All model architectures
├── training/         # Trainer, hooks, loss registry
├── utils/            # Shared utilities
└── results/          # Outputs and checkpoints

Split common_metrics.py into:

  • loss.py — loss function definitions
  • validation.py — validation loop logic
  • metrics.py — metric computation

Added __init__.py across all modules for reliable package imports.


2. Model Registry (model_builder/)

  • BaseModel (ABC) — enforces a common interface on all models:
    • forward()
    • from_config()
    • name
  • @register_model decorator — register a new model with one line,
    no changes to core logic
  • get_model(name, config) — single entry point for model instantiation

All models migrated to inherit BaseModel:
UnetModel, SegFormerModel, DeepLabV3Model, SwinUnetModel,
SiameseUNet, SwinCloud, CDnetV2, HRCloudNet, BAM-CD
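
The registry pattern described above can be sketched as follows. This is a hypothetical minimal version; the actual signatures in model_builder/ may differ (e.g. the real models are torch.nn.Modules).

```python
from abc import ABC, abstractmethod

# Decorator-based registry: registering a model touches no core logic.
MODEL_REGISTRY = {}

def register_model(name):
    """Add a BaseModel subclass to MODEL_REGISTRY under `name`."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

class BaseModel(ABC):
    @abstractmethod
    def forward(self, x): ...

    @classmethod
    def from_config(cls, config):
        # Default: pass config keys straight to the constructor.
        return cls(**config)

    @property
    def name(self):
        return type(self).__name__

def get_model(name, config):
    """Single entry point: look up the class and build it from config."""
    if name not in MODEL_REGISTRY:
        raise KeyError(f"Unknown model '{name}'. Registered: {sorted(MODEL_REGISTRY)}")
    return MODEL_REGISTRY[name].from_config(config)

@register_model("unet")
class UnetModel(BaseModel):
    def __init__(self, in_channels=3, num_classes=2):
        self.in_channels = in_channels
        self.num_classes = num_classes

    def forward(self, x):
        return x  # stand-in for the real forward pass
```

Adding a new architecture is then one decorator line on the class, with no edits to get_model or the trainer.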


3. Loss Registry (training/)

  • @register_loss decorator — mirrors the model registry pattern
  • loss.py — loss definitions
  • loss_builders.py — instantiation logic
  • get_loss(name, class_counts, device) — single entry point for loss creation
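
Since the loss registry mirrors the model registry, a sketch only needs to show the class_counts-aware builder. The following is illustrative; the real builders in training/ presumably return torch loss modules rather than plain dicts.

```python
# Hypothetical loss registry mirroring the model registry pattern.
loss_registry = {}

def register_loss(name):
    def wrap(builder):
        loss_registry[name] = builder
        return builder
    return wrap

@register_loss("weighted_ce")
def build_weighted_ce(class_counts, device):
    # Inverse-frequency weights: rare classes get larger weight,
    # which matters on imbalanced cloud datasets.
    total = sum(class_counts)
    weights = [total / c for c in class_counts]
    return {"type": "weighted_ce", "weights": weights, "device": device}

def get_loss(name, class_counts, device):
    """Single entry point for loss creation."""
    return loss_registry[name](class_counts, device)
```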

4. Hook-Based Trainer (training/)

Introduces an extensible training architecture — new behaviors (regularization,
logging, attribution) attach as hooks without touching the core training loop.

Core abstractions:

  • TrainingHook (ABC)
    • on_batch_end() → returns optional auxiliary loss tensor
    • on_epoch_end() → handles logging and scheduling
  • BaseTrainer (ABC) — accumulates hook losses:
  total_loss = seg_loss + sum(hook.on_batch_end())
  • CloudTrainer (concrete) — handles dataloaders, dual-encoder models,
    CDnetV2 auxiliary outputs, early stopping, W&B logging, seeding

Built-in hooks:

  • EntropyRegHook — entropy-based regularization loss
  • UncertaintyHook — logs mean softmax entropy per epoch
  • AttributionHook — placeholder for Grad-CAM++ / DeepSHAP attribution
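
The hook contract and loss accumulation can be sketched as below. This is a toy version under assumed signatures; the actual hooks receive richer context (model, outputs, batch index, epoch) than shown here.

```python
from abc import ABC

class TrainingHook(ABC):
    """Extension point: the core loop calls these, features plug in."""
    def on_batch_end(self, outputs):
        return 0.0  # optional auxiliary loss contribution
    def on_epoch_end(self, epoch):
        pass  # logging / scheduling

class ConstantPenaltyHook(TrainingHook):
    """Toy hook that adds a fixed auxiliary loss each batch."""
    def __init__(self, value):
        self.value = value
    def on_batch_end(self, outputs):
        return self.value

def batch_loss(seg_loss, hooks, outputs):
    # The trainer accumulates hook losses exactly as described above:
    # total_loss = seg_loss + sum(hook.on_batch_end(...))
    return seg_loss + sum(h.on_batch_end(outputs) for h in hooks)
```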

5. EntropyRegHook — Entropy Regularization to Mitigate Unimodal Dominance

EntropyRegHook implements multi-scale functional entropy regularization per batch, adopted from Section IV of the paper. It addresses the unimodal dominance problem, where a model collapses to predicting a single class (e.g. always "cloudy") because that minimizes the loss without learning class boundaries. The hook penalizes over-confident predictions by adding -λ · H(p) to the total loss each batch, forcing the model to keep probability mass spread across classes throughout training. λ is configurable via lambda_reg in the config.
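
A single-scale illustration of the penalty (the hook itself is multi-scale and operates on per-pixel softmax maps with torch):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum(p_i * log p_i) of one probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def regularized_loss(seg_loss, probs, lambda_reg=0.1):
    # Adding -lambda * H(p) to the loss rewards spread-out predictions:
    # a collapsed, over-confident model (H near 0) gets no discount.
    mean_h = sum(entropy(p) for p in probs) / len(probs)
    return seg_loss - lambda_reg * mean_h
```

A fully confident prediction has zero entropy and therefore a strictly higher total loss than a prediction with the same seg_loss but non-trivial spread.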


6. UncertaintyHook — Mean Entropy Uncertainty Logged Per Epoch

UncertaintyHook tracks mean softmax entropy across the full validation set after each epoch; the value is computed inside validate_all() and reported alongside IoU and loss.


Decreasing uncertainty over epochs indicates a model gaining confidence on unseen data. Sustained high values signal the model is struggling — a diagnostic signal beyond accuracy alone.

Mean softmax entropy is tracked in two separate places intentionally:

  • Training (UncertaintyHook.on_epoch_end) — accumulates per-batch entropy during
    the forward pass and averages it at epoch end. Stored as mean_uncertainty in the
    training metrics dict.
  • Validation (validate_all) — the same computation runs inline over the full
    validation set and is reported on the same line as Pixel Accuracy and mIoU.
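
The training-side accumulation can be sketched like this. It is a simplified stand-in; the real hook computes entropy on GPU tensors during the forward pass rather than on Python lists.

```python
import math

class UncertaintyHook:
    """Running mean of per-batch softmax entropy, reported at epoch end."""
    def __init__(self):
        self._sum = 0.0
        self._batches = 0

    def on_batch_end(self, batch_probs):
        # Mean per-pixel entropy of this batch's softmax outputs.
        ents = [-sum(p * math.log(p) for p in px if p > 0) for px in batch_probs]
        self._sum += sum(ents) / len(ents)
        self._batches += 1
        return 0.0  # purely diagnostic: contributes no loss

    def on_epoch_end(self, metrics):
        metrics["mean_uncertainty"] = self._sum / max(self._batches, 1)
        self._sum, self._batches = 0.0, 0  # reset for next epoch
```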

7. Pipeline Entry Point (pipeline/)

Provides a unified entry point to construct and validate the full training pipeline with minimal setup.

Core components:

  • runner.py
    • build_pipeline(params_dict) → wires registry, optimizer, loss functions, and hooks into a CloudTrainer in a single call
  • smoke_test.py
    • End-to-end validation script to ensure forward pass, backward pass, and hook accumulation execute correctly on CPU
    • Uses dummy data — no dataset or GPU required
    • Verified working:
      python pipeline/smoke_test.py
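
A hypothetical params_dict for build_pipeline(), to show the shape of the wiring (the exact keys accepted by pipeline/runner.py may differ):

```python
# Assumed config fragment; key names are illustrative.
params = {
    "model": "unet",          # looked up in MODEL_REGISTRY
    "loss": "weighted_ce",    # looked up in the loss registry
    "optimizer": "adam",
    "lr": 1e-4,
    "batch_size": 4,
    "lambda_reg": 0.1,        # consumed by EntropyRegHook
}
# trainer = build_pipeline(params)  # returns a fully wired CloudTrainer
```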
      

8. Typed Config (configs/)

Introduces a structured and validated configuration system to replace untyped parameter dictionaries.

Core components:

  • PipelineConfig (dataclass)
    • Strongly typed configuration schema
    • __post_init__() → validates:
      • loss name
      • optimizer type
      • learning rate
      • batch size
      • lambda_reg
  • Utility methods:
    • from_json(path, key) → loads config from JSON with:
      • unknown-field detection
      • missing-key validation
    • to_dict() → converts back to params_dict for compatibility with existing pipeline components
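
A simplified sketch of such a typed config; the real PipelineConfig validates more fields (including known loss and optimizer names), and the field set here is assumed for illustration.

```python
from dataclasses import dataclass, fields
import json

@dataclass
class PipelineConfig:
    model: str
    loss: str
    optimizer: str = "adam"
    lr: float = 1e-4
    batch_size: int = 4
    lambda_reg: float = 0.0

    def __post_init__(self):
        # Fail fast on invalid values instead of deep inside training.
        if self.lr <= 0:
            raise ValueError("lr must be positive")
        if self.batch_size < 1:
            raise ValueError("batch_size must be >= 1")
        if self.lambda_reg < 0:
            raise ValueError("lambda_reg must be non-negative")

    @classmethod
    def from_json(cls, path, key):
        with open(path) as f:
            raw = json.load(f)[key]
        # Unknown-field detection; missing required keys raise TypeError
        # from the dataclass constructor itself.
        unknown = set(raw) - {f.name for f in fields(cls)}
        if unknown:
            raise ValueError(f"Unknown config fields: {sorted(unknown)}")
        return cls(**raw)

    def to_dict(self):
        # Back-compat bridge to the old params_dict interface.
        return {f.name: getattr(self, f.name) for f in fields(self)}
```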

9. Dependencies

Added requirements.txt for one-step environment setup:

pip install -r requirements.txt

What Does NOT Change

The refactor preserves backward compatibility and stability across the existing training ecosystem.

Unaffected components:

  • Data loading, validation, and metrics
    • Existing pipelines remain unchanged and fully operational
  • JSON run configurations
    • All current config files are fully compatible without modification
  • fine_tune_models.py
    • Continues to function using the deprecated train_model() interface
    • No required changes for existing fine-tuning workflows

Pre-refactor .pth checkpoints load correctly into BaseModel wrappers via
automatic key remapping — no retraining required


Backward Compatibility

train_model() is deprecated but retained — still used by fine_tune_models.py.
Migration to the hook-based trainer is tracked in the next steps.


Verified Working

The full pipeline wiring — model instantiation via registry, loss construction,
hook accumulation, and forward + backward pass — has been validated end-to-end
on CPU with no dataset or GPU required.

Smoke test (pipeline/smoke_test.py):

python pipeline/smoke_test.py

Confirms:

  • build_pipeline() correctly wires registry → optimizer → loss → hooks → CloudTrainer
  • Forward pass produces expected output shape
  • on_batch_end() hook losses accumulate into total_loss without errors
  • Backward pass completes cleanly

Model test flow (evaluation/model_test.py):

  • Post-refactor checkpoint loading verified with key remapping for BaseModel wrapper compatibility (see commit e6c74fb)

  • Evaluation metrics (accuracy, mIoU, precision, recall, F1) confirmed working through the refactored evaluation/ module

  • Single-pass entropy uncertainty metric (from PR #4, "Compute and print mean single-pass uncertainty metrics during model test") confirmed compatible with refactored evaluation path


Next Steps

  • Pipeline config schema — typed dataclass / YAML config for all
    pipeline parameters (model name, loss, hooks, data paths, etc.)
  • Pipeline runner — single entry point: run_pipeline(config)
    that wires model registry → loss registry → trainer → hooks
  • Explainability hook — implement AttributionHook with
    Grad-CAM++ (spatial inputs) and DeepSHAP (scalar/meteorological inputs)
  • Uncertainty hook (advanced) — MC Dropout-based uncertainty estimation (if required)
  • Loader registry — decouple data loader selection from model type;
    config declares data format, registry maps to the right Dataset class
  • Migrate fine_tune_models.py — remove deprecated train_model() path
  • Refine requirements.txt and update README.md






AI Disclosure:
AI assistance (Claude) was used in parts of this PR in the following ways:

  • Idea generation — exploring design patterns (registry, decorator, hook/callback
    architecture) and discussing tradeoffs before settling on an approach
  • Draft implementation — generating initial scaffolding for some components
    (e.g., BaseModel ABC structure, registry decorator pattern, hook interface skeleton)
  • PR description — drafting and structuring this description

All AI-generated code and ideas were manually reviewed, tested, and validated
before being committed. Final implementation decisions and code ownership rest
entirely with the contributor.
