# feat: modular pipeline — registries, hook-based trainer, and uncertainty quantification (#8)

**Status:** Open. samrat-rm wants to merge 47 commits into `Orion-AI-Lab:main`.
## Summary

This PR started as a structural refactor but grew into a full pipeline overhaul, landing real new capabilities alongside the modularization.

**Original goal:** restructure the codebase into a clean, extensible pipeline with no functional changes.

What actually landed:
### Structural Refactor

- `BaseModel` abstract class with decorator-based `MODEL_REGISTRY`
- `loss_registry` for loss functions
- `PipelineConfig` dataclass with `from_json()`
- `pipeline/runner.py` as a clean entry point

### Hook-Based Training Architecture

Introduced a `TrainingHook` extension system — the core loop never changes; features plug in as hooks (`on_batch_end`, `on_epoch_end`):

- `UncertaintyHook` — tracks mean softmax entropy per epoch during training
- `EntropyRegHook` — adds multi-scale entropy regularization per batch
- `AttributionHook` — placeholder registered for explainability (GradCAM)
### Uncertainty Quantification

Mean softmax entropy is tracked in two places intentionally — training (`UncertaintyHook`) and validation (`validate_all`) — and reported alongside Pixel Accuracy and mIoU. Decreasing validation uncertainty over epochs signals the model gaining confidence; sustained high values indicate the model is struggling.
### Entropy Regularization

Without `EntropyRegHook`, models trained on imbalanced cloud datasets collapse toward the dominant class (clear sky), inflating accuracy while missing thin clouds. The entropy penalty keeps the model from defaulting to the easy answer.

### Explainability (`AttributionHook`)

Keeps attribution logic decoupled from the training loop. Planned methods: GradCAM++ for spatial activation maps and DeepSHAP for per-feature (band, DEM, weather) contribution scores.
## What Changed

### 1. Project Structure

Reorganized the project into dedicated modules with clear responsibilities:
Split `common_metrics.py` into:

- `loss.py` — loss function definitions
- `validation.py` — validation loop logic
- `metrics.py` — metric computation

Added `__init__.py` across all modules for reliable package imports.

### 2. Model Registry (`model_builder/`)

- `BaseModel` (ABC) — enforces a common interface on all models: `forward()`, `from_config()`, `name`
- `@register_model` decorator — register a new model with one line, no changes to core logic
- `get_model(name, config)` — single entry point for model instantiation
- All models migrated to inherit `BaseModel`: `UnetModel`, `SegFormerModel`, `DeepLabV3Model`, `SwinUnetModel`, `SiameseUNet`, `SwinCloud`, `CDnetV2`, `HRCloudNet`, `BAM-CD`
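The decorator-based registry described above can be sketched in plain Python. This is an illustrative assumption of the shape, not the PR's actual code; the names `MODEL_REGISTRY`, `register_model`, `BaseModel`, and `get_model` come from the description, while the `UnetModel` body is a stub:

```python
from abc import ABC, abstractmethod

MODEL_REGISTRY: dict = {}

def register_model(name: str):
    """Class decorator: file a model class in the registry under `name`."""
    def decorator(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator

class BaseModel(ABC):
    """Common interface every model must implement."""

    @abstractmethod
    def forward(self, x): ...

    @classmethod
    def from_config(cls, config: dict):
        # Default construction: pass config keys straight to the constructor.
        return cls(**config)

def get_model(name: str, config: dict) -> "BaseModel":
    """Single entry point: look up the class and build it from config."""
    if name not in MODEL_REGISTRY:
        raise KeyError(f"Unknown model '{name}'. Registered: {sorted(MODEL_REGISTRY)}")
    return MODEL_REGISTRY[name].from_config(config)

@register_model("unet")
class UnetModel(BaseModel):
    """Stub model: registration costs one decorator line, no core changes."""
    def __init__(self, channels: int = 3):
        self.channels = channels
    def forward(self, x):
        return x  # placeholder

model = get_model("unet", {"channels": 4})
```

Registering a new model touches no pipeline code: add a class, decorate it, and `get_model` can build it by name.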
### 3. Loss Registry (`training/`)

- `@register_loss` decorator — mirrors the model registry pattern
- `loss.py` — loss definitions
- `loss_builders.py` — instantiation logic
- `get_loss(name, class_counts, device)` — single entry point for loss creation

### 4. Hook-Based Trainer (`training/`)

Introduces an extensible training architecture — new behaviors (regularization, logging, attribution) attach as hooks without touching the core training loop.
Core abstractions:

- `TrainingHook` (ABC)
  - `on_batch_end()` → returns an optional auxiliary loss tensor
  - `on_epoch_end()` → handles logging and scheduling
- `BaseTrainer` (ABC) — accumulates hook losses
- `CloudTrainer` (concrete) — handles dataloaders, dual-encoder models, CDnetV2 auxiliary outputs, early stopping, W&B logging, seeding
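A minimal sketch of the hook contract and of how a trainer might fold hook losses into the batch total. Plain floats stand in for loss tensors, and `ConstantPenaltyHook` / `accumulate_batch_loss` are hypothetical names for illustration:

```python
from abc import ABC

class TrainingHook(ABC):
    """Extension point: features observe training without changing the loop."""

    def on_batch_end(self, outputs, targets):
        """Return an optional auxiliary loss; None means no contribution."""
        return None

    def on_epoch_end(self, epoch: int, metrics: dict) -> None:
        """Handle logging and scheduling at epoch boundaries."""

class ConstantPenaltyHook(TrainingHook):
    """Toy hook contributing a fixed auxiliary loss every batch."""

    def __init__(self, value: float):
        self.value = value

    def on_batch_end(self, outputs, targets):
        return self.value

def accumulate_batch_loss(base_loss, hooks, outputs=None, targets=None):
    """Trainer side: sum each hook's auxiliary loss into the batch total."""
    total = base_loss
    for hook in hooks:
        aux = hook.on_batch_end(outputs, targets)
        if aux is not None:
            total = total + aux
    return total
```

With `ConstantPenaltyHook(0.05)` attached, a base loss of 1.0 becomes 1.05; hooks that return `None` leave the total untouched, which is how a purely observational hook can stay out of the gradient path.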
Built-in hooks:

- `EntropyRegHook` — entropy-based regularization loss
- `UncertaintyHook` — logs mean softmax entropy per epoch
- `AttributionHook` — placeholder for Grad-CAM++ / DeepSHAP attribution
### 5. `EntropyRegHook` — Entropy Regularization to Mitigate Unimodal Dominance

Implements multi-scale functional entropy regularization per batch, adopted directly from Section IV of the paper. It addresses the unimodal dominance problem, where a model collapses to predicting one class (e.g. always "cloudy") because doing so minimizes loss without learning class boundaries. The hook penalizes over-confident predictions by adding `-λ · H(p)` to the total loss each batch, forcing the model to maintain spread across classes throughout training. `λ` is configurable via `lambda_reg` in the config.
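The `-λ · H(p)` term can be illustrated in plain Python. Lists stand in for probability tensors, `regularized_loss` is a hypothetical helper, and the real hook operates on multi-scale softmax outputs rather than a single distribution:

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum(p_i * log p_i), natural log."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def regularized_loss(base_loss: float, probs, lambda_reg: float) -> float:
    """Batch loss with the entropy penalty: loss - lambda_reg * mean H(p).

    Subtracting lambda_reg * H(p) means minimizing the total rewards
    higher-entropy (less over-confident) predictions, which counters
    collapse onto the dominant class.
    """
    mean_h = sum(entropy(p) for p in probs) / len(probs)
    return base_loss - lambda_reg * mean_h

# A peaked distribution has lower entropy than a spread one,
# so it earns a smaller entropy reward:
low_h = entropy([0.98, 0.01, 0.01])
high_h = entropy([0.4, 0.3, 0.3])
```

Larger `lambda_reg` pushes harder against confident predictions; the config key lets that tradeoff be tuned per run.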
### 6. `UncertaintyHook` — Mean Entropy Uncertainty Logged Per Epoch

Tracks mean softmax entropy across the full validation set after each epoch, computed inside `validate_all()` and reported alongside IoU and loss. Decreasing uncertainty over epochs indicates a model gaining confidence on unseen data; sustained high values signal the model is struggling — a diagnostic beyond accuracy alone.
Mean softmax entropy is tracked in two separate places intentionally:

- **Training** (`UncertaintyHook.on_epoch_end`) — accumulates per-batch entropy during the forward pass and averages it at epoch end. Stored as `mean_uncertainty` in the training metrics dict.
- **Validation** (`validate_all`) — the same computation runs inline over the full validation set and is reported on the same line as Pixel Accuracy and mIoU.
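The accumulate-then-average pattern on the training side can be sketched as follows, with simplified signatures (the real hook sees model outputs and an epoch index; here lists of probability distributions stand in for softmax tensors):

```python
import math

def batch_mean_entropy(prob_batch):
    """Mean softmax entropy over one batch of probability distributions."""
    per_item = [
        -sum(p * math.log(p) for p in dist if p > 0.0) for dist in prob_batch
    ]
    return sum(per_item) / len(per_item)

class UncertaintyHook:
    """Accumulates per-batch mean entropy, then averages it at epoch end."""

    def __init__(self):
        self._total = 0.0
        self._batches = 0

    def on_batch_end(self, prob_batch):
        self._total += batch_mean_entropy(prob_batch)
        self._batches += 1
        return None  # observational only: contributes no auxiliary loss

    def on_epoch_end(self, metrics: dict):
        metrics["mean_uncertainty"] = self._total / max(self._batches, 1)
        self._total, self._batches = 0.0, 0  # reset for the next epoch
```

Resetting the accumulators at epoch end keeps each epoch's `mean_uncertainty` independent, so the logged series is directly comparable across epochs.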
### 7. Pipeline Entry Point (`pipeline/`)

Provides a unified entry point to construct and validate the full training pipeline with minimal setup.
Core components:

- `runner.py` — `build_pipeline(params_dict)` → wires registry, optimizer, loss functions, and hooks into a `CloudTrainer` in a single call
- `smoke_test.py` — end-to-end wiring check on CPU
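Roughly, the wiring `build_pipeline()` performs might look like this sketch, with toy registries and a stand-in `CloudTrainer` (every body here is an assumption for illustration, not the PR's implementation):

```python
# Toy registries stand in for the real model and loss registries.
MODELS = {"unet": lambda cfg: ("unet-model", cfg)}
LOSSES = {"ce": lambda: "cross-entropy"}

class CloudTrainer:
    """Stand-in trainer that simply holds the wired-together pieces."""
    def __init__(self, model, loss_fn, hooks):
        self.model = model
        self.loss_fn = loss_fn
        self.hooks = hooks

def build_pipeline(params: dict) -> CloudTrainer:
    """One call: params dict -> registry lookups -> trainer with hooks."""
    model = MODELS[params["model"]](params.get("model_config", {}))
    loss_fn = LOSSES[params["loss"]]()
    hooks = list(params.get("hooks", []))
    return CloudTrainer(model, loss_fn, hooks)

trainer = build_pipeline({"model": "unet", "loss": "ce", "hooks": []})
```

Because the lookups go through registries, a smoke test only needs a params dict and dummy data to exercise the full wiring, which is exactly what `smoke_test.py` relies on.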
### 8. Typed Config (`configs/`)

Introduces a structured, validated configuration system to replace untyped parameter dictionaries.

Core components:

- `PipelineConfig` (dataclass)
  - `__post_init__()` → validates fields such as `lambda_reg`
  - `from_json(path, key)` → loads config from JSON
  - `to_dict()` → converts back to a `params_dict` for compatibility with existing pipeline components
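A minimal sketch of such a validated config. The fields (`model`, `loss`, `lambda_reg`) are illustrative assumptions; only the method names match the description:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PipelineConfig:
    """Typed replacement for an untyped params dict (illustrative fields)."""
    model: str
    loss: str
    lambda_reg: float = 0.01

    def __post_init__(self):
        # Validation runs at construction time, so bad configs fail fast.
        if self.lambda_reg < 0:
            raise ValueError(f"lambda_reg must be >= 0, got {self.lambda_reg}")

    @classmethod
    def from_json(cls, path: str, key: str) -> "PipelineConfig":
        """Load one named config section from a JSON file."""
        with open(path) as f:
            return cls(**json.load(f)[key])

    def to_dict(self) -> dict:
        # Back-compat bridge for components still expecting a params_dict.
        return asdict(self)
```

Fail-fast validation in `__post_init__` is the main win over a raw dict: a typo'd or out-of-range value raises immediately instead of surfacing mid-training.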
### 9. Dependencies

Added `requirements.txt` for one-step environment setup.

## What Does NOT Change
Ensures backward compatibility and stability across the existing training ecosystem.

Unaffected components:

- `fine_tune_models.py`
- `train_model()` interface

### Backward Compatibility

`train_model()` is deprecated but retained — still used by `fine_tune_models.py`. Migration to the hook-based trainer is tracked in the next steps.
## Verified Working

The full pipeline wiring — model instantiation via registry, loss construction, hook accumulation, and forward + backward pass — has been validated end-to-end on CPU, with no dataset or GPU required.

Smoke test (`pipeline/smoke_test.py`) confirms:

- `build_pipeline()` correctly wires registry → optimizer → loss → hooks → `CloudTrainer`
- `on_batch_end()` hook losses accumulate into `total_loss` without errors

Model test flow (`evaluation/model_test.py`):

- Post-refactor checkpoint loading verified with key remapping for `BaseModel` wrapper compatibility (see commit `e6c74fb`)
- Evaluation metrics (accuracy, mIoU, precision, recall, F1) confirmed working through the refactored `evaluation/` module
- Single-pass entropy uncertainty metric (from PR #4, "Compute and print mean single-pass uncertainty metrics during model test") confirmed compatible with the refactored evaluation path
## Next Steps

- Pipeline parameters (model name, loss, hooks, data paths, etc.)
- `run_pipeline(config)` that wires model registry → loss registry → trainer → hooks
- `AttributionHook` with Grad-CAM++ (spatial inputs) and DeepSHAP (scalar/meteorological inputs)
- Config declares data format; registry maps to the right `Dataset` class
- `fine_tune_models.py` — remove the deprecated `train_model()` path
- `requirements.txt` and `README.md` updates

## AI Disclosure
AI assistance (Claude) was used in parts of this PR in the following ways:

- Discussing architecture and tradeoffs before settling on an approach (e.g., `BaseModel` ABC structure, registry decorator pattern, hook interface skeleton)

All AI-generated code and ideas were manually reviewed, tested, and validated before being committed. Final implementation decisions and code ownership rest entirely with the contributor.