Quality-control (QC) experiments for segmentation reliability. This repo trains U-Net models, trains score predictors for QC, and evaluates multiple QC baselines (score agreement, Mahalanobis distance, and predictive entropy) with a shared analysis notebook.
- src/model - model implementations:
  - src/model/unet - U-Net models.
  - src/model/score_predictor - score-predictor implementation.
  - src/model/mahalanobis - Mahalanobis-distance model.
  - src/model/calibration - calibration utilities.
- src/apps - training and evaluation entrypoints.
- src/notebooks - evaluation notebooks.
- results/ - saved outputs (per-dataset/split/method runs).
- pre-trained/ - pretrained checkpoints.
U-Net training and evaluation:
Score-predictor training and evaluation (a Beta$_{\mu,\kappa}$ QC head on top of the U-Net):
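For intuition, a minimal PyTorch sketch of one common mean/concentration parameterization of a Beta QC head: the network predicts $\mu \in (0,1)$ (the expected quality score) and $\kappa > 0$ (the concentration), mapped to $\alpha = \mu\kappa$, $\beta = (1-\mu)\kappa$. The class and function names here (`BetaQCHead`, `beta_nll`) are illustrative assumptions, not the repo's actual API:

```python
import torch
import torch.nn as nn

class BetaQCHead(nn.Module):
    """Predicts a Beta(mu, kappa) distribution over a quality score in (0, 1).

    Mean/concentration parameterization: alpha = mu * kappa and
    beta = (1 - mu) * kappa, so mu is the predicted score and kappa
    controls how concentrated the prediction is.
    """

    def __init__(self, in_features: int):
        super().__init__()
        self.mu_head = nn.Linear(in_features, 1)
        self.kappa_head = nn.Linear(in_features, 1)

    def forward(self, features: torch.Tensor) -> torch.distributions.Beta:
        mu = torch.sigmoid(self.mu_head(features))                        # mean in (0, 1)
        kappa = nn.functional.softplus(self.kappa_head(features)) + 1e-4  # concentration > 0
        return torch.distributions.Beta(mu * kappa, (1.0 - mu) * kappa)

def beta_nll(dist: torch.distributions.Beta, score: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of an observed score (e.g. Dice), clamped off {0, 1}."""
    score = score.clamp(1e-4, 1.0 - 1e-4)
    return -dist.log_prob(score).mean()
```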
These scripts compute the QC signals and save them into results files for later aggregation (a sketch of the entropy and Mahalanobis computations follows the list):
- Score agreement (SA): src/apps/eval_score_agreement.sh
- Mahalanobis distance (Maha): src/apps/eval_mahalanobis.sh
- Predictive entropy (PE): src/apps/eval_comp_entropy.sh
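Predictive entropy and the Mahalanobis distance are standard signals. Below is a minimal NumPy sketch of both, assuming the U-Net exposes per-pixel softmax probabilities and a per-case feature vector; the function names are illustrative, not the repo's API:

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> float:
    """Mean per-pixel entropy of softmax probabilities with shape (C, H, W)."""
    eps = 1e-12
    pixel_entropy = -(probs * np.log(probs + eps)).sum(axis=0)  # (H, W)
    return float(pixel_entropy.mean())

def fit_gaussian(train_features: np.ndarray):
    """Fit mean and regularized inverse covariance of in-distribution features, shape (N, D)."""
    mean = train_features.mean(axis=0)
    cov = np.cov(train_features, rowvar=False) + 1e-6 * np.eye(train_features.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis(feature: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of one case's feature vector to the training distribution."""
    diff = feature - mean
    return float(np.sqrt(diff @ cov_inv @ diff))
```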
The QC analysis workflow is in src/notebooks/QC_eval.ipynb. It:
- Loads results for multiple datasets/splits and all runs.
- Fits calibrators (thresholding for correlation-based methods; beta adapters for beta-based predictors).
- Computes ranking metrics (Pearson's $\rho$, MAE, eAURC) and risk-control metrics (Rec+ / Rec− at t=0.8, α=0.95); see the metric sketch after this list.
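For reference, a minimal sketch of eAURC (excess area under the risk-coverage curve) together with one plausible reading of Rec+ / Rec−. The exact α=0.95 risk-control procedure follows the paper and is not reproduced here; all names and the Rec+ / Rec− definitions below are assumptions for illustration:

```python
import numpy as np

def eaurc(quality: np.ndarray, predicted: np.ndarray) -> float:
    """Excess AURC: AURC of the QC method's ranking minus the oracle AURC.

    `quality` holds true scores (e.g. Dice); risk is taken as 1 - quality,
    and `predicted` ranks cases from most to least trusted.
    """
    risk = 1.0 - quality
    coverage = np.arange(1, len(risk) + 1)
    aurc = (np.cumsum(risk[np.argsort(-predicted)]) / coverage).mean()
    oracle_aurc = (np.cumsum(np.sort(risk)) / coverage).mean()
    return float(aurc - oracle_aurc)

def recall_pos_neg(quality: np.ndarray, predicted: np.ndarray, t: float = 0.8):
    """One plausible reading of Rec+ / Rec− (an assumption, not the paper's code):
    treating quality >= t as 'good', Rec+ is the fraction of good cases the QC
    method also rates >= t, and Rec− the fraction of bad cases it rates < t.
    Assumes `predicted` is calibrated to the same [0, 1] scale as `quality`."""
    good = quality >= t
    rec_pos = float((predicted[good] >= t).mean()) if good.any() else float("nan")
    rec_neg = float((predicted[~good] < t).mean()) if (~good).any() else float("nan")
    return rec_pos, rec_neg
```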
U-Net evaluation is in src/notebooks/unet_eval.ipynb.
- Results are organized by dataset, split, method, and run ID under results/.
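For orientation, the implied hierarchy looks roughly like this (placeholder names in angle brackets; exact folder names may differ):

```
results/
  <dataset>/
    <split>/
      <method>/      # e.g. score agreement, Mahalanobis, entropy
        <run_id>/...
```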
- Dataset shifts used in the paper:
  - M&Ms scanner drift → scanner-symphonytim
  - M&Ms pathology drift → pathology-norm-vs-fall-scanners-all
  - PMRI dataset shift → promise12
  - PMRI 3T→1.5T shift → threet-to-onepointfivet