CMB-ML: A Cosmic Microwave Background Radiation Dataset for Machine Learning

Due to double-blinding, download utilities that rely on Box have been disabled. See Blinding Notes for more information.

ZENODO (DOI) BADGE HERE

Contents:

  • Quick Start
  • Introduction
  • Simulation
  • Cleaning
  • Analysis
  • Blinding Notes for Reviewers
  • New Methods
  • Installation
  • Dataset Only
  • Demonstrations
  • Comparing Results
  • Outside Works
  • Errata
  • Data File Links

Quick Start

To get started:

  • Get this repository
  • Set up your Python environment
  • Create datasets (Downloading is usually an option; contact the authors of the repository if needed)
  • Train models
  • Run inference
  • Compare results

See Installation and Demonstrations for more detail.

Introduction

CMB Radiation Example

The Cosmic Microwave Background (CMB) radiation signal is one of the cornerstones of modern cosmologists' understanding of the universe. The signal must be separated from other natural phenomena that either emit their own microwave signals or alter the CMB signal itself. Modern machine learning and computer vision algorithms seem well suited to the task, but generating the data is cumbersome and no standard public datasets are available. Models and algorithms created for the task are seldom compared outside the largest collaborations.

The CMB-ML dataset bridges the gap between astrophysics and machine learning. It handles simulation, modeling, and analysis.

This is a somewhat complicated pipeline. We hope that the structure of CMB-ML gives you an opportunity to focus on a small portion of it. For many users, we expect this to be the modeling portion. Several examples are presented, showing how different methods can be used to clean the CMB signal. Details are provided below and in ancillary material on how to acquire the dataset, apply a cleaning method, and use the included analysis code.

Other portions of the pipeline may also be changed. Simulated foregrounds can be changed simply by using different parameters for the core engine. With more work, alternative or additional components can be used, or the engine itself can be swapped out. A couple of noise models particular to the Planck mission have been developed. At the other end of the pipeline, the analysis can be adapted to different methods. We are currently improving this portion of the pipeline.

A goal of this project has been to encapsulate the various stages of the pipeline separately from the operational parameters. It is our hope that this enables you to easily compare your results with other methods.

Several tools enable this work. Hydra is used to manage the pipeline so that coherent configurations are applied consistently. CMB-ML uses the PySM3 simulation library in conjunction with CAMB, astropy, and healpy to handle much of the astrophysics. Three baselines are implemented, with more to follow. One baseline comes from astrophysics: PyILC's implementation of the CNILC method. Another uses machine learning: cmbNNCS's UNet8. A third is a simple PyTorch UNet implementation intended to serve as a template. The analysis portion of the pipeline uses a few simple metrics from scikit-learn along with the astrophysics tools.
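
For readers unfamiliar with Hydra, below is a minimal sketch of a Hydra-managed entry point. The config path and keys are illustrative placeholders, not CMB-ML's actual files.

```python
# Minimal sketch of a Hydra-managed entry point (illustrative only;
# the config path and keys below are hypothetical, not CMB-ML's actual files).
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="cfg", config_name="config")
def main(cfg: DictConfig) -> None:
    # Hydra composes the configuration from YAML files plus command-line
    # overrides (e.g. `python main.py nside=128`) and passes it in here.
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```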

Simulation

CMB Radiation Example

The real CMB signal is observed at several microwave wavelengths. To mimic this, we make a ground-truth CMB map and several contaminant foregrounds. We "observe" these at the different wavelengths, where each foreground contributes at a different level. We then apply instrumentation effects to get a set of observed maps. The standard dataset is produced at a low resolution, so that many simulations can be stored in a reasonable amount of space.
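
As a rough illustration of the kind of operation involved (not the repository's actual simulation code; the presets, frequency, and beam width below are arbitrary examples), PySM3 can evaluate foreground models at a chosen frequency and apply a beam:

```python
# Rough illustration of "observing" foregrounds with PySM3 (not CMB-ML's
# actual simulation code; presets, frequency, and beam are arbitrary examples).
import pysm3
import pysm3.units as u

nside = 512
sky = pysm3.Sky(nside=nside, preset_strings=["d1", "s1"])  # dust + synchrotron presets

freq = 100 * u.GHz
maps = sky.get_emission(freq)  # IQU maps in uK_RJ at this frequency
maps = maps.to(u.uK_CMB, equivalencies=u.cmb_equivalencies(freq))

# Apply a Gaussian beam as a stand-in for instrumentation effects.
observed = pysm3.apply_smoothing_and_coordinate_transform(maps, fwhm=9.68 * u.arcmin)
```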

Cleaning

Two models are included as baselines in this repository. One is a classic astrophysics algorithm, a flavor of internal linear combination methods, which employs cosine needlets (CNILC). The other is a machine learning method (a UNet) implemented and published in the astrophysics domain, CMBNNCS.

The CNILC method is implemented in PyILC and is described in this paper.

The cmbNNCS method is implemented in the cmbNNCS package and is described in this paper.

The third method, the PyTorch implementation of a UNet, is very similar to cmbNNCS and many other published models. Unlike cmbNNCS, it operates on small patches of maps instead of the full sky.

Analysis

We can compare the CMB predictions to the ground truths in order to determine how well each model works. However, because the models operate in fundamentally different ways, care is needed to ensure that they are compared in a consistent way. We first mask each prediction in the regions where the signal is often too bright to get meaningful predictions. We then remove the effects of instrumentation from the predictions. The pipeline set up to run each method is then used in a slightly different way: it pulls results from each method and produces output which directly compares them. The following figures were produced automatically by the pipeline, for quick review.
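
As a rough sketch of the comparison idea (not the pipeline's actual analysis code; the file names below are placeholders), a masked pixel-space comparison looks like this:

```python
# Sketch of the comparison idea: mask both maps identically, then score the
# prediction against the truth (illustrative only; file names are hypothetical).
import healpy as hp
import numpy as np

truth = hp.read_map("cmb_truth.fits")
pred = hp.read_map("cmb_pred.fits")
mask = hp.read_map("mask.fits")  # 1 where the sky is usable, 0 where too bright

valid = mask > 0.5
mae = np.mean(np.abs(pred[valid] - truth[valid]))
rmse = np.sqrt(np.mean((pred[valid] - truth[valid]) ** 2))
print(f"MAE: {mae:.2f} uK, RMSE: {rmse:.2f} uK")
```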

Map Cleaning Example Power Spectrum Example

Other figures of summary statistics are produced as well, but these are far more boring (for now!).

Blinding Notes for Reviewers

If you have somehow stumbled upon this and are not a reviewer, please contact us through GitHub and we will gladly redirect you to the fully functional repository being actively developed.

Download utilities for the dataset have been disabled. Download utilities for external science assets still function. Generating noise relies on either enabling noise model creation (in main_sims.py) or downloading the noise model files from this anonymized Google Drive and putting them into the target Datasets/CMB-ML_512_1450/NoiseModel directory. We have checked that downloads of these files cannot be traced by the authors. Nevertheless, logging out of active Google accounts is recommended.

This repository has one set of simulations in the assets folder (not present in this commit due to file size). These can be placed in Datasets/CMB-ML_512_1450/Simulations/Test for confirming the function of PyILC.

More simulations, as well as both the initial and final cmbNNCS models, are in the anonymous Google Drive. Simulations obtained from there need to be placed in Datasets/CMB-ML_512_1450/Simulations/Test. The final cmbNNCS model should be placed in Datasets/CMB-ML_512_1450/CMBNNCS_UNet8/CMBNNCS_D_Model.

We apologize for the inconvenience.

The rest of the README is largely unchanged from the main repository.

New Methods

We encourage you to first familiarize yourself with the content of the tutorial notebooks and Hydra. Afterwards, you may want to follow the patterns set in either the classic method or ML method demonstrations. The main difference between these is how much of the work you want to do within CMB-ML's pipeline; if you already have code that can take input parameters, the patterns for classic methods may be more appropriate.

At this time, the classic method patterns are non-functional suggestions. For operational code, see the PyILC method, which works (very well!). Please excuse any confusion caused by the hoops we jump through to run it on many simulations at once. Start with the first top-level script, which gets the pipeline through the cleaning process. The second top-level script must then be run to finish the process. Both scripts use the same configuration file; there is simply a conflict in execution due to matplotlib settings.

All of the ML patterns are functional. We suggest using the demonstration network as a prototype. The pipeline overview is in the top-level script. This network operates on patches of sky maps, cut directly from the HEALPix arrangement. Some preprocessing stages are needed to enable fast training. The training and prediction executors follow common PyTorch design patterns (train and predict). Both training and prediction use subclasses of a PyTorch Dataset.
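
For orientation, a stripped-down sketch of such a Dataset subclass is below; the file layout, names, and shapes are hypothetical, not the demonstration network's actual classes.

```python
# Stripped-down sketch of a patch Dataset (hypothetical file layout and names;
# see the demonstration network's executors for the real implementation).
import numpy as np
import torch
from torch.utils.data import Dataset


class PatchDataset(Dataset):
    """Serves (observation patches, CMB patch) pairs saved as .npy files."""

    def __init__(self, obs_paths, cmb_paths):
        self.obs_paths = obs_paths  # one file per simulation, shape (n_freq, n_patch, H, W)
        self.cmb_paths = cmb_paths  # one file per simulation, shape (n_patch, H, W)

    def __len__(self):
        return len(self.obs_paths)

    def __getitem__(self, idx):
        obs = np.load(self.obs_paths[idx])
        cmb = np.load(self.cmb_paths[idx])
        return torch.from_numpy(obs).float(), torch.from_numpy(cmb).float()
```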

As an alternative, see the cmbNNCS top-level script. The executors for this method are very similar to the demonstration network, though some changes are needed in order to adhere to the method described in the paper. It does differ more significantly in the predict stage, as this model predicts entire skymaps in a single operation.

Installation

See the next section if you don't want to install CMB-ML and just want the dataset.

Installation of CMB-ML requires setting up the repository, then getting the data assets for the portion you want to run. Demonstrations are available with practical examples. The early ones cover how to set up CMB-ML to run on your system.

Setting up the repository:

  • Clone this repository
  • Set up the Python environment, using conda
    • From within the repository, create a "cmb-ml" environment using the included env.yaml
      • conda env create -f env.yaml
    • Activate the environment
      • conda activate cmb-ml
  • Get PyILC
  • Configure your local system
    • In the configuration files, enter the directories where you will keep datasets and science assets
    • In pyilc_redir, edit the __init__.py file to point to the directory containing your local installation of pyilc (the directory containing pyilc's inputs.py and wavelets.py); see the sketch after this list for one possible form
    • See Setting up your environment for more information
  • Download some external science assets and the CMB-ML assets
    • External science assets include Planck's observation maps (from which we get information for producing noise) and Planck's NILC prediction map (for the mask; NILC is a parameter)
    • These are available from the original sources and a mirror set up for this purpose
    • CMB-ML assets include the substitute detector information and information required for downloading datasets
    • If you are not creating simulations, you only need one external science asset: "COM_CMB_IQU-nilc_2048_R3.00_full.fits" (for the mask)
    • Scripts that will download all files are available in the get_data folder.
  • Next, set up to run.
    • You will need to either generate simulations or download them.
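
The exact contents of pyilc_redir/__init__.py depend on where you keep pyilc. As one hedged example (the path below is a placeholder, not a real location), the redirection can be as simple as adding that directory to sys.path:

```python
# pyilc_redir/__init__.py -- one possible form of the redirection (the path
# below is a placeholder; point it at your own pyilc checkout).
import sys
from pathlib import Path

PYILC_DIR = Path("/path/to/your/pyilc")  # directory containing inputs.py and wavelets.py
sys.path.append(str(PYILC_DIR))
```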

Notes on Running Simulations

  • Generating the set of simulations takes considerable time, due to the large number.
  • Downloading them is likely to be faster.
  • When generating simulations for the first time, PySM3 relies on astropy to download and cache template maps.
    • These will be stored in an .astropy directory.
    • Downloading templates is sometimes interrupted, resulting in an error and a crash. This is annoying and beyond our control. However, because the templates are cached, the pipeline can be resumed and will proceed smoothly (see the snippet below for locating the cache).
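
If you need to inspect the cache, the following snippet (a sketch, assuming a standard astropy setup) prints its location and shows how to clear it:

```python
# Locate (or, if a partial download is corrupted, clear) the astropy download
# cache that holds the PySM3 templates; resuming the pipeline re-uses it.
from astropy.config import get_cache_dir
from astropy.utils.data import clear_download_cache

print(get_cache_dir())    # typically ~/.astropy/cache
# clear_download_cache()  # uncomment only if a cached download is corrupted
```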

For CMB_ML_512_1450

  • Download CMB_ML_512_1450
  • To train, predict, and run analysis with the demonstration UNet model
    • python main_patch_nn.py
  • To train, predict, and run analysis using CMBNNCS
    • python main_cmbnncs.py
  • To predict using PyILC (this must be performed separately from analysis due to import issues)
    • python main_pyilc_predict.py
  • To run analysis for PyILC
    • python main_pyilc_analysis.py
  • To compare results between CMBNNCS and PyILC
    • python main_analysis_compare.py

For CMB_ML_128_1450

This will run more quickly than the higher-resolution dataset.

  • Download CMB_ML_128_1450:
  • Run CMBNNCS on CMB_ML_128_1450 (the smaller UNet5 must be used):
    • python main_cmbnncs.py dataset_name=CMB_ML_128_1450 working_dir=CMBNNCS_UNet5/ nside=128 num_epochs=2 use_epochs=[2] model/cmbnncs/network=unet5
  • Run PyILC on CMB_ML_128_1450:
    • python main_pyilc_predict.py dataset_name=CMB_ML_128_1450 nside=128 ELLMAX=382 model.pyilc.distinct.N_scales=5 model.pyilc.distinct.ellpeaks=[100,200,300,383]
    • python main_pyilc_analysis.py dataset_name=CMB_ML_128_1450 nside=128 ELLMAX=382 model.pyilc.distinct.N_scales=5 model.pyilc.distinct.ellpeaks=[100,200,300,383]
    • An even faster method is available, using PyILC's HILC method.
  • Run Comparison:
    • python main_analysis_compare.py --config-name config_comp_models_t_128

Dataset Only

If you only want to get the dataset, you can use this notebook to download it. It includes a (short) list of required libraries.

Demonstrations

CMB-ML manages a complex pipeline that processes data across multiple stages. Each stage produces outputs that need to be tracked, reused, and processed in later stages. Without a clear framework, this can lead to disorganized code, redundant logic, and errors.

The CMB-ML library provides a set of tools to manage the pipeline in a modular and scalable way.

We include a set of demonstrations to help with both installation and an introduction to core concepts. The first introduces our approach to configuration management. That background paves the way to setting up a local configuration and getting the required files. Following this is a series of tutorials for the Python objects.

Most of these are in Jupyter notebooks.

Only the Setting up your environment notebook is really critical, though the others should help.

I'm interested in hearing what other demonstrations would be helpful; please let me know. I've considered these notebooks:

  • Executors, continued: showing how executors are set up for PyTorch training/inference and matplotlib figure production
  • Looking at actual pipeline stages and explaining them
  • Paper figure production (available in another repository; needs cleaning)

Comparing Results

Below is a list of the best results on the dataset. Please contact us through this repository to have your results listed. We do ask for the ability to verify those results.

We list below each model's performance aggregated across the Test split. We first calculate each metric for each simulation; the tables below contain the average values for each metric. The metrics currently implemented are Mean Absolute Error (MAE), Mean Squared Error (MSE), Normalized Root Mean Squared Error (NRMSE), and Peak Signal-to-Noise Ratio (PSNR). The first three give a general sense of precision; PSNR gives a worst-instance measure.
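
As a rough sketch of that aggregation (not the repository's actual analysis code; the PSNR peak convention and function names here are assumptions), the per-simulation metrics can be computed and then averaged across the split:

```python
# Sketch of the aggregation: compute each metric per simulation, then report
# the mean and spread across the Test split (illustrative only).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def psnr(truth, pred):
    # Uses the ground truth's dynamic range as the peak (one common convention).
    mse = np.mean((truth - pred) ** 2)
    return 10 * np.log10(np.ptp(truth) ** 2 / mse)


def summarize(truths, preds):
    """truths, preds: lists of per-simulation masked map arrays."""
    maes = [mean_absolute_error(t, p) for t, p in zip(truths, preds)]
    rmses = [np.sqrt(mean_squared_error(t, p)) for t, p in zip(truths, preds)]
    psnrs = [psnr(t, p) for t, p in zip(truths, preds)]
    return {name: (np.mean(vals), np.std(vals))
            for name, vals in [("MAE", maes), ("RMSE", rmses), ("PSNR", psnrs)]}
```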

On TQU-512-1450

Pixel Space Performance

| Model   | MAE                      | RMSE                     | NRMSE                        | PSNR                     |
| ------- | ------------------------ | ------------------------ | ---------------------------- | ------------------------ |
| CMBNNCS | $\bf{25.25 \pm 0.29}$    | $\bf{31.69 \pm 0.36}$    | $\bf{0.3039 \pm 0.0040}$     | $\bf{30.25 \pm 0.33}$    |
| CNILC   | $32.28 \pm 0.44$         | $40.52 \pm 0.55$         | $0.3885 \pm 0.0043$          | $28.89 \pm 0.60$         |

Outside Works

CMB-ML was built in the hope that researchers can use it as a standard for comparison. In the future, we hope to add more datasets. If you would like your model or dataset listed, please contact us.

Works using datasets from this repository

None so far!

Errata

February 2025:

  • The repository history was edited to reduce the .git size.
    • The .git information was 300 MB, due to several maps and large Python notebooks.
    • It has been reduced to 21 MB. The bulk of this is images for this README and the demonstration notebooks.

November 2024: New dataset released:

  • The noise generation procedure has been revised to produce non-white noise
  • The detector FWHMs were changed
    • Previously they were sub-pixel
    • They are now larger and still vary
    • More details here
  • The CMB signal generation was changed away from, and then returned to, using CMBLensed
  • Because the work is still unpublished and we do not know of anyone else using it, references to previous datasets have been updated. The original dataset will be removed June 30, 2025, unless we're made aware of anyone using it.

Data File Links

Due to double-blinding, links to CMB-ML files are disabled. Simulations must be recreated. See top of README for more information.

We provide links to the various data used. Alternatives to get this data are in get_data and the Demonstrations. "Science assets" refers to data created by long-standing cosmological surveys.

└─ Datasets
   ├─ Simulations
   |   ├─ Train
   |   |     ├─ sim0000
   |   |     ├─ sim0001
   |   |     └─ etc...
   |   ├─ Valid
   |   |     ├─ sim0000
   |   |     ├─ sim0001
   |   |     └─ etc...
   |   └─ Test
   |         ├─ sim0000
   |         ├─ sim0001
   |         └─ etc...
   └─ Simulation_Working
       ├─ Simulation_B_Noise_Cache
       ├─ Simulation_C_Configs            (containing cosmological parameters)
       └─ Simulation_CMB_Power_Spectra
