Due to double-blinding, download utilities that rely on Box have been disabled. See Blinding Notes for more information.
ZENODO (DOI) BADGE HERE
Contents:
To get started:
- Get this repository
- Set up your Python environment
- Create datasets (Downloading is usually an option; contact the authors of the repository if needed)
- Train models
- Run inference
- Compare results
See Installation and Demonstrations for more detail.
The Cosmic Microwave Background (CMB) signal is one of the cornerstones of modern cosmology. To use it, the signal must be separated from other natural phenomena which either emit microwave signals or alter the CMB signal itself. Modern machine learning and computer vision algorithms are seemingly perfect for the task, but generating the data is cumbersome and no standard public datasets are available. Models and algorithms created for the task are seldom compared outside the largest collaborations.
The CMB-ML dataset bridges the gap between astrophysics and machine learning. It handles simulation, modeling, and analysis.
This is a complex domain. We hope that the structure of CMB-ML lets you focus on a small portion of the pipeline; for many users, we expect this to be the modeling portion. Several examples are presented, showing how different methods can be used to clean the CMB signal. Details are provided below and in ancillary material on how to acquire the dataset, apply a cleaning method, and use the included analysis code.
Other portions of the pipeline may also be changed. Simulated foregrounds can be changed simply with different parameters for the core engine. With more work, alternative or additional components can be used, or the engine itself can be swapped out. A couple of noise models particular to the Planck mission have been developed. At the other end of the pipeline, the analysis can be altered to match different methods. We are currently improving this portion of the pipeline.
A goal of this project has been to encapsulate the various stages of the pipeline separately from the operational parameters. It is our hope that this enables you to easily compare your results with other methods.
Several tools enable this work. Hydra is used to manage the pipeline so that coherent configurations are applied consistently. CMB-ML uses the PySM3 simulation library in conjunction with CAMB, astropy, and Healpy to handle much of the astrophysics. Three baselines are implemented, with more to follow. One baseline comes from astrophysics: PyILC's implementation of the CNILC method. Another uses machine learning: cmbNNCS's UNet8. A third is a simple PyTorch UNet implementation intended to serve as a template. The analysis portion of the pipeline uses a few simple metrics from scikit-learn along with the astrophysics tools.
The real CMB signal is observed at several microwave wavelengths. To mimic this, we make a ground truth CMB map and several contaminant foregrounds. We "observe" these at the different wavelengths, where each foreground has different levels. Then we apply instrumentation effects to get a set of observed maps. The standard dataset is produced at a low resolution, so that many simulations can be used in a reasonable amount of space.
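As a toy illustration of this mock-observation process (the real pipeline uses PySM3, CAMB, astropy, and Healpy; every name, frequency choice, and scaling factor below is hypothetical), the per-frequency mixing can be sketched in plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix = 12 * 64**2            # pixel count of a HEALPix map at nside=64
freqs_ghz = [100, 143, 217]   # hypothetical observing frequencies

cmb = rng.normal(0.0, 100.0, n_pix)   # toy ground-truth CMB map
dust = rng.exponential(50.0, n_pix)   # one toy foreground template

# Each frequency sees the same CMB, but the foreground level differs.
dust_scale = {100: 0.3, 143: 0.5, 217: 1.2}  # hypothetical scalings

observed = {}
for f in freqs_ghz:
    sky = cmb + dust_scale[f] * dust
    noise = rng.normal(0.0, 5.0, n_pix)  # stands in for instrumentation effects
    observed[f] = sky + noise

print(len(observed), observed[217].shape)
```

The real simulations additionally apply frequency-dependent beams and realistic noise models, but the structure (one truth map, several contaminated observations) is the same.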
Two models are included as baselines in this repository. One is a classic astrophysics algorithm, a flavor of internal linear combination methods, which employs cosine needlets (CNILC). The other is a machine learning method (a UNet) implemented and published in the astrophysics domain, CMBNNCS.
The CNILC method was implemented by PyILC, and is described in this paper.
The cmbNNCS method was implemented by cmbNNCS, and is described in this paper.
The third method, the PyTorch implementation of a UNet, is very similar to cmbNNCS and many other published models. Unlike cmbNNCS, it operates on small patches of maps instead of the full sky.
We can compare the CMB predictions to the ground truths in order to determine how well each model works. However, because the models operate in fundamentally different ways, care is needed to ensure that they are compared consistently. We first mask each prediction where the signal is often too bright to get meaningful predictions. We then remove effects of instrumentation from the predictions. The pipeline set up to run each method is then used in a slightly different way, to pull results from each method and produce output which directly compares them. The following figures were produced automatically by the pipeline, for quick review.
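The masked comparison can be sketched as follows; the mask region, map sizes, and noise levels are invented for illustration, and the real pipeline also removes instrumentation effects before scoring:

```python
import numpy as np

rng = np.random.default_rng(1)
truth = rng.normal(0.0, 100.0, 10_000)        # toy ground-truth CMB
pred = truth + rng.normal(0.0, 5.0, 10_000)   # toy model prediction

# Mask pixels near the toy "galactic plane" where foregrounds are too bright.
mask = np.ones_like(truth, dtype=bool)
mask[4_000:6_000] = False                     # hypothetical masked region

# Score only unmasked pixels so all methods are compared on the same footing.
resid = pred[mask] - truth[mask]
mae = np.abs(resid).mean()
rmse = np.sqrt((resid**2).mean())
print(round(float(mae), 2), round(float(rmse), 2))
```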
Other figures are produced of summary statistics, but these are far more boring (for now!).
If you have somehow stumbled upon this and are not a reviewer, please contact us through GitHub and we will gladly redirect you to the fully functional repository being actively developed.
Download utilities for the dataset have been disabled. Download utilities for external science assets still function. Generating noise requires either enabling the noise model creation (in main_sims.py) or downloading the noise model files from this anonymized Google Drive and putting them into the target Datasets/CMB-ML_512_1450/NoiseModel directory. We have checked that downloads of these files cannot be seen by the authors. Nevertheless, logging out of active Google accounts is recommended.
This repository has one set of simulations in the assets folder (not present in this commit due to file size). These can be placed in Datasets/CMB-ML_512_1450/Simulations/Test for confirming the function of PyILC.
More simulations, and both the initial and final cmbNNCS models, are in the anonymous Google Drive. Simulations obtained from there need to be placed in Datasets/CMB-ML_512_1450/Simulations/Test. The final cmbNNCS model should be placed in Datasets/CMB-ML_512_1450/CMBNNCS_UNet8/CMBNNCS_D_Model.
We apologize for the inconvenience.
The rest of the README is largely unchanged from the main repository.
We encourage you to first familiarize yourself with the content of the tutorial notebooks and Hydra. Afterwards, you may want to follow the patterns set in either the classic method or ML method demonstrations. The main difference between these is how much work you want to do within CMB-ML's pipeline; if you already have code that can take input parameters, the patterns for classic methods may be more appropriate.
At this time, the classic method patterns are non-functional suggestions. To see operational code, the PyILC method works (very well!). Please excuse any confusion caused by the hoops which enable us to run it on many simulations at once. Start with the first top-level script, which gets the pipeline through the cleaning process. Then the second top-level script must be run to finish the process. Both of these scripts use the same configuration file; there is simply a conflict in execution due to matplotlib settings.
All of the ML patterns are functional. We suggest using the demonstration network as a prototype. The pipeline overview is in the top-level script. This network operates on patches of sky maps, cut directly from the HEALPix arrangement. Some preprocessing stages are needed to enable fast training. The training and prediction executors follow common PyTorch design patterns (train and predict). Both training and prediction use subclasses of a PyTorch Dataset.
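The patch idea can be sketched without the real preprocessing code. Here a flat array stands in for a HEALPix map, and the non-overlapping reshape is a simplification of the actual patch cutting:

```python
import numpy as np

def extract_patches(flat_map, patch_size):
    """Cut a 1-D map into contiguous, non-overlapping patches.

    The real pipeline cuts patches directly from the HEALPix pixel
    arrangement; this toy version just reshapes a flat array.
    """
    n_patches = flat_map.size // patch_size
    return flat_map[: n_patches * patch_size].reshape(n_patches, patch_size)

toy_map = np.arange(12 * 16**2, dtype=np.float32)  # stand-in for an nside=16 map
patches = extract_patches(toy_map, patch_size=256)
print(patches.shape)  # each row would be one training example for the UNet
```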
As an alternative, see the cmbNNCS top-level script. The executors for this method are very similar to the demonstration network, though some changes are needed in order to adhere to the method described in the paper. It does differ more significantly in the predict stage, as this model predicts entire skymaps in a single operation.
See the next section if you don't want to install CMB-ML and just want the dataset.
Installation of CMB-ML requires setting up the repository, then getting the data assets for the portion you want to run. Demonstrations are available with practical examples. The early ones cover how to set up CMB-ML to run on your system.
Setting up the repository:
- Clone this repository
- Set up the Python environment, using `conda`
  - From within the repository, create a "cmb-ml" environment using the included `env.yaml`
    conda env create -f env.yaml
  - Activate the environment
    conda activate cmb-ml
- Get PyILC
- Simply clone the repository
- No installation is needed; CMB-ML runs the code directly
- This was run and tested with the version from April 30, 2024
- Configure your local system
- In the configuration files, enter the directories where you will keep datasets and science assets
- In pyilc_redir, edit the `__init__.py` file to point to the directory containing your local installation of pyilc (containing the pyilc `input.py` and `wavelets.py`)
- See Setting up your environment for more information
- Download some external science assets and the CMB-ML assets
- External science assets include Planck's observation maps (from which we get information for producing noise) and Planck's NILC prediction map (for the mask; NILC is a parameter)
- These are available from the original sources and a mirror set up for this purpose
- CMB-ML assets include the substitute detector information and information required for downloading datasets
- If you are not creating simulations, you only need one external science asset: "COM_CMB_IQU-nilc_2048_R3.00_full.fits" (for the mask)
- Scripts are available in the `get_data` folder, which will download all files
  - Downloading from original sources gets files from the official sources (and the CMB-ML files from this repo)
  - If you prefer to download fewer files, adjust this executor (not recommended)
- Next, set up to run.
- You will need to either generate simulations or download them.
- Generating the set of simulations takes considerable time, due to the large number.
- Downloading them is likely to be faster.
- When generating simulations for the first time, PySM3 relies on astropy to download and cache template maps.
- These will be stored in an `.astropy` directory
- Downloading templates is sometimes interrupted, resulting in an error and a crash. This is annoying and beyond our control. However, because the templates are cached, the pipeline can be resumed and will proceed smoothly.
- Download CMB_ML_512_1450
- Use the downloading script
python ./get_data/get_dataset.py
- Files are visible at this Box link for CMB_ML_512_1450
- Alternatively, to generate simulations, use
python main_sims.py
- To train, predict, and run analysis with the demonstration UNet model
python main_patch_nn.py
- To train, predict, and run analysis using CMBNNCS
python main_cmbnncs.py
- To predict using PyILC (this must be performed separately from analysis due to import issues)
python main_pyilc_predict.py
- To run analysis for PyILC
python main_pyilc_analysis.py
- To compare results between CMBNNCS and PyILC
python main_analysis_compare.py
The lower-resolution dataset below will run more quickly than the higher-resolution one.
- Download CMB_ML_128_1450:
  - Use the downloading script
    - Change cfg/pipeline/pipe_sim.yaml to use the correct set of shared links. In this yaml, look for `download_sims_reference` and change the `path_template` (replace '512' with '128').
    - Files are visible at this Box link for CMB_ML_128_1450
  - Alternatively, to generate simulations, use
    python main_sims.py dataset_name=CMB_ML_128_1450 nside=128
- Run CMBNNCS on CMB_ML_128_1450 (the smaller UNet5 must be used):
python main_cmbnncs.py dataset_name=CMB_ML_128_1450 working_dir=CMBNNCS_UNet5/ nside=128 num_epochs=2 use_epochs=[2] model/cmbnncs/network=unet5
- Run PyILC on CMB_ML_128_1450:
python main_pyilc_predict.py dataset_name=CMB_ML_128_1450 nside=128 ELLMAX=382 model.pyilc.distinct.N_scales=5 model.pyilc.distinct.ellpeaks=[100,200,300,383]
python main_pyilc_analysis.py dataset_name=CMB_ML_128_1450 nside=128 ELLMAX=382 model.pyilc.distinct.N_scales=5 model.pyilc.distinct.ellpeaks=[100,200,300,383]
- An even faster method is available, using PyILC's HILC method.
- Run Comparison:
python main_analysis_compare.py --config-name config_comp_models_t_128
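The `key=value` arguments in the commands above are Hydra command-line overrides. As a rough dict-based sketch of the override semantics (not the actual Hydra/OmegaConf machinery, which also supports nested keys and config groups):

```python
# Toy illustration of Hydra-style "key=value" command-line overrides.
def apply_overrides(cfg, overrides):
    out = dict(cfg)
    for ov in overrides:
        key, _, value = ov.partition("=")
        # Keep it simple: integers parse as ints, everything else stays a string.
        out[key] = int(value) if value.isdigit() else value
    return out

defaults = {"dataset_name": "CMB_ML_512_1450", "nside": 512}
cfg = apply_overrides(defaults, ["dataset_name=CMB_ML_128_1450", "nside=128"])
print(cfg)
```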
If you only want the dataset, you can use this notebook to download it. It includes a (short) list of required libraries.
CMB-ML manages a complex pipeline that processes data across multiple stages. Each stage produces outputs that need to be tracked, reused, and processed in later stages. Without a clear framework, this can lead to disorganized code, redundant logic, and errors.
The CMB-ML library provides a set of tools to manage the pipeline in a modular and scalable way.
We include a set of demonstrations to help with both installation and introduction to core concepts. The first introduces our approach to configuration management. That background paves the way to set up a local configuration and get the required files. Following this are a series of tutorials for the Python objects.
Most of these are in jupyter notebooks:
- Hydra and its use in CMB-ML
- Hydra in scripts (*.py files)
- Setting up your environment
- Getting and looking at simulation instances
- CMB_ML framework: stage code
- CMB_ML framework: pipeline code
- CMB_ML framework: Executors
Only Setting up your environment is truly critical, though the others should help.
I'm interested in hearing what other demonstrations would be helpful; please let me know. I've considered these notebooks:
- Executors, continued: showing how executors are set up for PyTorch training/inference and matplotlib figure production
- Looking at actual pipeline stages and explaining them
- Paper figure production (available in another repository; needs cleaning)
Below is a list of the best results on the dataset. Please contact us through this repository to have your results listed. We do ask for the ability to verify those results.
We list below each model's aggregated performance across the Test split. We first calculate each metric for each simulation; the tables below contain the average of those values for each metric. The metrics currently implemented are Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Normalized Root Mean Square Error (NRMSE), and Peak Signal-to-Noise Ratio (PSNR). The first three give a general sense of precision; PSNR gives a worst-instance measure.
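The aggregation scheme (per-simulation metrics, then averages across the split) can be sketched with NumPy. The NRMSE normalization and the PSNR peak definition below are assumptions for illustration, not necessarily the pipeline's conventions:

```python
import numpy as np

def metrics(truth, pred):
    resid = pred - truth
    mae = np.abs(resid).mean()
    rmse = np.sqrt((resid**2).mean())
    nrmse = rmse / truth.std()            # assumed normalization convention
    peak = truth.max() - truth.min()      # assumed PSNR "peak" definition
    psnr = 20 * np.log10(peak / rmse)
    return {"MAE": mae, "RMSE": rmse, "NRMSE": nrmse, "PSNR": psnr}

rng = np.random.default_rng(2)
truth = rng.normal(0.0, 100.0, (5, 1_000))           # 5 toy "simulations"
preds = truth + rng.normal(0.0, 5.0, truth.shape)

# Per-simulation metrics first, then average across the split.
per_sim = [metrics(t, p) for t, p in zip(truth, preds)]
avg = {k: float(np.mean([m[k] for m in per_sim])) for k in per_sim[0]}
print(sorted(avg))
```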
| Model   | MAE | RMSE | NRMSE | PSNR |
|---------|-----|------|-------|------|
| CMBNNCS |     |      |       |      |
| CNILC   |     |      |       |      |
CMB-ML was built in the hopes that researchers can compare on this as a standard. In the future, we hope to add more datasets. If you would like your model or dataset listed, please contact us.
None so far!
February 2025:
- The repository history was edited to reduce the `.git` size.
  - The `.git` information was 300 MB, due to several maps and large Python notebooks.
  - It has been reduced to 21 MB. The bulk of this is images for this README and the demonstration notebooks.
November 2024: New dataset released:
- The noise generation procedure has been revised to produce non-white noise
- The detector FWHMs were changed
- Previously they were sub-pixel
- They are now larger and still vary
- More details here
- The CMB signal generation was changed away from, and then returned to, using CMBLensed
- Because the work is still unpublished and we do not know of anyone else using it, references to previous datasets have been updated. The original dataset will be removed June 30, 2025, unless we're made aware of anyone using it.
Due to double-blinding, links to CMB-ML files are disabled. Simulations must be recreated. See top of README for more information.
We provide links to the various data used. Alternatives for getting this data are in `get_data` and the Demonstrations. "Science assets" refers to data created by long-standing cosmological surveys.
- Science assets
- From the source
- Planck Maps
- Planck Collaboration observation maps include variance maps needed for noise generation:
- Planck Collaboration Observation at 30 GHz
- Planck Collaboration Observation at 44 GHz
- Planck Collaboration Observation at 70 GHz
- Planck Collaboration Observation at 100 GHz
- Planck Collaboration Observation at 143 GHz
- Planck Collaboration Observation at 217 GHz
- Planck Collaboration Observation at 353 GHz
- Planck Collaboration Observation at 545 GHz
- Planck Collaboration Observation at 857 GHz
- For the mask:
- WMAP9 chains for CMB simulation:
- Planck delta bandpass table:
- CMB-ML delta bandpass table:
- Original delta bandpass table, from Simons Observatory
- CMB-ML modifies these instrumentation properties
- Simply move the CMB-ML directory contained in assets/delta_bandpasses to your assets folder (as defined in, e.g., your local_system config)
- Downloading script
- Planck Collaboration observation maps include variance maps needed for noise generation:
- Planck Maps
- On Box:
- All Science Assets
- The script will be replaced if needed. Please send a message if so.
- From the source
- Datasets
- CMB_ML_512_1450
- Individual files: Box Link, CMB_ML_512_1450
- Each simulation instance is in its own tar file and will need to be extracted before use
- The power spectra and cosmological parameters are in Simulation_Working.tar.gz
- Log files, including the exact code used to generate simulations, are in Logs.tar.gz. No changes of substance have been made to the code in this archive.
- A script for these downloads is available here
- Individual files: Box Link, CMB_ML_512_1450
- CMB_ML_128_1450
  - Lower resolution simulations ($\text{N}_\text{side}=128$), for use when testing code and models
  - Individual instance files: Box Link, CMB_ML_128_1450
  - A script for these downloads is available here
    - Change cfg/pipeline/pipe_sim.yaml to use the correct set of shared links. In this yaml, look for `download_sims_reference` and change the `path_template` (replace '512' with '128').
- Files are expected to be in the following folder structure; any other structure requires changes to the pipeline yamls:
└─ Datasets
├─ Simulations
| ├─ Train
| | ├─ sim0000
| | ├─ sim0001
| | └─ etc...
| ├─ Valid
| | ├─ sim0000
| | ├─ sim0001
| | └─ etc...
| └─ Test
| ├─ sim0000
| ├─ sim0001
| └─ etc...
└─ Simulation_Working
├─ Simulation_B_Noise_Cache
├─ Simulation_C_Configs (containing cosmological parameters)
└─ Simulation_CMB_Power_Spectra
- Trained models
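To pre-create or sanity-check the folder layout above, a small pathlib sketch (the dataset root path and the number of simulations are placeholders):

```python
from pathlib import Path
import tempfile

# Placeholder root; in practice this comes from your local_system config.
root = Path(tempfile.mkdtemp()) / "Datasets"

for split in ["Train", "Valid", "Test"]:
    for i in range(2):  # sim0000, sim0001, ... (placeholder count)
        (root / "Simulations" / split / f"sim{i:04d}").mkdir(parents=True)

for sub in ["Simulation_B_Noise_Cache",
            "Simulation_C_Configs",
            "Simulation_CMB_Power_Spectra"]:
    (root / "Simulation_Working" / sub).mkdir(parents=True)

print(sorted(p.name for p in (root / "Simulations").iterdir()))
```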