Godel-Space/MultimodalUniverse

Multimodal Universe: Enabling Large-Scale Machine Learning with 100TBs of Astronomical Scientific Data

Dataset on Hugging Face | NeurIPS | arXiv | Demo on Colab | Test | License: MIT | All Contributors

Overview

The Multimodal Universe dataset is a large-scale collection of multimodal astronomical data, including images, spectra, and light curves, which aims to enable research into foundation models for astrophysics and beyond.

Quick Start

All datasets can be previewed directly from our HuggingFace hub and accessed via load_dataset('MultimodalUniverse/dataset_name')! Preview datasets include ~1k examples from each survey.

from datasets import load_dataset

dset = load_dataset('MultimodalUniverse/plasticc', 
                    split='train', streaming=True)

example = next(iter(dset))

You can try this out with our getting started notebook!
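
Because `streaming=True` yields examples lazily, it can help to look at only the first few records before committing to a full download. The sketch below works on any iterable of dict-like examples. The helpers `peek` and `describe`, and the stand-in generator with its field names, are illustrative assumptions and not part of the Multimodal Universe API.

```python
from itertools import islice


def peek(dataset, n=3):
    """Return the first n examples of an iterable without reading the rest."""
    return list(islice(dataset, n))


def describe(example):
    """Map each field name of one example to the type name of its value."""
    return {key: type(value).__name__ for key, value in example.items()}


def fake_stream():
    """Stand-in for a streamed dataset; the field names here are hypothetical."""
    for i in range(1000):
        yield {"object_id": str(i), "lightcurve": {"time": [0.0, 1.0], "flux": [1.0, 2.0]}}


first = peek(fake_stream(), n=2)
print(len(first))
print(describe(first[0]))
```

With a real streamed dataset, `peek(dset)` can be called the same way, since it relies only on iteration.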

Data Access

To access the full dataset, we recommend downloading the data locally.

The full dataset is hosted at the Flatiron Institute and is available through either HTTPS or GLOBUS.

GLOBUS is strongly preferred when downloading large amounts of data or a large number of files. A local download of the full data in its native HDF5 format is required to use the provided cross-matching utilities.

After downloading the data, you can use Hugging Face's datasets library to load the data directly from your local copy. For example, to load the PLAsTiCC dataset:

from datasets import load_dataset

dset = load_dataset('path/to/downloaded/plasticc', 
                    split='train', streaming=True)
dset = dset.with_format('numpy')

example = next(iter(dset))
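
Before calling `load_dataset` on a local copy, it can be worth checking that the expected directory is actually present. The helper below is a minimal sketch that assumes each survey sits in its own sub-directory of the download root, as the `path/to/downloaded/plasticc` example suggests. The name `local_dataset_path` and that directory layout are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def local_dataset_path(root, name):
    """Return the path of a dataset directory under root, or raise if it is missing."""
    path = Path(root) / name
    if not path.is_dir():
        raise FileNotFoundError(f"No local copy of '{name}' found under {root}")
    return str(path)


# Demonstration against a temporary directory standing in for the download root.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "plasticc").mkdir()
    resolved = local_dataset_path(root, "plasticc")
    print(resolved.endswith("plasticc"))
```

The returned string can then be passed as the first argument to `load_dataset`.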

Datasets

The Multimodal Universe currently contains data from the following surveys/modalities:

| Survey | Modality | Science Use Case | # samples |
| --- | --- | --- | --- |
| Legacy Surveys DR10 | Images | Galaxies | 124M |
| Legacy Surveys North | Images | Galaxies | 15M |
| HSC | Images | Galaxies | 477k |
| BTS | Images | Supernovae | 400k |
| JWST | Images | Galaxies | 300k |
| Gaia BP/RP | Spectra | Stars | 220M |
| SDSS-II | Spectra | Galaxies, Stars | 4M |
| DESI | Spectra | Galaxies | 1M |
| APOGEE SDSS-III | Spectra | Stars | 716k |
| GALAH | Spectra | Stars | 325k |
| Chandra | Spectra | Galaxies, Stars | 129k |
| VIPERS | Spectra | Galaxies | 91k |
| MaNGA SDSS-IV | Hyperspectral Image | Galaxies | 12k |
| PLAsTiCC | Time Series | Time-varying objects | 3.5M |
| TESS | Time Series | Exoplanets | 160k |
| CfA Sample | Time Series | Supernovae | 1k |
| YSE | Time Series | Supernovae | 2k |
| PS1 SNe Ia | Time Series | Supernovae | 369 |
| DES Y3 SNe Ia | Time Series | Supernovae | 248 |
| SNLS | Time Series | Supernovae | 239 |
| Foundation | Time Series | Supernovae | 180 |
| CSP SNe Ia | Time Series | Supernovae | 134 |
| Swift SNe Ia | Time Series | Supernovae | 117 |
| Gaia | Tabular | Stars | 220M |
| PROVABGS | Tabular | Galaxies | 221k |
| Galaxy10 DECaLS | Tabular | Galaxies | 15k |

We are accepting new datasets! Check out our contribution guidelines for more details.

Data License

We openly distribute the Multimodal Universe dataset under the Creative Commons Attribution (CC BY) 4.0 license. When using specific subsets, however, the license and conditions of utilisation of those subsets should be respected.

Architecture

Illustration of the methodology behind the Multimodal Universe. Domain scientists with expertise in a given astronomical survey provide data download and formatting scripts through Pull Requests. All datasets are then downloaded from their original source and made available as Hugging Face datasets sharing a common data schema for each modality and associated metadata. End users can then combine any subsets using the provided cross-matching utilities to build multimodal datasets.

Please see the Design Document for more context about the project.
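
To make the idea of cross-matching concrete, here is a minimal sketch of positional matching between two catalogs: for each object in one catalog, find the nearest object in the other within a small angular radius. This is not the project's cross-matching utility. It is a brute-force illustration, and the function names, the `ra`/`dec` field names (in degrees), and the 1 arcsecond default radius are all assumptions.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2)
    sin_dra = math.sin((ra2 - ra1) / 2)
    a = sin_ddec**2 + math.cos(dec1) * math.cos(dec2) * sin_dra**2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(left, right, radius_arcsec=1.0):
    """Pair each object in `left` with its nearest neighbour in `right`.

    Returns a list of (left_index, right_index) tuples for pairs separated by
    at most `radius_arcsec`. Brute force, O(len(left) * len(right)).
    """
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for i, a in enumerate(left):
        best_j, best_sep = None, radius_deg
        for j, b in enumerate(right):
            sep = angular_separation_deg(a["ra"], a["dec"], b["ra"], b["dec"])
            if sep <= best_sep:
                best_j, best_sep = j, sep
        if best_j is not None:
            matches.append((i, best_j))
    return matches


# Two tiny hypothetical catalogs, e.g. one of images and one of spectra.
images = [{"ra": 10.0, "dec": 20.0}, {"ra": 150.0, "dec": -5.0}]
spectra = [{"ra": 150.0001, "dec": -5.0}, {"ra": 10.0, "dec": 20.0002}, {"ra": 200.0, "dec": 0.0}]
pairs = cross_match(images, spectra, radius_arcsec=1.0)
print(pairs)
```

For catalogs of realistic size, a spatial index (for example a k-d tree or a HEALPix-based partition) would replace the double loop; the matching criterion stays the same.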

Citations & Acknowledgements

If you make use of all or part of the Multimodal Universe dataset, please cite the individual datasets accordingly. The relevant BibTeX citations and text acknowledgement instructions can be generated with the info.py script (run python scripts/info.py --help for details).

The script retrieves either all of the dataset information or just the acknowledgement and citation information, for some or all datasets. If no dataset is specified, information for all datasets is returned. If neither --cite nor --acknowledge is specified, all of the information is returned (including license, homepage, etc.).

python scripts/info.py --cite --data <datasets>
python scripts/info.py --acknowledge --data <datasets>

For example, to get the citations for the APOGEE and SDSS datasets and save them to info_citation.bib, run:

python scripts/info.py --cite --data apogee sdss -o info_citation.bib

To get all citations and acknowledgements, run:

python scripts/info.py --cite --acknowledge

You can always specify an output file for easy transcription to your bibliography or acknowledgements section with the --output flag:

python scripts/info.py --cite --output full_citations.txt
python scripts/info.py --acknowledge --output full_acknowledgements.txt

Acknowledgement instructions are returned alongside citations to encourage attribution. The acknowledgement lines are commented with % to make the citations easy to add to your bibliography.
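
Since acknowledgement lines are prefixed with `%`, the two kinds of content can be separated again after the fact. The sketch below splits such text into acknowledgement sentences and the remaining BibTeX. The function name and the sample text are hypothetical; the real output of `info.py` may be laid out differently.

```python
def split_info_output(text):
    """Separate %-commented acknowledgement lines from the remaining BibTeX text."""
    acknowledgements, bibtex = [], []
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("%"):
            # Drop the leading comment marker(s) and surrounding whitespace.
            acknowledgements.append(stripped.lstrip("%").strip())
        else:
            bibtex.append(line)
    return "\n".join(acknowledgements), "\n".join(bibtex).strip()


# Placeholder text in the style described above, not real script output.
sample = """% Please acknowledge the Example Survey collaboration.
@article{example2024,
  title = {A Placeholder Entry},
  year = {2024}
}
"""

ack, bib = split_info_output(sample)
print(ack)
print(bib)
```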

Contributors

Full Contribution List

Francois Lanusse 📆 💡 💻
Liam Parker 📆 💡 💻
Micah Bowles 📆 💡 💻
mhuertascompany 📆 💡 💻
Mike Smith 📆 💡 💻
Helen Qu 📆 💡 💻
Aaron 💡 💻
Ben Boyd 💡 💻
Brian Cherinka 💻
Connor Stone, PhD 💡
David Chemaly 💡 💻
Erin Hayes 💡 💻
Henry Leung 💻
Ioana Ciucă 🖋
Jeff Shen 💻
jeraud 💡 💻
John F. Wu 🖋
CambridgeAstroStat 🧑‍🏫
Kartheik Iyer 💻
Lucas Meyer 💻
Matthew Grayling 💡 💻
Maja Jabłońska 💻
Mike Walmsley 💡 💻
Miles Cranmer 🖋
Peter Melchior 💻
Rafael Martínez-Galarza 💻
Tom Hehir 💡 💻
Shirley Ho 🔍 🖋
Mariel Pettee 🤔
