The LINCC Frameworks team has been leading the effort to create interoperable HATS catalogs, usable by the LSDB analytics library. DASH is our attempt to take LSST observations that have been processed by the Rubin DM DRP (Data Management's Data Release Production pipeline) into catalogs of objects and sources, and collate them into HATS-partitioned parquet files on USDF.
This is not a simple copy of the files, however, as we add value along the way.
There are three primary output tables:

- `dia_object_lc` - Combines data from three Butler tables: `dia_object`, `dia_source`, and `dia_object_forced_source`. "Nests" the `dia_source` and `dia_object_forced_source` tables as structured lists under each object row. Sources within an object are sorted by detection time.
- `object_lc` - Combines data from two Butler tables: `object` and `object_forced_source`. "Nests" the `object_forced_source` table as structured lists under each object row. Sources within an object are sorted by detection time.
- `source` - Contains data from the Butler table `source`. It is a flat table with the sources detected in science imaging. It is an independent table because there is no association between objects of the `object` table and the existing sources.
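The nesting step above can be sketched with plain pandas (a simplified stand-in for the nested-pandas structures actually used; table and column names here are illustrative):

```python
import pandas as pd

# Toy object and source tables (illustrative column names).
objects = pd.DataFrame({"objectId": [1, 2], "ra": [10.0, 20.0], "dec": [-5.0, 5.0]})
sources = pd.DataFrame({
    "objectId": [1, 1, 2],
    "midpointTai": [60001.2, 60000.1, 60002.3],  # detection time (MJD)
    "psfFlux": [100.0, 110.0, 90.0],
})

# Sort sources by detection time, then collect them as a list column
# under each object row -- the "nested" light-curve layout.
nested = (
    sources.sort_values("midpointTai")
    .groupby("objectId")[["midpointTai", "psfFlux"]]
    .apply(lambda g: g.to_dict("records"))
    .rename("dia_source")
)
object_lc = objects.join(nested, on="objectId")
```

Each row of `object_lc` now carries its full, time-ordered light curve, which is the layout LSDB and nested-pandas exploit for per-object analysis.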
We read the catalog data from the same parquet files that Butler uses. We add value as we're reading those files, and in post-processing shortly after.
- `object` table: limit to only ~100 columns (from original TODO how many?).
- Remove any rows that have NaN or null ra/dec (or coord_ra/coord_dec) values.
- Remove any rows from object/source/dia_object_forced_source that are not primary detections.
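These row filters can be sketched as follows (a simplified version; the coordinate column names and the primary-detection flag vary per table, and `detect_isPrimary` is used here as an assumed flag name):

```python
import numpy as np
import pandas as pd

def clean_rows(df: pd.DataFrame, ra_col: str = "ra", dec_col: str = "dec") -> pd.DataFrame:
    """Drop rows with null/NaN coordinates, then drop non-primary detections."""
    df = df.dropna(subset=[ra_col, dec_col])
    if "detect_isPrimary" in df.columns:
        df = df[df["detect_isPrimary"]]
    return df

table = pd.DataFrame({
    "ra": [10.0, np.nan, 30.0],
    "dec": [1.0, 2.0, 3.0],
    "detect_isPrimary": [True, True, False],
})
clean = clean_rows(table)  # keeps only the first row
```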
- We append columns to the results that correspond to Butler dimensions.
In practice, this means that we retain information on tract and patch that
is otherwise neither stored in Butler's backing parquet files nor typically
returned in a call to get a single data table from the Butler (but is inferable
from the Butler data reference).
- dia_object: ["tract"]
- dia_source: ["tract"]
- dia_object_forced_source: ["patch", "tract"]
- object: ["patch", "tract"]
- source: ["band","day_obs","physical_filter","visit"]
- object_forced_source: ["patch", "tract"]
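Appending these dimensions amounts to a constant-column assignment per data reference; a sketch (the mapping mirrors the list above, while the Butler read itself and the `data_id` dict are elided/hypothetical):

```python
import pandas as pd

# Dimensions appended to each table (from the list above).
DIMENSION_COLUMNS = {
    "dia_object": ["tract"],
    "dia_source": ["tract"],
    "dia_object_forced_source": ["patch", "tract"],
    "object": ["patch", "tract"],
    "source": ["band", "day_obs", "physical_filter", "visit"],
    "object_forced_source": ["patch", "tract"],
}

def append_dimensions(df: pd.DataFrame, table: str, data_id: dict) -> pd.DataFrame:
    """Attach dimension values from a data reference as constant columns."""
    for dim in DIMENSION_COLUMNS[table]:
        df[dim] = data_id[dim]
    return df

chunk = pd.DataFrame({"objectId": [1, 2, 3]})
chunk = append_dimensions(chunk, "object", {"tract": 453, "patch": 17})
```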
- For all flux (and fluxErr) columns, which are in nJy, use the known zero-point to calculate
the corresponding mag (and magErr):
- dia_source: ["psf", "science"]
- dia_object_forced_source: ["psf"]
- object: ['u_psf', 'u_kron', 'g_psf', 'g_kron', 'r_psf', 'r_kron', 'i_psf', 'i_kron', 'z_psf', 'z_kron', 'y_psf', 'y_kron']
- source: ["psf"]
- object_forced_source: ["psf"]
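For fluxes in nJy, the AB zero point is 31.4 mag (a 1 nJy source has AB magnitude 31.4), and the magnitude error follows from first-order error propagation. A sketch with illustrative column names:

```python
import numpy as np
import pandas as pd

AB_ZEROPOINT_NJY = 31.4  # AB magnitude of a 1 nJy source

def add_mag_columns(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Derive <prefix>Mag / <prefix>MagErr from <prefix>Flux / <prefix>FluxErr (nJy)."""
    flux, flux_err = df[f"{prefix}Flux"], df[f"{prefix}FluxErr"]
    df[f"{prefix}Mag"] = -2.5 * np.log10(flux) + AB_ZEROPOINT_NJY
    # First-order propagation: dm = (2.5 / ln 10) * dF / F
    df[f"{prefix}MagErr"] = (2.5 / np.log(10)) * flux_err / flux
    return df

cat = pd.DataFrame({"psfFlux": [1.0, 3631e9], "psfFluxErr": [0.1, 1e8]})
cat = add_mag_columns(cat, "psf")
```

The second row is the AB reference flux (3631 Jy), so its magnitude comes out at (approximately) zero.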
- Add midpointTai in MJD to source-like tables, where only
`visitId` is stored (source, dia_object_forced_source, object_forced_source). We do this with an offline join to the `visit` table.
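The offline join described above can be sketched as a pandas merge (column names are illustrative; the real visit table carries the mid-exposure time):

```python
import pandas as pd

forced = pd.DataFrame({"objectId": [1, 2], "visitId": [101, 102], "psfFlux": [5.0, 6.0]})
visits = pd.DataFrame({"visitId": [101, 102], "midpointTai": [60010.5, 60011.5]})

# Left join so every source row gains its visit's mid-exposure time (MJD).
forced = forced.merge(visits[["visitId", "midpointTai"]], on="visitId", how="left")
```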
Once catalogs have been created, we perform some science validation on the catalog data. This is split into two notebooks:
- Basic Statistics
- Gathers global min and max values for all columns, and reports on any that are outside acceptable ranges (e.g. ra values must be in (0.0, 360.0))
- Counts the number of null or nans in all columns
- Prints other basic counts, like number of rows, number of columns, number of partition files, and sky coverage in % of the sky.
- By Field
- Performs more detailed verification on the datasets, using LSDB to inspect leaf parquet files, for the six fields observed in the ComCam data (ECDFS, EDFS, RubinSV_38_7, Rubin_SV_95-25, 47_Tuc, Fornax_dSph)
- Compute weighted mean, grouping by field and band, for all four source types
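The weighted-mean check can be sketched as an inverse-variance weighted mean grouped by field and band (the weighting scheme and column names are assumptions for illustration):

```python
import pandas as pd

sources = pd.DataFrame({
    "field": ["ECDFS", "ECDFS", "ECDFS", "Fornax_dSph"],
    "band": ["g", "g", "r", "g"],
    "psfFlux": [10.0, 20.0, 5.0, 8.0],
    "psfFluxErr": [1.0, 2.0, 1.0, 1.0],
})

# Inverse-variance weights, then sum weight and weighted flux per (field, band).
sources["w"] = 1.0 / sources["psfFluxErr"] ** 2
sources["wf"] = sources["psfFlux"] * sources["w"]
sums = sources.groupby(["field", "band"])[["wf", "w"]].sum()
weighted_mean = sums["wf"] / sums["w"]
```

In the real notebook this runs per source type over the leaf parquet files; the grouped-sum formulation avoids materializing per-group DataFrames.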
| Stage name | Runtime (HH:MM:SS) |
|---|---|
| 01-Butler.ipynb | 00:03:23 |
| 02-Raw_file_sizes.ipynb | 00:00:17 |
| 03-Import.ipynb | 00:07:05 |
| 04-Post_processing.ipynb | 00:01:46 |
| 05-Nesting.ipynb | 00:02:49 |
| 06.a-Basic_Statistics.ipynb | 00:00:13 |
| 06.b-ByField.ipynb | 00:02:51 |
| 07-Crossmatch_ZTF_PS1.ipynb | 00:02:39 |
| 08-Generate_margins.ipynb | 00:01:18 |
| 09-Generate_index_catalogs.ipynb | 00:00:32 |
| 10-Generate_weekly_JSON.ipynb | 00:00:15 |
| Total runtime | 00:23:08 |
- LSDB (on GitHub) (on ReadTheDocs) (on arXiv)
- HATS (on GitHub) (on ReadTheDocs)
- nested-dask (on GitHub) (on ReadTheDocs)
- nested-pandas (on GitHub) (on ReadTheDocs)
Other useful material: