[RFC] Improve data reformatting pipeline #86

tallamjr · 2022-03-02T13:18:48Z

Currently, the pipeline that reformats the raw PLAsTiCC data from .csv into the layout that astronet expects is slow and uses many gigabytes of intermediate memory because of limitations in pandas (see the layout example below).

>>> df = pd.read_csv("./data/plasticc/training_set.csv")
>>> df.head()
   object_id         mjd  passband        flux   flux_err  detected
0        615  59750.4229         2 -544.810303   3.622952         1
1        615  59750.4306         1 -816.434326   5.553370         1
2        615  59750.4383         3 -471.385529   3.801213         1
3        615  59750.4450         4 -388.984985  11.395031         1
4        615  59752.4070         2 -681.858887   4.041204         1
>>> adf = pd.read_parquet("./data/plasticc/transformed_df_timesteps_100_with_z.parquet")
>>> adf.head()
        mjd     lsstg      lssti      lsstr  ...  object_id  hostgal_photoz  hostgal_photoz_err  target
0  0.000000  2.188022  32.456174  20.774066  ...        730          0.2262              0.0157      42
1  1.210748  2.111767  32.148651  20.629250  ...        730          0.2262              0.0157      42
2  2.421497  2.040803  31.830852  20.477900  ...        730          0.2262              0.0157      42
3  3.632245  1.973938  31.501954  20.320099  ...        730          0.2262              0.0157      42
4  4.842994  1.910180  31.160808  20.156196  ...        730          0.2262              0.0157      42

[5 rows x 11 columns]
>>> df[df["object_id"] == 730].head()
     object_id         mjd  passband       flux  flux_err  detected
702        730  59798.3205         2   1.177371  1.364300         0
703        730  59798.3281         1   2.320849  1.159247         0
704        730  59798.3357         3   2.939447  1.771328         0
705        730  59798.3466         4   2.128097  2.610659         0
706        730  59798.3576         5 -12.809639  5.380097         0
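
To make the change of layout concrete, here is a minimal pandas sketch of the long-to-wide step, using the five rows for object 730 shown above. It is only an illustration: `long_to_wide` and `PASSBAND_NAMES` are hypothetical names, not astronet code, and the real transform appears to do more (the mjd column in the parquet output is rebased and evenly spaced, and redshift and target columns are joined in), none of which is attempted here.

```python
import pandas as pd

# PLAsTiCC passband indices 0-5 correspond to the LSST u, g, r, i, z, y filters.
# The "lsst*" names mirror the column names seen in the parquet output above.
PASSBAND_NAMES = {0: "lsstu", 1: "lsstg", 2: "lsstr", 3: "lssti", 4: "lsstz", 5: "lssty"}


def long_to_wide(df: pd.DataFrame) -> pd.DataFrame:
    """Pivot long-format observations into one flux column per passband.

    Hypothetical helper for illustration; no interpolation or resampling is done.
    """
    named = df.assign(band=df["passband"].map(PASSBAND_NAMES))
    wide = named.pivot_table(
        index=["object_id", "mjd"], columns="band", values="flux", aggfunc="first"
    )
    wide.columns.name = None  # drop the "band" label left over from the pivot
    return wide.reset_index()


# The five raw rows for object 730 from the session above.
raw = pd.DataFrame(
    {
        "object_id": [730] * 5,
        "mjd": [59798.3205, 59798.3281, 59798.3357, 59798.3466, 59798.3576],
        "passband": [2, 1, 3, 4, 5],
        "flux": [1.177371, 2.320849, 2.939447, 2.128097, -12.809639],
        "flux_err": [1.364300, 1.159247, 1.771328, 2.610659, 5.380097],
        "detected": [0] * 5,
    }
)

wide = long_to_wide(raw)
print(wide)
```

Because each raw observation covers a single passband at a single time, the pivoted frame is mostly NaN, which is presumably why the real pipeline resamples every band onto a shared time grid.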

As a result, reproducing and testing parts of the data reformatting pipeline can only be done on the cluster, and even there it takes a long time.

This issue is a placeholder for investigating this further and for evaluating alternatives to pandas as the data manipulation tool. It would still be useful to keep pandas as the final DataFrame component for its interoperability with other libraries.

However, by leveraging newer standards, in particular Apache Arrow, the end-to-end processing time for the raw data could be reduced dramatically.

The front-runner for this is polars, which has been shown to be well suited to this kind of workload.

Refs:

tallamjr added the "2 - enhancement" (a request or update to existing functionality) and "4 - back-burner" (item on hold, low priority) labels on Mar 2, 2022

tallamjr self-assigned this Mar 2, 2022

tallamjr mentioned this issue on Mar 2, 2022 in [META] Road to v1.0.0 #87 (open, 2 tasks)

tallamjr pinned this issue Jun 5, 2022
