[RFC] Improve data reformatting pipeline #86

Open
tallamjr opened this issue Mar 2, 2022 · 0 comments
Labels: 2 - enhancement (a request or update to existing functionality), 4 - back-burner (item on hold, low priority)

tallamjr commented Mar 2, 2022

Currently, the pipeline that reformats the raw PLAsTiCC data from .csv into the layout astronet expects is slow and uses many gigabytes of intermediate memory because of limitations in pandas (see the layout example below).

>>> import pandas as pd
>>> df = pd.read_csv("./data/plasticc/training_set.csv")
>>> df.head()
   object_id         mjd  passband        flux   flux_err  detected
0        615  59750.4229         2 -544.810303   3.622952         1
1        615  59750.4306         1 -816.434326   5.553370         1
2        615  59750.4383         3 -471.385529   3.801213         1
3        615  59750.4450         4 -388.984985  11.395031         1
4        615  59752.4070         2 -681.858887   4.041204         1
>>> adf = pd.read_parquet("./data/plasticc/transformed_df_timesteps_100_with_z.parquet")
>>> adf.head()
        mjd     lsstg      lssti      lsstr  ...  object_id  hostgal_photoz  hostgal_photoz_err  target
0  0.000000  2.188022  32.456174  20.774066  ...        730          0.2262              0.0157      42
1  1.210748  2.111767  32.148651  20.629250  ...        730          0.2262              0.0157      42
2  2.421497  2.040803  31.830852  20.477900  ...        730          0.2262              0.0157      42
3  3.632245  1.973938  31.501954  20.320099  ...        730          0.2262              0.0157      42
4  4.842994  1.910180  31.160808  20.156196  ...        730          0.2262              0.0157      42

[5 rows x 11 columns]
>>> df[df["object_id"] == 730].head()
     object_id         mjd  passband       flux  flux_err  detected
702        730  59798.3205         2   1.177371  1.364300         0
703        730  59798.3281         1   2.320849  1.159247         0
704        730  59798.3357         3   2.939447  1.771328         0
705        730  59798.3466         4   2.128097  2.610659         0
706        730  59798.3576         5 -12.809639  5.380097         0
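
To make the difference between the two layouts concrete, here is a minimal sketch of the long-to-wide reshape they imply, using pandas on the rows for object 730 shown above. It covers only the pivot step: it does not reproduce the resampling onto 100 regular timesteps or the redshift and target columns seen in the parquet file. The helper name `long_to_wide` is hypothetical, and the `lsstu` ... `lssty` column names are an assumption extrapolated from the truncated `adf.head()` output.

```python
import pandas as pd

# PLAsTiCC passband indices 0-5 correspond to the LSST u, g, r, i, z, y filters.
# The "lsst*" column names are assumed from the truncated parquet output above.
PASSBAND_NAMES = {0: "lsstu", 1: "lsstg", 2: "lsstr", 3: "lssti", 4: "lsstz", 5: "lssty"}


def long_to_wide(df: pd.DataFrame) -> pd.DataFrame:
    """Pivot long-format observations into one flux column per passband.

    Hypothetical helper for illustration; not astronet's actual pipeline.
    """
    wide = df.pivot_table(
        index=["object_id", "mjd"],
        columns="passband",
        values="flux",
        aggfunc="first",
    )
    wide = wide.rename(columns=PASSBAND_NAMES).reset_index()
    wide.columns.name = None  # drop the leftover "passband" axis label
    return wide


# The five raw rows for object 730 from the example above.
raw = pd.DataFrame(
    {
        "object_id": [730] * 5,
        "mjd": [59798.3205, 59798.3281, 59798.3357, 59798.3466, 59798.3576],
        "passband": [2, 1, 3, 4, 5],
        "flux": [1.177371, 2.320849, 2.939447, 2.128097, -12.809639],
    }
)
wide = long_to_wide(raw)
print(wide)
```

Each observation time has a flux in only one passband, so the pivoted frame is mostly NaN. That sparsity is why the raw rows have to be resampled onto a common time grid before they can resemble the transformed layout.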

Because the pipeline is so slow and memory-hungry, reproducing and testing aspects of the data reformatting can only be done on the cluster, and even there it takes a long time.

This issue is a placeholder for investigating this further and for looking at alternatives to pandas as the data manipulation tool. It would still be useful to keep pandas as the final DataFrame component for its interoperability with other libraries.

However, by leveraging newer standards, in particular Apache Arrow, the end-to-end processing time of the raw data could be reduced dramatically.

The front-runner is polars, which has been shown to be well suited to this kind of workload.

Refs:

tallamjr added the 2 - enhancement and 4 - back-burner labels Mar 2, 2022
tallamjr self-assigned this Mar 2, 2022
tallamjr mentioned this issue Mar 2, 2022
tallamjr pinned this issue Jun 5, 2022