Pyspark preprocessor outputs to s3 by basavaraj29 · Pull Request #124 · marius-team/marius

basavaraj29 · 2022-11-22T21:19:05Z

The preprocessor now writes processed edge and node data to S3, but the output is split across many files, which still need to be combined into one.

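The following call fails when the files are small:

s3_obj.merge(output_filename, files_list)

It throws the error EntityTooSmall. S3 returns this error when a multipart upload part other than the last is below the 5 MiB minimum, so the merge presumably fails because the individual output files are smaller than that.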
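One possible workaround, sketched below, is to use the server-side merge only when every file except the last is at least 5 MiB (the S3 multipart minimum) and otherwise stream the small files through the client into a single object. This assumes `s3_obj` is an s3fs-style filesystem exposing `size()`, `merge()` and `open()`; the helper names `can_multipart_merge` and `merge_files` are hypothetical and not part of this PR.

```python
import shutil

# S3 rejects multipart parts smaller than 5 MiB, except for the last part.
MIN_PART_SIZE = 5 * 1024 * 1024


def can_multipart_merge(sizes):
    """Return True if every part except the last meets the S3 minimum."""
    return all(size >= MIN_PART_SIZE for size in sizes[:-1])


def merge_files(fs, output_filename, files_list):
    """Merge files_list into output_filename; return the strategy used.

    `fs` is assumed to be an s3fs-style filesystem object (an assumption,
    the PR does not show how s3_obj is created).
    """
    sizes = [fs.size(path) for path in files_list]
    if can_multipart_merge(sizes):
        # Server-side copy: no data passes through the client.
        fs.merge(output_filename, files_list)
        return "multipart"
    # Fallback: stream each small file through the client into one object.
    with fs.open(output_filename, "wb") as out:
        for path in files_list:
            with fs.open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return "streamed"
```

The fallback downloads and re-uploads the data, so it only makes sense when the files are small, which is exactly the case where the merge fails.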

Once we have a single file, we can look into converting it to binary. Alternatively, we can define a custom writer that outputs binary directly, without the intermediate CSV files.
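As a rough illustration of the CSV-to-binary step, the sketch below packs each row of integer IDs into fixed-width binary values. The layout (three little-endian int32 columns per row) and the function name `csv_edges_to_binary` are assumptions for illustration only; the PR does not specify the binary format the preprocessor should produce.

```python
import csv
import struct


def csv_edges_to_binary(csv_path, bin_path, columns=3):
    """Write each CSV row of integer IDs as little-endian int32 values.

    Returns the number of rows written. The int32, fixed-column layout
    is an assumed format, not one confirmed by this PR.
    """
    row_format = struct.Struct("<" + "i" * columns)
    count = 0
    with open(csv_path, newline="") as src, open(bin_path, "wb") as dst:
        for row in csv.reader(src):
            dst.write(row_format.pack(*(int(value) for value in row)))
            count += 1
    return count
```

A custom Spark writer would skip the CSV file entirely, but the packing logic per row would look similar.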

basavaraj29 added 2 commits November 22, 2022 21:14

spark now writes output files to s3, need to combine them

6911f29

reverting test changes

9ca0a05

basavaraj29 changed the base branch from main to spark-preprocessor-optimizations November 22, 2022 21:19

basavaraj29 wants to merge 2 commits into spark-preprocessor-optimizations from pyspark-preprocessor-use-s3
