feat: raise error in `from_parquet` when df is empty #264

ireneisdoomed · 2023-11-20T11:01:01Z

This PR raises an error in Dataset.from_parquet if the file is found to be empty, preventing silent failures in data pipelines.

`isEmpty()` works by invoking `limit(1).count()`, which forces the evaluation of this operation in the Spark plan. I don't think this will have a major impact on performance, and reading operations are done very early in the pipeline. If anything, the step will fail faster.

I've come across this bug a few times, most recently when accessing the credible set folder for L2G. I think that due to incompatible schemas from various sources, querying this folder returned an empty dataframe.
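A minimal sketch of the check described above, not the actual diff from this PR. The function name `read_non_empty_parquet` and the `EmptyDatasetError` class are hypothetical, and `session` is assumed to be a `SparkSession` (PySpark 3.3+ for `DataFrame.isEmpty()`). Type hints are kept loose so the sketch does not need Spark installed to be imported.

```python
from typing import Any


class EmptyDatasetError(ValueError):
    """Hypothetical error type; the PR may well raise a plain ValueError."""


def read_non_empty_parquet(session: Any, path: str, **read_kwargs: Any) -> Any:
    """Read parquet from `path` and fail fast if the result has no rows.

    `session` is assumed to be a SparkSession; anything exposing
    `.read.parquet(path, **options)` will work.
    """
    df = session.read.parquet(path, **read_kwargs)
    # isEmpty() is an action: it triggers a Spark job, so the read is
    # evaluated here rather than lazily further down the pipeline.
    if df.isEmpty():
        raise EmptyDatasetError(f"Parquet read from {path} returned no rows.")
    return df
```

With this shape, a step pointed at a folder that yields no rows stops at read time with a clear message instead of propagating an empty dataframe downstream.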

tskir

Thank you @ireneisdoomed! Looks like a very good check to have

ireneisdoomed · 2023-11-22T08:36:44Z

For the record, after this change the GWASCatalog step went from ~27 min to 42 min.

d0choa · 2023-11-22T14:20:46Z

@ireneisdoomed we are producing empty dataframes as part of the GWAS catalog harmonisation. Rows that don't comply with the standards (e.g. no beta, no position, etc.) are filtered out, resulting in empty dataframes. SummaryStats -> StudyLocus can also leave only non-significant associations, which could create an empty dataframe.

ireneisdoomed · 2023-11-22T15:24:05Z

@d0choa I don't necessarily see the problem?

If the path points to all the harmonised results / all studylocus, the resulting df won't be empty.
If it points to a single harmonised study with no significant associations, I think it is useful to flag that it is empty.

feat: raise error in from_parquet when df is empty

a1c41e6

ireneisdoomed requested a review from d0choa November 20, 2023 11:11

Merge branch 'main' into il-empty-parquet

a4393fd

tskir approved these changes Nov 21, 2023

tskir merged commit 6b33ddd into main Nov 21, 2023
1 check passed

tskir deleted the il-empty-parquet branch November 21, 2023 09:58
