feat: raise error in from_parquet when df is empty #264

Merged (2 commits) on Nov 21, 2023

Conversation

ireneisdoomed
Contributor

This PR raises an error in Dataset.from_parquet if the file is found to be empty, preventing silent failures in data pipelines.

`isEmpty()` works by invoking `limit(1).count()`, which forces Spark to evaluate the plan at this point. I don't think this will have a major impact on performance, and read operations happen very early in the pipeline. If anything, the step will fail faster.
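A minimal sketch of the kind of check described above, not the code merged in this PR. The helper name `read_parquet_or_fail`, the `ValueError` type, and the error message are assumptions for illustration; the only Spark calls relied on are `spark.read.parquet` and `DataFrame.isEmpty()` (available in PySpark 3.3 and later).

```python
from typing import Any


def read_parquet_or_fail(spark: Any, path: str, **kwargs: Any) -> Any:
    """Read parquet from `path` and raise if the resulting DataFrame is empty.

    Hypothetical helper: `spark` is expected to be a pyspark.sql.SparkSession,
    and `kwargs` are passed through as reader options.
    """
    df = spark.read.parquet(path, **kwargs)
    # isEmpty() is an action: it triggers a small Spark job right here,
    # so the read is no longer fully lazy.
    if df.isEmpty():
        # Exception type and message are assumptions, not the PR's wording.
        raise ValueError(f"Parquet file is empty: {path}")
    return df
```

Called as `read_parquet_or_fail(spark, path)`, this fails at read time rather than letting an empty dataframe flow silently through later steps, at the cost of one extra Spark action per read.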

I've come across this bug a few times, most recently when accessing the credible set folder for L2G. I think that due to incompatible schemas from various sources, querying this folder returned an empty dataframe.

@ireneisdoomed ireneisdoomed requested a review from d0choa November 20, 2023 11:11
Contributor

@tskir tskir left a comment

Thank you @ireneisdoomed! Looks like a very good check to have

@tskir tskir merged commit 6b33ddd into main Nov 21, 2023
1 check passed
@tskir tskir deleted the il-empty-parquet branch November 21, 2023 09:58
@ireneisdoomed
Contributor Author

For the record, after this change the GWASCatalog step went from ~27 min to 42 min.

@d0choa
Collaborator

d0choa commented Nov 22, 2023

@ireneisdoomed we are producing empty dataframes as part of the GWAS Catalog harmonisation. Rows that don't comply with the standards (e.g. no beta, no position, etc.) are filtered out, resulting in empty dataframes. The SummaryStats -> StudyLocus step can also find no significant associations, which could create an empty dataframe.

@ireneisdoomed
Contributor Author

@d0choa I don't necessarily see the problem?

  1. If the path points to all the harmonised results / all studylocus, the resulting df won't be empty.
  2. If it points to a single harmonised study with no significant associations, I think it is a good thing to flag that it is empty.
