fix(data-warehouse): Handle compaction errors #29462
Conversation
PR Summary
This PR enhances Delta Lake compaction job handling in PostHog's data warehouse system, focusing on error resilience and larger dataset support.
- Added error handling in trigger_compaction_job to catch and log exceptions without failing parent workflows
- Increased compaction job timeout from 5 minutes to 60 minutes to support larger tables
- Added logger parameter to trigger_compaction_job for improved error tracking and debugging
- Modified pipeline to continue execution even if compaction fails, preventing workflow interruption
2 file(s) reviewed, no comment(s)
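For context, a minimal sketch of the behaviour described in the summary, assuming Temporal's Python SDK; trigger_compaction_job and capture_exception are the only names taken from the PR, and the workflow name, task queue, ID scheme and import paths are placeholders:

```python
from datetime import timedelta

import structlog
from temporalio.client import Client

from posthog.exceptions_capture import capture_exception  # assumed import path

LOGGER = structlog.get_logger()


async def trigger_compaction_job(client: Client, table_id: str, logger=LOGGER) -> str:
    """Start the Delta Lake compaction workflow without letting its failure
    bubble up into the parent external data job."""
    workflow_id = f"deltalake-compaction-{table_id}"  # placeholder ID scheme
    try:
        await client.start_workflow(
            "deltalake-compaction-job",  # placeholder workflow name
            table_id,
            id=workflow_id,
            task_queue="data-warehouse-compaction",  # placeholder task queue
            execution_timeout=timedelta(minutes=60),  # raised from 5 to 60 minutes
        )
    except Exception as e:
        # Report and log, but swallow the error so the external data job carries on.
        capture_exception(e)
        logger.exception(f"Compaction job failed with: {e}", exc_info=e)
    return workflow_id
```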
except Exception as e:
    capture_exception(e)
    logger.exception(f"Compaction job failed with: {e}", exc_info=e)
Don't we want to re-raise here? Otherwise this will return the workflow ID of a workflow that failed. But it looks like we only log that ID, so maybe it's better not to re-raise.
Nah, we want to carry on with the external data job as normal - we want to avoid compaction job errors making the external data job fail
Makes sense, thank you!
Changes look okay, but I'm wondering if we don't want the compaction job to be a child workflow (i.e. use start_child_workflow instead of start_workflow).
This would allow cancellations and/or timeouts to propagate to the child (depending on configuration). Currently, the compaction job workflow is independent, meaning it has to be independently cancelled.
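For reference, a rough sketch of what the child-workflow variant could look like with the Temporal Python SDK (the workflow names, ID scheme and task queue here are placeholders, not the actual PostHog ones):

```python
from temporalio import workflow


@workflow.defn(name="external-data-job")  # placeholder parent workflow name
class ExternalDataJobWorkflow:
    @workflow.run
    async def run(self, table_id: str) -> None:
        # ... import/load activities would run here ...

        # Started as a child workflow: cancellation or timeout of the parent
        # can propagate to the compaction job, depending on parent_close_policy.
        await workflow.start_child_workflow(
            "deltalake-compaction-job",  # placeholder child workflow name
            table_id,
            id=f"deltalake-compaction-{table_id}",
            task_queue="data-warehouse-compaction",  # placeholder task queue
            parent_close_policy=workflow.ParentClosePolicy.REQUEST_CANCEL,
        )
```

With REQUEST_CANCEL, cancelling the parent asks the child to cancel too; ABANDON would keep today's fully independent behaviour.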
Anyways, something for later.
EDIT: I guess this also raises the question of whether we need a child workflow at all: couldn't the compaction job just be another activity?
It's in a different workflow due to memory issues. Compaction/vacuuming is a little leaky and can really spike memory. It's a non-critical part of the workflow, so offloading it onto another worker (it has its own pods) seemed sensible
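For illustration, a minimal sketch of that separation, again assuming the Temporal Python SDK; the task queue name and workflow class are placeholders for whatever PostHog actually runs on the compaction pods:

```python
import asyncio

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn(name="deltalake-compaction-job")  # placeholder workflow name
class DeltalakeCompactionJobWorkflow:
    @workflow.run
    async def run(self, table_id: str) -> None:
        # compaction/vacuum activities would be scheduled here
        ...


async def main() -> None:
    client = await Client.connect("localhost:7233")
    # Run in its own deployment/pods so compaction's memory spikes are isolated
    # from the workers that run the external data job itself.
    worker = Worker(
        client,
        task_queue="data-warehouse-compaction",  # placeholder task queue
        workflows=[DeltalakeCompactionJobWorkflow],
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```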
Ah, didn't realise a child workflow was a thing. Something for us to look into and see what the advantages of using that are