
fix(data-warehouse): Handle compaction errors #29462

Merged
merged 1 commit into master from tom/compaction-job-error-handling on Mar 4, 2025

Conversation

Gilbert09 (Member)

Changes

  • If the compaction job errors out, we don't want to fail the external data job workflow too; just carry on as normal
  • Increase the timeout for compaction jobs to 1 hour, for larger tables (a hedged sketch of both changes follows below)
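
A minimal sketch of what the changed function might look like, assuming the Temporal Python SDK; the workflow name, id scheme, task queue, timeout parameter, and the sentry_sdk import are illustrative assumptions, not PostHog's actual code:

```python
from datetime import timedelta

from sentry_sdk import capture_exception  # assumed error-tracking helper
from temporalio.client import Client


async def trigger_compaction_job(client: Client, inputs: dict, logger) -> str:
    # Hypothetical shape: start the compaction workflow and wait for it,
    # but never let a failure propagate to the external data job caller.
    workflow_id = f"compaction-{inputs['schema_id']}"  # assumed id scheme
    try:
        handle = await client.start_workflow(
            "deltalake-compaction-job",  # assumed workflow name
            inputs,
            id=workflow_id,
            task_queue="compaction-task-queue",  # assumed dedicated queue
            execution_timeout=timedelta(hours=1),  # raised from 5 minutes
        )
        await handle.result()  # wait for compaction to finish
    except Exception as e:
        # Compaction is best-effort: record the error and carry on, so
        # compaction failures never fail the parent external data job.
        capture_exception(e)
        logger.exception(f"Compaction job failed with: {e}", exc_info=e)
    return workflow_id
```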

Gilbert09 requested a review from a team March 4, 2025 10:44
greptile-apps (bot) left a comment:


PR Summary

This PR enhances Delta Lake compaction job handling in PostHog's data warehouse system, focusing on error resilience and larger dataset support.

  • Added error handling in trigger_compaction_job to catch and log exceptions without failing parent workflows
  • Increased compaction job timeout from 5 minutes to 60 minutes to support larger tables
  • Added logger parameter to trigger_compaction_job for improved error tracking and debugging
  • Modified pipeline to continue execution even if compaction fails, preventing workflow interruption

2 files reviewed, no comments

Gilbert09 enabled auto-merge (squash) March 4, 2025 10:50
Comment on lines +45 to +47
```python
except Exception as e:
    capture_exception(e)
    logger.exception(f"Compaction job failed with: {e}", exc_info=e)
```
Contributor:

Don't we want to re-raise here? Otherwise this will return a workflow id for a workflow that failed. But it looks like we are only logging that ID, so maybe it's better not to re-raise.

Gilbert09 (Member, Author):

Nah, we want to carry on with the external data job as normal - we want to avoid compaction job errors making the external data job fail

Contributor:

Makes sense, thank you!

tomasfarias (Contributor) left a comment:

Changes look okay, but I'm wondering if we don't want the compaction job to be a child workflow (i.e. use start_child_workflow instead of start_workflow).

This would allow cancellations and/or timeouts to propagate to the child (depending on configuration). Currently, the compaction job workflow is independent, meaning it has to be independently cancelled.

Anyways, something for later.

EDIT: I guess this also raises the question of whether we need a child workflow at all, couldn't the compaction job just be another activity?
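
For illustration, a hedged sketch of the child-workflow variant tomasfarias describes, using the Temporal Python SDK; the parent workflow class, names, and input shape are assumptions:

```python
from datetime import timedelta

from temporalio import workflow
from temporalio.workflow import ParentClosePolicy


@workflow.defn
class ExternalDataJobWorkflow:  # hypothetical parent workflow
    @workflow.run
    async def run(self, inputs: dict) -> None:
        # ... the actual data import steps would run here ...

        # As a child workflow, cancellation of this parent can propagate
        # to the compaction job, depending on the chosen close policy.
        await workflow.start_child_workflow(
            "deltalake-compaction-job",  # assumed workflow name
            inputs,
            id=f"compaction-{inputs['schema_id']}",  # assumed id scheme
            execution_timeout=timedelta(hours=1),
            parent_close_policy=ParentClosePolicy.REQUEST_CANCEL,
        )
```

With ParentClosePolicy.ABANDON the child would keep running after the parent closes, which is the closest match to today's independent start_workflow behaviour.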

Gilbert09 disabled auto-merge March 4, 2025 11:01
Gilbert09 merged commit cc26a43 into master Mar 4, 2025
90 checks passed
Gilbert09 deleted the tom/compaction-job-error-handling branch March 4, 2025 11:01
Gilbert09 (Member, Author) commented Mar 4, 2025:


> I guess this also raises the question of whether we need a child workflow at all, couldn't the compaction job just be another activity?

It's in a different workflow due to memory issues. Compaction/vacuuming is a little leaky and can really spike memory. It's a non-critical part of the workflow, so offloading it onto another worker (it has its own pods) seemed sensible (see the worker sketch after this comment).

> Changes look okay, but I'm wondering if we don't want the compaction job to be a child workflow (i.e. use start_child_workflow instead of start_workflow).

Ah, didn't realise a child workflow was a thing. Something for us to look into and see what the advantages of using that are
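
To make that isolation concrete: with Temporal, pointing the compaction workflow at a dedicated task queue means only the workers (pods) polling that queue execute it, so compaction's memory spikes never touch the import workers. A hedged sketch of such a worker, with the queue name, workflow name, and server address as assumptions:

```python
import asyncio

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn(name="deltalake-compaction-job")  # assumed workflow name
class CompactionWorkflow:
    @workflow.run
    async def run(self, inputs: dict) -> None:
        ...  # compaction/vacuum steps would live here


async def main() -> None:
    client = await Client.connect("localhost:7233")  # assumed server address
    # A separate worker deployment (its own pods) polls only the
    # compaction queue, keeping leaky compaction memory contained.
    worker = Worker(
        client,
        task_queue="compaction-task-queue",  # hypothetical dedicated queue
        workflows=[CompactionWorkflow],
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```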
