
fix(data-warehouse): Handle compaction errors #29462

Merged
merged 1 commit into master from tom/compaction-job-error-handling on Mar 4, 2025

Conversation

Gilbert09 (Member)

Changes

  • If the compaction job errors out, we don't want to fail the external data job workflow too; just carry on as normal
  • Increase the timeout for compaction jobs to 1 hour, for larger tables (a hedged sketch of both changes follows below)
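
A minimal sketch of what the changed function might look like, assuming the Temporal Python SDK; the workflow name, id scheme, task queue, timeout parameter, and the sentry_sdk import are illustrative assumptions, not PostHog's actual code:

```python
from datetime import timedelta

from sentry_sdk import capture_exception  # assumed error-tracking helper
from temporalio.client import Client


async def trigger_compaction_job(client: Client, inputs: dict, logger) -> str:
    # Hypothetical shape: start the compaction workflow and wait for it,
    # but never let a failure propagate to the external data job caller.
    workflow_id = f"compaction-{inputs['schema_id']}"  # assumed id scheme
    try:
        handle = await client.start_workflow(
            "deltalake-compaction-job",  # assumed workflow name
            inputs,
            id=workflow_id,
            task_queue="compaction-task-queue",  # assumed dedicated queue
            execution_timeout=timedelta(hours=1),  # raised from 5 minutes
        )
        await handle.result()  # wait for compaction to finish
    except Exception as e:
        # Compaction is best-effort: record the error and carry on, so
        # compaction failures never fail the parent external data job.
        capture_exception(e)
        logger.exception(f"Compaction job failed with: {e}", exc_info=e)
    return workflow_id
```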

Gilbert09 requested a review from a team March 4, 2025 10:44
greptile-apps (bot) left a comment:


PR Summary

This PR enhances Delta Lake compaction job handling in PostHog's data warehouse system, focusing on error resilience and larger dataset support.

  • Added error handling in trigger_compaction_job to catch and log exceptions without failing parent workflows
  • Increased compaction job timeout from 5 minutes to 60 minutes to support larger tables
  • Added logger parameter to trigger_compaction_job for improved error tracking and debugging
  • Modified pipeline to continue execution even if compaction fails, preventing workflow interruption

2 files reviewed, no comments

Gilbert09 enabled auto-merge (squash) March 4, 2025 10:50
Comment on lines +45 to +47
```python
except Exception as e:
    capture_exception(e)
    logger.exception(f"Compaction job failed with: {e}", exc_info=e)
```
Contributor:

Don't we want to re-raise here? Otherwise this will return a workflow id for a workflow that failed. But it looks like we are only logging that ID, so maybe it's better not to re-raise.

Gilbert09 (Member, Author):

Nah, we want to carry on with the external data job as normal - we want to avoid compaction job errors making the external data job fail

Contributor:

Makes sense, thank you!

tomasfarias (Contributor) left a comment:

Changes look okay, but I'm wondering if we don't want the compaction job to be a child workflow (i.e. use start_child_workflow instead of start_workflow).

This would allow cancellations and/or timeouts to propagate to the child (depending on configuration). Currently, the compaction job workflow is independent, meaning it has to be independently cancelled.

Anyways, something for later.

EDIT: I guess this also raises the question of whether we need a child workflow at all, couldn't the compaction job just be another activity?
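
For illustration, a hedged sketch of the child-workflow variant tomasfarias describes, using the Temporal Python SDK; the parent workflow class, names, and input shape are assumptions:

```python
from datetime import timedelta

from temporalio import workflow
from temporalio.workflow import ParentClosePolicy


@workflow.defn
class ExternalDataJobWorkflow:  # hypothetical parent workflow
    @workflow.run
    async def run(self, inputs: dict) -> None:
        # ... the actual data import steps would run here ...

        # As a child workflow, cancellation of this parent can propagate
        # to the compaction job, depending on the chosen close policy.
        await workflow.start_child_workflow(
            "deltalake-compaction-job",  # assumed workflow name
            inputs,
            id=f"compaction-{inputs['schema_id']}",  # assumed id scheme
            execution_timeout=timedelta(hours=1),
            parent_close_policy=ParentClosePolicy.REQUEST_CANCEL,
        )
```

With ParentClosePolicy.ABANDON the child would keep running after the parent closes, which is the closest match to today's independent start_workflow behaviour.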

Gilbert09 disabled auto-merge March 4, 2025 11:01
Gilbert09 merged commit cc26a43 into master Mar 4, 2025
90 checks passed
Gilbert09 deleted the tom/compaction-job-error-handling branch March 4, 2025 11:01
Gilbert09 (Member, Author) commented Mar 4, 2025:


> I guess this also raises the question of whether we need a child workflow at all, couldn't the compaction job just be another activity?

It's in a different workflow due to memory issues. Compaction/vacuuming is a little leaky and can really spike memory. It's a non-critical part of the workflow, so offloading it onto another worker (it has its own pods) seemed sensible (see the worker sketch after this comment).

> Changes look okay, but I'm wondering if we don't want the compaction job to be a child workflow (i.e. use start_child_workflow instead of start_workflow).

Ah, didn't realise a child workflow was a thing. Something for us to look into and see what the advantages of using that are
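
To make that isolation concrete: with Temporal, pointing the compaction workflow at a dedicated task queue means only the workers (pods) polling that queue execute it, so compaction's memory spikes never touch the import workers. A hedged sketch of such a worker, with the queue name, workflow name, and server address as assumptions:

```python
import asyncio

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn(name="deltalake-compaction-job")  # assumed workflow name
class CompactionWorkflow:
    @workflow.run
    async def run(self, inputs: dict) -> None:
        ...  # compaction/vacuum steps would live here


async def main() -> None:
    client = await Client.connect("localhost:7233")  # assumed server address
    # A separate worker deployment (its own pods) polls only the
    # compaction queue, keeping leaky compaction memory contained.
    worker = Worker(
        client,
        task_queue="compaction-task-queue",  # hypothetical dedicated queue
        workflows=[CompactionWorkflow],
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```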
