Added retry logic for transient errors in Cloud Run V2 #17021

Open
wants to merge 11 commits into
base: main
@MahmoodAbuGneam MahmoodAbuGneam commented Feb 6, 2025

Pull Request: Implement Retry Logic for Transient Errors in Cloud Run V2 #16448

Closes #16448

Overview

This pull request adds automatic retries around the JobV2.create call in _create_job_and_wait_for_registration for the prefect-gcp Cloud Run V2 worker. The goal is to reduce false crash alerts and improve workflow stability when transient failures (e.g., HTTP 503 errors) occur.

Changes Implemented

  • Added retry logic using tenacity to handle transient HTTP failures (status 500 and 503).
  • Retries up to 5 attempts with exponential backoff (multiplier=1, min=2s, max=10s) before failing.
  • Modified _create_job_and_wait_for_registration to call create_job_with_retries() instead of calling JobV2.create directly.
  • Added unit tests that verify the retry behavior.
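
The behavior described in the list above can be sketched in plain Python. This is not the PR's code: the PR uses tenacity, while the sketch below hand-rolls the loop with only the standard library so it runs anywhere. HttpStatusError, backoff_seconds, and the create_job callable are hypothetical stand-ins, and the exact wait schedule tenacity produces for the same parameters may differ.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Status codes the PR description treats as transient.
TRANSIENT_STATUS_CODES = {500, 503}


class HttpStatusError(Exception):
    """Hypothetical stand-in for the HTTP error raised by the Cloud Run client."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def is_transient_http_error(exc: BaseException) -> bool:
    """Return True only for errors that are worth retrying."""
    return isinstance(exc, HttpStatusError) and exc.status in TRANSIENT_STATUS_CODES


def backoff_seconds(
    attempt: int, multiplier: float = 1, minimum: float = 2, maximum: float = 10
) -> float:
    """Exponential wait for a 1-based attempt number, clamped to [minimum, maximum]."""
    return max(minimum, min(multiplier * 2 ** (attempt - 1), maximum))


def create_job_with_retries(
    create_job: Callable[[], T],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call create_job, retrying transient failures up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return create_job()
        except Exception as exc:
            # Non-transient errors and the final failed attempt propagate unchanged.
            if not is_transient_http_error(exc) or attempt == max_attempts:
                raise
            sleep(backoff_seconds(attempt))
    raise AssertionError("unreachable: the loop either returns or raises")
```

In this sketch, five attempts against a persistently failing endpoint wait 2, 2, 4, and 8 seconds between attempts before the last error is re-raised. The sleep parameter is injectable so tests can record the waits instead of actually sleeping.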

Tests

The following unit tests were added:

  • test_create_job_with_retries_success: Verifies that job creation succeeds after 2 transient failures.
  • test_create_job_with_retries_non_retryable_error: Verifies that non-retryable errors (e.g., 400) fail immediately without a retry.
  • test_create_job_with_retries_max_attempts: Verifies that retries stop after the maximum of 5 attempts for persistent 503s.
  • test_is_transient_http_error: Verifies that only 500/503 errors trigger retries.
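
As a rough illustration of how the first three cases can be exercised, the sketch below drives a retry helper with unittest.mock.Mock and a side_effect sequence. The helper call_with_retries and the HttpStatusError class are compact hypothetical stand-ins (no backoff, no tenacity), not the functions added in this PR, and the test names are shortened.

```python
from unittest.mock import Mock

RETRYABLE = {500, 503}
MAX_ATTEMPTS = 5


class HttpStatusError(Exception):
    """Hypothetical stand-in for the client's HTTP error type."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retries(fn):
    """Compact stand-in for the retry wrapper; it does not sleep between attempts."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return fn()
        except HttpStatusError as exc:
            if exc.status not in RETRYABLE or attempt == MAX_ATTEMPTS:
                raise


def test_success_after_two_failures():
    # Two 503s, then a successful result on the third call.
    create = Mock(side_effect=[HttpStatusError(503), HttpStatusError(503), "job-created"])
    assert call_with_retries(create) == "job-created"
    assert create.call_count == 3


def test_non_retryable_error_fails_immediately():
    create = Mock(side_effect=HttpStatusError(400))
    try:
        call_with_retries(create)
    except HttpStatusError as exc:
        assert exc.status == 400
    else:
        raise AssertionError("expected HttpStatusError")
    assert create.call_count == 1


def test_stops_after_max_attempts():
    create = Mock(side_effect=HttpStatusError(503))
    try:
        call_with_retries(create)
    except HttpStatusError as exc:
        assert exc.status == 503
    else:
        raise AssertionError("expected HttpStatusError")
    assert create.call_count == MAX_ATTEMPTS
```

Asserting on call_count is what separates "retried and then succeeded" from "succeeded on the first call", and it also confirms that a 400 is never retried.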

Impact

  • Improves robustness of Cloud Run jobs by handling transient failures.
  • Reduces unnecessary manual intervention and false alerts.
  • Follows the common practice of retrying transient cloud API errors with backoff.
  • No breaking changes to existing functionality.

Notes

  • The pre-commit UI/documentation generation hooks fail locally on Windows due to missing Node.js/shell dependencies; they are expected to pass in CI.
  • The retry parameters (max attempts=5, backoff multiplier=1) were chosen based on common patterns for handling cloud API transient errors.

@github-actions github-actions bot added the docs label Feb 6, 2025
Member

@desertaxle desertaxle left a comment

Hey @MahmoodAbuGneam! It looks like this PR contains some changes to prefect in addition to prefect-gcp. Based on your description, this looks like it should only include changes to prefect-gcp, so could you revert the changes to prefect in this PR as they seem unrelated?
