ci: Add workflow to automatically retry failed jobs #1826
jackluo923 wants to merge 5 commits into main from …
Add a new workflow that monitors other workflows and automatically retries failed jobs on the first attempt. This handles transient failures such as:
- Runner acquisition timeouts
- Self-hosted runners being temporarily unavailable due to maintenance or failures
Walkthrough
New GitHub Actions workflow added to detect first-time failures of the clp-artifact-build workflow and rerun only the failed jobs using the GitHub CLI on a GitHub-hosted runner.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant SourceWorkflow as clp-artifact-build (workflow)
participant GHEvents as GitHub Events
participant RetryWorkflow as retry-failed-jobs (runner)
participant GHCLI as gh CLI
participant GitHubAPI as GitHub API
SourceWorkflow->>GHEvents: emit workflow_run completed (failure, run_attempt=1)
GHEvents->>RetryWorkflow: trigger workflow run
RetryWorkflow->>GHCLI: execute `gh run rerun <workflow_run_id> --failed --repo <owner/repo>` (uses GH_TOKEN)
GHCLI->>GitHubAPI: request rerun of failed jobs
GitHubAPI-->>GHCLI: confirmation / queued reruns
GHCLI-->>RetryWorkflow: exit status
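The flow in the diagram could be expressed as a workflow along the following lines. This is a minimal sketch, not the file from this PR: the job name, step name, runner label, and `permissions` block are assumptions chosen for illustration, while the `workflow_run` trigger, the first-attempt failure filter, and the `gh run rerun ... --failed --repo ...` command come from the walkthrough above.

```yaml
# Hypothetical sketch of .github/workflows/retry-failed-jobs.yaml.
# Names and the runner label are assumptions, not the PR's actual content.
name: "retry-failed-jobs"

on:
  workflow_run:
    # Workflows to monitor, matched by their `name:` field.
    workflows: ["clp-artifact-build"]
    # Fire once the monitored workflow run has finished.
    types: ["completed"]

jobs:
  retry-failed-jobs:
    # Only act on a failed first attempt, so a persistent failure is retried at most once.
    if: "github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.run_attempt == 1"
    # Assumed GitHub-hosted runner label, so the retry does not depend on self-hosted runners.
    runs-on: "ubuntu-24.04"
    permissions:
      # Requesting a rerun needs write access to Actions.
      actions: "write"
    steps:
      - name: "Rerun failed jobs"
        env:
          # The gh CLI reads its credentials from GH_TOKEN.
          GH_TOKEN: "${{ github.token }}"
          RUN_ID: "${{ github.event.workflow_run.id }}"
        # No checkout happens here, so the repository is passed explicitly.
        run: gh run rerun "$RUN_ID" --failed --repo "$GITHUB_REPOSITORY"
```

Filtering in the job-level `if` rather than in the trigger keeps the `on:` block simple, since `workflow_run` itself cannot filter on conclusion or attempt number.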
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks | ✅ 3 passed
Actionable comments posted: 2
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/retry-failed-jobs.yaml(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 918
File: .github/workflows/clp-execution-image-build.yaml:77-97
Timestamp: 2025-05-26T16:03:05.519Z
Learning: In .github/workflows/clp-execution-image-build.yaml, the ubuntu-jammy-execution-image and ubuntu-noble-execution-image jobs are intentionally kept separate (rather than using a matrix strategy) to make it easier to remove individual platform versions when they reach end of life, such as when jammy eventually becomes obsolete.
Learnt from: anlowee
Repo: y-scope/clp PR: 925
File: .github/workflows/clp-s-antlr-generation.yaml:24-27
Timestamp: 2025-05-27T20:04:51.498Z
Learning: The clp codebase uses commit SHAs instead of version tags for GitHub Actions (like actions/checkout) as an established pattern across workflow files.
📚 Learning: 2025-10-22T21:14:12.225Z
Applied to files:
.github/workflows/retry-failed-jobs.yaml
📚 Learning: 2025-05-26T16:03:05.519Z
Applied to files:
.github/workflows/retry-failed-jobs.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: package-image
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: check-generated
- GitHub Check: package-image
- GitHub Check: check-generated
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
.github/workflows/retry-failed-jobs.yaml (2)
1-15: Excellent documentation! The workflow documentation is clear, comprehensive, and provides good context for future maintainers. The explanation of how it works and instructions for extending to other workflows are particularly helpful.
16-22: LGTM! The `workflow_run` trigger configuration is correct. Using the `completed` type ensures the workflow triggers after the monitored workflow finishes, with the actual retry logic properly filtered in the job condition.
Description
Add a new workflow that monitors other workflows and automatically retries failed jobs on the first attempt. This handles transient failures such as:
- Runner acquisition timeouts
- Self-hosted runners being temporarily unavailable due to maintenance or failures
How it works
- … `workflow_run` event)
Benefits
- … `workflows` list as needed
Note
The `workflow_run` event only triggers a workflow whose file exists on the default branch. Once this PR is merged to `main`, the retry workflow will automatically handle failures from workflows running on any branch, including PR branches.
Validation performed
- … `workflow_run` event behavior per GitHub Actions documentation