ci: Add workflow to automatically retry failed jobs #1826

Open
jackluo923 wants to merge 5 commits into main from ci-reliability

@jackluo923
Member

@jackluo923 jackluo923 commented Dec 19, 2025

Description

Add a new workflow that monitors other workflows and, when a monitored run fails on its first attempt, automatically re-runs only the failed jobs. This handles transient failures such as:

  • Runner acquisition timeouts ("The job was not acquired by Runner of type self-hosted even after multiple attempts")
  • Self-hosted runners being temporarily unavailable due to maintenance or failures

How it works

  1. Triggers when a monitored workflow completes (via workflow_run event)
  2. If the workflow failed on its first attempt, re-runs only the failed jobs
  3. Uses a GitHub-hosted runner to avoid the same runner acquisition issues
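
A minimal sketch of what such a workflow could look like, based on the steps above and on the review summary further down this thread (monitored workflow `clp-artifact-build`, runner `ubuntu-24.04`, `gh run rerun ... --failed`). The job name, step name, environment variable names `REPO` and `RUN_ID`, and the `permissions` block are assumptions for illustration; the actual file in this PR may differ.

```yaml
name: "retry-failed-jobs"

on:
  workflow_run:
    # Workflows to monitor, matched by workflow name.
    workflows: ["clp-artifact-build"]
    types: ["completed"]

permissions:
  # Assumption: the default token needs write access to Actions to re-run jobs.
  actions: "write"

jobs:
  # Hypothetical job name.
  retry:
    # Only act when the monitored run failed on its first attempt, so a run is
    # retried at most once.
    if: >-
      github.event.workflow_run.conclusion == 'failure'
      && github.event.workflow_run.run_attempt == 1
    # GitHub-hosted runner, so the retry does not depend on self-hosted runners.
    runs-on: "ubuntu-24.04"
    steps:
      - name: "Re-run failed jobs"
        env:
          GH_TOKEN: "${{ github.token }}"
          REPO: "${{ github.repository }}"
          RUN_ID: "${{ github.event.workflow_run.id }}"
        run: gh run rerun "$RUN_ID" --failed --repo "$REPO"
```

Passing the run ID and repository through `env` rather than interpolating them directly into the `run` script keeps expression values out of the shell command text.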

Benefits

  • No maintenance required - doesn't need to know about individual jobs
  • Extensible - add more workflows to the workflows list as needed
  • Decoupled - monitored workflows don't need any changes
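
As a sketch of the extension point mentioned above, monitoring an additional workflow would only mean adding its name to the trigger's `workflows` list. The second entry below is a hypothetical placeholder, not a workflow known to exist in this repository.

```yaml
on:
  workflow_run:
    workflows:
      - "clp-artifact-build"
      # Hypothetical second entry; replace with the name of a real workflow.
      - "another-workflow"
    types: ["completed"]
```

Entries in this list are matched against each workflow's `name`, so the monitored workflow files themselves need no changes.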

Note

The workflow_run event only triggers a workflow whose file exists on the default branch, so the retry workflow takes effect only after this PR is merged to main. From then on, it will automatically handle failures from monitored workflows running on any branch, including PR branches.

Validation performed

  • Reviewed workflow syntax
  • Verified workflow_run event behavior per GitHub Actions documentation

@jackluo923 jackluo923 requested a review from a team as a code owner December 19, 2025 13:32
@coderabbitai
Contributor

coderabbitai bot commented Dec 19, 2025

Walkthrough

New GitHub Actions workflow added to detect first-time failures of the clp-artifact-build workflow and rerun only the failed jobs using the GitHub CLI on a GitHub-hosted runner.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| GitHub Actions Retry Workflow: `.github/workflows/retry-failed-jobs.yaml` | Added workflow "retry-failed-jobs" that triggers on workflow_run completion for clp-artifact-build; when `workflow_run.conclusion == 'failure'` and `workflow_run.run_attempt == 1`, runs `gh run rerun <id> --failed --repo <repo>` on ubuntu-24.04, passing GH_TOKEN. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant SourceWorkflow as clp-artifact-build (workflow)
    participant GHEvents as GitHub Events
    participant RetryWorkflow as retry-failed-jobs (runner)
    participant GHCLI as gh CLI
    participant GitHubAPI as GitHub API

    SourceWorkflow->>GHEvents: emit workflow_run completed (failure, run_attempt=1)
    GHEvents->>RetryWorkflow: trigger workflow run
    RetryWorkflow->>GHCLI: execute `gh run rerun <workflow_run_id> --failed --repo <owner/repo>` (uses GH_TOKEN)
    GHCLI->>GitHubAPI: request rerun of failed jobs
    GitHubAPI-->>GHCLI: confirmation / queued reruns
    GHCLI-->>RetryWorkflow: exit status

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify trigger filter targets only clp-artifact-build and correctly checks workflow_run.run_attempt == 1.
  • Confirm gh command syntax, the workflow_run_id interpolation, and --failed usage.
  • Ensure GH_TOKEN is provided with appropriate scopes and not exposed in logs.
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a new workflow to automatically retry failed jobs, which aligns perfectly with the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b31bbf7 and 07cd933.

📒 Files selected for processing (1)
  • .github/workflows/retry-failed-jobs.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 918
File: .github/workflows/clp-execution-image-build.yaml:77-97
Timestamp: 2025-05-26T16:03:05.519Z
Learning: In .github/workflows/clp-execution-image-build.yaml, the ubuntu-jammy-execution-image and ubuntu-noble-execution-image jobs are intentionally kept separate (rather than using a matrix strategy) to make it easier to remove individual platform versions when they reach end of life, such as when jammy eventually becomes obsolete.
Learnt from: anlowee
Repo: y-scope/clp PR: 925
File: .github/workflows/clp-s-antlr-generation.yaml:24-27
Timestamp: 2025-05-27T20:04:51.498Z
Learning: The clp codebase uses commit SHAs instead of version tags for GitHub Actions (like actions/checkout) as an established pattern across workflow files.
📚 Learning: 2025-10-22T21:14:12.225Z (junhaoliao, PR 1466), applied to files:

  • .github/workflows/retry-failed-jobs.yaml
📚 Learning: 2025-05-26T16:03:05.519Z (quinntaylormitchell, PR 918), applied to files:

  • .github/workflows/retry-failed-jobs.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: check-generated
  • GitHub Check: package-image
  • GitHub Check: check-generated
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
.github/workflows/retry-failed-jobs.yaml (2)

1-15: Excellent documentation!

The workflow documentation is clear, comprehensive, and provides good context for future maintainers. The explanation of how it works and instructions for extending to other workflows are particularly helpful.


16-22: LGTM!

The workflow_run trigger configuration is correct. Using the completed type ensures the workflow triggers after the monitored workflow finishes, with the actual retry logic properly filtered in the job condition.

@jackluo923 jackluo923 requested a review from junhaoliao January 8, 2026 07:42
@junhaoliao junhaoliao self-assigned this Jan 19, 2026
@junhaoliao junhaoliao modified the milestones: Backlog, February 2026 Jan 19, 2026