chore(ci): add automatic rerun controller for flaky workflows #2984
contrueCT wants to merge 1 commit into apache:master
Pull request overview
Adds a GitHub Actions controller workflow to automatically rerun flaky CI workflow runs once when they fail, reducing the need for manual “Re-run failed jobs” actions.
Changes:
- Introduces .github/workflows/rerun-ci.yml, listening on workflow_run: completed for the four main CI workflows.
- Automatically triggers gh run rerun <run_id> --failed when the upstream run fails and is on its first attempt.
- Sets the minimal permissions required to rerun Actions workflows (actions: write, contents: read).
  rerun-failed-jobs:
    if: >-
      github.event.workflow_run.conclusion == 'failure' &&
      fromJSON(github.event.workflow_run.run_attempt) < 2
github.event.workflow_run.run_attempt is already a number in the workflow_run payload. Wrapping it in fromJSON(...) will fail expression evaluation (fromJSON expects a string) and can prevent the job from running. Compare the value directly (or use a numeric cast via toJSON/fromJSON only if you first convert to a string).
Suggested change:
-      fromJSON(github.event.workflow_run.run_attempt) < 2
+      github.event.workflow_run.run_attempt < 2

jobs:
  rerun-failed-jobs:
    if: >-
This controller will rerun failed runs for these workflows regardless of what triggered them (e.g., push to master/release-* as well as pull_request). If the intent is only to help flaky PR checks (as described in the issue/PR), consider adding a guard like github.event.workflow_run.event == 'pull_request' (and optionally limiting to the base repo) to avoid consuming extra CI minutes on branch builds.
Suggested change:
-    if: >-
+    if: >-
+      github.event.workflow_run.event == 'pull_request' &&
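Put together, the guard suggested here might look like the sketch below. It combines the event check with the existing failure and first-attempt conditions. The fork check is only one possible reading of "limiting to the base repo" and is an assumption, as are the runner label and the GH_TOKEN / GH_REPO environment variables used to let gh run without a checkout.

```yaml
jobs:
  rerun-failed-jobs:
    # Rerun only failed first attempts of PR-triggered runs.
    # The last condition (assumption) skips the controller in forked copies
    # of the repository, where the workflow_run event would otherwise fire too.
    if: >-
      github.event.workflow_run.event == 'pull_request' &&
      github.event.workflow_run.conclusion == 'failure' &&
      github.event.workflow_run.run_attempt < 2 &&
      github.event.repository.fork == false
    runs-on: ubuntu-latest  # assumed runner label
    steps:
      - name: Rerun failed jobs
        env:
          GH_TOKEN: ${{ github.token }}       # gh needs a token to call the API
          GH_REPO: ${{ github.repository }}   # lets gh target the repo without a checkout
        run: gh run rerun "${{ github.event.workflow_run.id }}" --failed
```

With this shape, failed push builds on master or release-* are left alone, so a separate decision is still needed on whether those should be healed as well.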
  contents: read

jobs:
  rerun-failed-jobs:
HugeGraph-Commons CI runs, because workflow_run does not filter on the source event unless we do it explicitly. That means a maintainer debugging a manual run can unexpectedly burn a second run here. Please gate the auto-rerun to the PR/push paths we actually want to heal.
Suggested change:
  rerun-failed-jobs:
    if: >-
      github.event.workflow_run.conclusion == 'failure' &&
      contains(fromJSON('["push","pull_request"]'), github.event.workflow_run.event) &&
      fromJSON(github.event.workflow_run.run_attempt) < 2
name: "Rerun CI"

on:
  workflow_run:
workflow_run is a privileged trigger, but this controller currently fires for failed runs on any target branch. Because the source workflows use bare pull_request, a PR opened against an ad-hoc maintenance/experiment branch would also execute this write-scoped workflow. Can we narrow the trigger surface to the supported branches here as well?
Suggested change:
-  workflow_run:
+on:
+  workflow_run:
+    workflows:
+      - "HugeGraph-Server CI"
+      - "HugeGraph-Commons CI"
+      - "HugeGraph-PD & Store & Hstore CI"
+      - "Cluster Test CI"
+    types:
+      - completed
+    branches:
+      - master
+      - 'release-*'
+      - 'test-*'
This PR should target a simple first version of auto-rerun:
Easy-to-read behavior:

5 ⭐️ Required: support K automatic reruns instead of only 1
The current If
Reference shape:
  env:
    MAX_RERUNS: 2
  if: >-
    github.event.workflow_run.conclusion == 'failure' &&
    fromJSON(github.event.workflow_run.run_attempt) <= fromJSON(env.MAX_RERUNS)
That also makes future tuning trivial.

4 ⭐️ Strongly recommended for v1: add a small delay before each rerun
If the failure comes from transient Maven Central / Docker Hub / DNS issues, rerunning immediately often hits the same outage window. A small delay is likely enough:
Simple reference:
  env:
    RETRY_DELAY_SECONDS: 180
  steps:
    - name: Wait before rerun
      run: sleep "$RETRY_DELAY_SECONDS"
    - name: Rerun failed jobs
      run: gh run rerun ${{ github.event.workflow_run.id }} --failed
This is not the most runner-minute-efficient approach, but it is simple and predictable for an initial version.

3 ⭐️ Should add: make the basic logging easier to read
The current Suggested fields:

2 ⭐️ Keep the trigger scope narrow
Since this workflow has
That avoids auto-rerunning manual/debug runs unexpectedly.

1 ⭐️ Later improvements, not required for v1
These can wait:
These are useful, but they should not block the initial rollout.

If reduced to one sentence, the most practical v1 is:
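One way the MAX_RERUNS and RETRY_DELAY_SECONDS reference shapes could be combined is sketched below. GitHub's context-availability table does not list the env context for a job-level if, so this sketch keeps only the failure check there and moves the attempt comparison into the step's shell script. The logged fields, the runner label, and the GH_TOKEN / GH_REPO variables are assumptions for illustration, not part of the PR.

```yaml
jobs:
  rerun-failed-jobs:
    # Job-level condition uses only the github context.
    if: github.event.workflow_run.conclusion == 'failure'
    runs-on: ubuntu-latest  # assumed runner label
    env:
      MAX_RERUNS: 2               # K automatic reruns per original failure
      RETRY_DELAY_SECONDS: 180    # wait out short registry / DNS outages
      RUN_ID: ${{ github.event.workflow_run.id }}
      RUN_ATTEMPT: ${{ github.event.workflow_run.run_attempt }}
      WORKFLOW_NAME: ${{ github.event.workflow_run.name }}
      GH_TOKEN: ${{ github.token }}
      GH_REPO: ${{ github.repository }}
    steps:
      - name: Rerun failed jobs if attempts remain
        run: |
          # Attempt N may trigger a rerun only while N <= MAX_RERUNS.
          if [ "$RUN_ATTEMPT" -gt "$MAX_RERUNS" ]; then
            echo "workflow=$WORKFLOW_NAME run_id=$RUN_ID attempt=$RUN_ATTEMPT: rerun limit reached, skipping"
            exit 0
          fi
          echo "workflow=$WORKFLOW_NAME run_id=$RUN_ID attempt=$RUN_ATTEMPT: rerunning failed jobs in ${RETRY_DELAY_SECONDS}s"
          sleep "$RETRY_DELAY_SECONDS"
          gh run rerun "$RUN_ID" --failed
```

With MAX_RERUNS set to 2, attempts 1 and 2 can each trigger a rerun, so a run is tried at most three times in total. The trade-off noted in the comment still applies: the runner sits idle during the sleep.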
Purpose of the PR
Apache HugeGraph's four main CI pipelines occasionally fail on transient
issues that pass cleanly on retry, forcing manual "Re-run failed jobs"
clicks.
Main Changes
Adds .github/workflows/rerun-ci.yml, a small controller that watches the four main CI pipelines via workflow_run and automatically reruns failed jobs at most once per original failure:
- conclusion == 'failure' AND run_attempt < 2, so no infinite loops.
- gh run rerun <id> --failed to re-run only failed jobs, not the whole workflow.
- actions: write, contents: read.
- Store & Hstore CI, and Cluster Test CI: no effect on CodeQL / stale / license-checker / auto-pr-review.
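For orientation, a controller with the behavior described above could look roughly like the following. This is a sketch assembled from the description and the review snippets, not the file in the PR: the workflow names come from the review suggestion, while the runner label, the GH_TOKEN / GH_REPO variables, and the direct numeric comparison on run_attempt are assumptions.

```yaml
name: "Rerun CI"

on:
  workflow_run:
    workflows:
      - "HugeGraph-Server CI"
      - "HugeGraph-Commons CI"
      - "HugeGraph-PD & Store & Hstore CI"
      - "Cluster Test CI"
    types:
      - completed

# Minimal scopes: write access to Actions to trigger the rerun, read-only contents.
permissions:
  actions: write
  contents: read

jobs:
  rerun-failed-jobs:
    # Fire only for a failed first attempt, so the rerun itself
    # (attempt 2) never triggers another rerun.
    if: >-
      github.event.workflow_run.conclusion == 'failure' &&
      github.event.workflow_run.run_attempt < 2
    runs-on: ubuntu-latest  # assumed runner label
    steps:
      - name: Rerun failed jobs
        env:
          GH_TOKEN: ${{ github.token }}       # gh needs a token to call the API
          GH_REPO: ${{ github.repository }}   # lets gh target the repo without a checkout
        run: gh run rerun "${{ github.event.workflow_run.id }}" --failed
```

Because the controller is not in its own workflows list, its runs never trigger it again. Workflows that are not listed (CodeQL, stale, license-checker, auto-pr-review) are unaffected.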
Verifying these changes
Validated on my fork: controller correctly fires on failed runs, reruns
only the failed jobs, and does not re-trigger itself on the second
attempt.
Does this PR potentially affect the following parts?
Documentation Status
- Doc - TODO
- Doc - Done
- Doc - No Need