chore(ci): add automatic rerun controller for flaky workflows by contrueCT · Pull Request #2984 · apache/hugegraph

contrueCT · 2026-04-08T13:13:21Z

Purpose of the PR

Closes #2983: [Feature] automatically rerun failed CI jobs once to mitigate flaky workflows

Apache HugeGraph's four main CI pipelines occasionally fail on transient
issues that pass cleanly on retry, forcing manual "Re-run failed jobs"
clicks.

Main Changes

Adds .github/workflows/rerun-ci.yml, a small controller that watches
the four main CI pipelines via workflow_run and automatically reruns
failed jobs at most once per original failure:

- Fires only when conclusion == 'failure' AND run_attempt < 2, so it cannot loop indefinitely.
- Uses gh run rerun <id> --failed to rerun only the failed jobs, not the whole workflow.
- Least-privilege permissions: actions: write, contents: read.
- Scoped to HugeGraph-Server CI, HugeGraph-Commons CI, HugeGraph-PD & Store & Hstore CI, and Cluster Test CI; no effect on CodeQL / stale / license-checker / auto-pr-review.

Verifying these changes

Trivial rework / code cleanup without any test coverage. (No Need)
Already covered by existing tests, such as (please modify tests here).
Need tests and can be verified as follows:
- xxx

Validated on my fork: controller correctly fires on failed runs, reruns
only the failed jobs, and does not re-trigger itself on the second
attempt.

Does this PR potentially affect the following parts?

Dependencies (add/update license info & regenerate_known_dependencies.sh)
Modify configurations
The public API
Other affects (typed here)
Nope

Documentation Status

Doc - TODO
Doc - Done
Doc - No Need

Copilot

Pull request overview

Adds a GitHub Actions controller workflow to automatically rerun flaky CI workflow runs once when they fail, reducing the need for manual “Re-run failed jobs” actions.

Changes:

Introduces .github/workflows/rerun-ci.yml listening on workflow_run: completed for the four main CI workflows.
Automatically triggers gh run rerun <run_id> --failed when the upstream run fails and is on its first attempt.
Sets minimal permissions required to rerun Actions workflows (actions: write, contents: read).

Copilot · 2026-04-08T15:57:20Z

.github/workflows/rerun-ci.yml

+  rerun-failed-jobs:
+    if: >-
+      github.event.workflow_run.conclusion == 'failure' &&
+      fromJSON(github.event.workflow_run.run_attempt) < 2


github.event.workflow_run.run_attempt is already a number in the workflow_run payload. Wrapping it in fromJSON(...) will fail expression evaluation (fromJSON expects a string) and can prevent the job from running. Compare the value directly (or use a numeric cast via toJSON/fromJSON only if you first convert to a string).

Suggested change

-      fromJSON(github.event.workflow_run.run_attempt) < 2
+      github.event.workflow_run.run_attempt < 2

Copilot · 2026-04-08T15:57:21Z

.github/workflows/rerun-ci.yml

+
+jobs:
+  rerun-failed-jobs:
+    if: >-


This controller will rerun failed runs for these workflows regardless of what triggered them (e.g., push to master/release-* as well as pull_request). If the intent is only to help flaky PR checks (as described in the issue/PR), consider adding a guard like github.event.workflow_run.event == 'pull_request' (and optionally limiting to the base repo) to avoid consuming extra CI minutes on branch builds.

Suggested change

-    if: >-
+    if: >-
+      github.event.workflow_run.event == 'pull_request' &&

imbajin · 2026-04-08T15:58:31Z

.github/workflows/rerun-ci.yml

+  contents: read
+
+jobs:
+  rerun-failed-jobs:


⚠️ This also matches manually dispatched HugeGraph-Commons CI runs, because workflow_run does not filter on the source event unless we do it explicitly. That means a maintainer debugging a manual run can unexpectedly burn a second run here. Please gate the auto-rerun to the PR/push paths we actually want to heal.

Suggested change

  rerun-failed-jobs:
    if: >-
      github.event.workflow_run.conclusion == 'failure' &&
      contains(fromJSON('["push","pull_request"]'), github.event.workflow_run.event) &&
      fromJSON(github.event.workflow_run.run_attempt) < 2

imbajin · 2026-04-08T15:58:31Z

.github/workflows/rerun-ci.yml

+name: "Rerun CI"
+
+on:
+  workflow_run:


⚠️ workflow_run is a privileged trigger, but this controller currently fires for failed runs on any target branch. Because the source workflows use bare pull_request, a PR opened against an ad-hoc maintenance/experiment branch would also execute this write-scoped workflow. Can we narrow the trigger surface to the supported branches here as well?

Suggested change

-  workflow_run:
+on:
+  workflow_run:
+    workflows:
+      - "HugeGraph-Server CI"
+      - "HugeGraph-Commons CI"
+      - "HugeGraph-PD & Store & Hstore CI"
+      - "Cluster Test CI"
+    types:
+      - completed
+    branches:
+      - master
+      - 'release-*'
+      - 'test-*'

imbajin · 2026-04-08T16:07:54Z

This PR should target a simple first version of auto-rerun:

apply only to the 4 main CI workflows
rerun failed jobs when a workflow fails
keep retrying until success or the retry limit is reached
define K as the number of automatic reruns, with default K = 2
this means: 1 original run + up to 2 automatic reruns = up to 3 attempts total
keep only basic logging for now

Easy-to-read behavior:

1st failure -> wait a short delay -> rerun
2nd failure -> wait a short delay -> rerun
3rd failure -> stop and leave it for manual investigation

+----------------------+
| Main CI run finished |
+----------+-----------+
           |
           v
   conclusion == failure ?
           |
      +----+----+
      |         |
     no        yes
      |         |
      |   run_attempt <= K ?
      |         |
      |    +----+----+
      |    |         |
      |   no        yes
      |    |         |
      |    |   sleep(delay)
      |    |         |
      |    |   gh run rerun <id> --failed
      |    |
      v    v
     stop  wait for next completed event

5 ⭐️ Required: support K automatic reruns instead of only 1

The current run_attempt < 2 only gives one automatic retry. That is still short of the stated goal of healing flaky / network failures automatically.

If K means the number of automatic reruns, K = 2 is a good default and easy to understand:

K = 2
attempt 1 fails -> rerun -> attempt 2
attempt 2 fails -> rerun -> attempt 3
attempt 3 fails -> stop

Reference shape:

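env:
  MAX_RERUNS: 2

# The env context is not available in a job-level if, so the limit is
# checked in a step-level if, where env can be read.
steps:
  - name: Rerun failed jobs
    if: >-
      github.event.workflow_run.conclusion == 'failure' &&
      fromJSON(github.event.workflow_run.run_attempt) <= fromJSON(env.MAX_RERUNS)
    run: gh run rerun ${{ github.event.workflow_run.id }} --failed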

That also makes future tuning trivial.
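
The retry rule above can also be written out as a small shell function, which makes the K = 2 trace easy to check by hand. This is a minimal sketch, not code from the PR: decide_rerun and its argument order are hypothetical names chosen for illustration, and the reason strings simply reuse the wording suggested elsewhere in this review.

```shell
#!/bin/sh
# decide_rerun CONCLUSION RUN_ATTEMPT MAX_RERUNS   (hypothetical helper)
# Prints the action ("rerun" or "skip") followed by a short reason.
decide_rerun() {
  conclusion=$1
  attempt=$2
  max_reruns=$3
  if [ "$conclusion" != "failure" ]; then
    echo "skip: non-failure"
  elif [ "$attempt" -le "$max_reruns" ]; then
    # Attempts 1..K have not used up the K automatic reruns yet.
    echo "rerun: below limit"
  else
    echo "skip: exceeded limit"
  fi
}

# Trace for K = 2: attempts 1 and 2 are rerun, attempt 3 stops.
for attempt in 1 2 3; do
  echo "attempt $attempt -> $(decide_rerun failure "$attempt" 2)"
done
```

With K = 2 the loop reports a rerun for attempts 1 and 2 and a skip for attempt 3, which matches the "1 original run + up to 2 automatic reruns" reading.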

4 ⭐️ Strongly recommended for v1: add a small delay before each rerun

If the failure comes from transient Maven Central / Docker Hub / DNS issues, rerunning immediately often hits the same outage window.

A small delay is likely enough:

default 3 minutes, or
5 minutes if you want to be more conservative

Simple reference:

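env:
  RETRY_DELAY_SECONDS: 180

steps:
  - name: Wait before rerun
    run: sleep "$RETRY_DELAY_SECONDS"

  - name: Rerun failed jobs
    env:
      # gh needs a token, and a repository when nothing is checked out
      GH_TOKEN: ${{ github.token }}
      GH_REPO: ${{ github.repository }}
    run: gh run rerun ${{ github.event.workflow_run.id }} --failed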

This is not the most runner-minute-efficient approach, but it is simple and predictable for an initial version.
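
For local experimentation, the wait-then-rerun step can be wrapped in a function with a dry-run switch. This is only a sketch under stated assumptions: rerun_after_delay and the DRY_RUN variable are hypothetical and do not exist in the PR, and the run id used below is a placeholder. The gh run rerun <id> --failed invocation itself is the one the PR already uses.

```shell
#!/bin/sh
# rerun_after_delay RUN_ID DELAY_SECONDS   (hypothetical helper)
# DRY_RUN is an assumed switch for this sketch: when it is 1, the gh command
# is printed instead of executed, so the flow can be tried without a token.
rerun_after_delay() {
  run_id=$1
  delay=$2
  sleep "$delay"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "gh run rerun $run_id --failed"
  else
    gh run rerun "$run_id" --failed
  fi
}

# Placeholder run id and a zero delay, so the example returns immediately.
DRY_RUN=1
rerun_after_delay 123456 0
```

Keeping the delay as an argument means the 3-minute and 5-minute options discussed above differ only in the value passed in.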

3 ⭐️ Should add: make the basic logging easier to read

The current echo lines are enough for debugging, but not very readable. Writing the decision into GITHUB_STEP_SUMMARY would make this much easier to inspect.

Suggested fields:

workflow name
event type
run id
current attempt
max reruns
delay seconds
action: rerun / skip
reason: below limit / exceeded limit / non-failure
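
One possible way to emit these fields is a Markdown table appended to the file named by GITHUB_STEP_SUMMARY. The sketch below is illustrative only: write_summary and its positional arguments are hypothetical, the sample values are placeholders, and a temporary file stands in for the summary file when the script runs outside GitHub Actions.

```shell
#!/bin/sh
# write_summary WORKFLOW EVENT RUN_ID ATTEMPT MAX_RERUNS DELAY ACTION REASON
# Hypothetical helper: appends one decision table to $GITHUB_STEP_SUMMARY.
write_summary() {
  {
    echo "### Auto-rerun decision"
    echo ""
    echo "| Field | Value |"
    echo "| --- | --- |"
    echo "| workflow name | $1 |"
    echo "| event type | $2 |"
    echo "| run id | $3 |"
    echo "| current attempt | $4 |"
    echo "| max reruns | $5 |"
    echo "| delay seconds | $6 |"
    echo "| action | $7 |"
    echo "| reason | $8 |"
  } >> "$GITHUB_STEP_SUMMARY"
}

# Outside GitHub Actions the variable is unset, so fall back to a temp file.
GITHUB_STEP_SUMMARY=${GITHUB_STEP_SUMMARY:-$(mktemp)}
write_summary "HugeGraph-Server CI" pull_request 123456 1 2 180 rerun "below limit"
cat "$GITHUB_STEP_SUMMARY"
```

Appending rather than overwriting keeps any summary content written by earlier steps in the same job.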

2 ⭐️ Keep the trigger scope narrow

Since this workflow has actions: write, it is still worth keeping the trigger surface tight:

keep the current allowlist of the 4 main CI workflows
additionally gate source events to push / pull_request
optionally restrict branch scope if needed

That avoids auto-rerunning manual/debug runs unexpectedly.
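
The first two scope checks can be sketched as a single shell predicate. This is an illustration rather than the PR's implementation: in_scope is a hypothetical name, the workflow names are the four listed in this PR, and the optional branch restriction is left out.

```shell
#!/bin/sh
# in_scope WORKFLOW_NAME EVENT   (hypothetical helper)
# Succeeds only for the four main CI workflows triggered by push or
# pull_request; anything else, such as a manual dispatch, is out of scope.
in_scope() {
  case "$1" in
    "HugeGraph-Server CI"|"HugeGraph-Commons CI"|"HugeGraph-PD & Store & Hstore CI"|"Cluster Test CI") ;;
    *) return 1 ;;
  esac
  case "$2" in
    push|pull_request) return 0 ;;
    *) return 1 ;;
  esac
}

# A manually dispatched run is not auto-rerun.
in_scope "HugeGraph-Commons CI" workflow_dispatch || echo "skip: manual run"
```

In the workflow itself the same gate would normally live in the trigger's workflows list and in an if condition on the source event; the function form is just easier to test in isolation.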

1 ⭐️ Later improvements, not required for v1

These can wait:

different K values per workflow
only rerun specific failure categories
metrics for auto-rerun hit rate over time
posting a PR summary comment

These are useful, but they should not block the initial rollout.

If reduced to one sentence, the most practical v1 is:

default to at most 2 automatic reruns, add a 3-minute gap before each rerun, rerun failed jobs only, and write a basic decision summary.

ci: add automatic rerun controller for flaky workflows

f320ec4

dosubot bot added the size:M and ci-cd labels on Apr 8, 2026

imbajin changed the title from "ci: add automatic rerun controller for flaky workflows" to "chore(ci): add automatic rerun controller for flaky workflows" on Apr 8, 2026

imbajin requested a review from Copilot on April 8, 2026 15:54

Copilot started reviewing on behalf of imbajin on April 8, 2026 15:55

imbajin requested a review from VGalaxies on April 8, 2026 15:56

Copilot AI reviewed Apr 8, 2026

imbajin reviewed Apr 8, 2026

contrueCT wants to merge 1 commit into apache:master from contrueCT:pr/ci-rerun-workflow
