chore(ci): add automatic rerun controller for flaky workflows #2984

Open
contrueCT wants to merge 1 commit into apache:master from contrueCT:pr/ci-rerun-workflow
@contrueCT
Contributor

Purpose of the PR

Apache HugeGraph's four main CI pipelines occasionally fail on transient
issues and then pass cleanly on retry, which forces maintainers to click
"Re-run failed jobs" by hand.

Main Changes

Adds .github/workflows/rerun-ci.yml, a small controller that watches
the four main CI pipelines via workflow_run and automatically reruns
failed jobs at most once per original failure:

  • Fires only when conclusion == 'failure' AND run_attempt < 2 — no
    infinite loops.
  • Uses gh run rerun <id> --failed to re-run only failed jobs, not the
    whole workflow.
  • Least-privilege permissions: actions: write, contents: read.
  • Scoped to HugeGraph-Server CI, HugeGraph-Commons CI, HugeGraph-PD &
    Store & Hstore CI, and Cluster Test CI — no effect on CodeQL / stale
    / license-checker / auto-pr-review.

Verifying these changes

  • Trivial rework / code cleanup without any test coverage. (No Need)
  • Already covered by existing tests, such as (please modify tests here).
  • Need tests and can be verified as follows:
    • xxx

Validated on my fork: controller correctly fires on failed runs, reruns
only the failed jobs, and does not re-trigger itself on the second
attempt.

Does this PR potentially affect the following parts?

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need

@dosubot (bot) added the labels size:M (This PR changes 30-99 lines, ignoring generated files.) and ci-cd (Build or deploy) on Apr 8, 2026
@imbajin changed the title from "ci: add automatic rerun controller for flaky workflows" to "chore(ci): add automatic rerun controller for flaky workflows" on Apr 8, 2026
@imbajin requested a review from Copilot on April 8, 2026 15:54
@imbajin requested a review from VGalaxies on April 8, 2026 15:56
Contributor

Copilot AI left a comment

Pull request overview

Adds a GitHub Actions controller workflow to automatically rerun flaky CI workflow runs once when they fail, reducing the need for manual “Re-run failed jobs” actions.

Changes:

  • Introduces .github/workflows/rerun-ci.yml listening on workflow_run: completed for the four main CI workflows.
  • Automatically triggers gh run rerun <run_id> --failed when the upstream run fails and is on its first attempt.
  • Sets minimal permissions required to rerun Actions workflows (actions: write, contents: read).

  rerun-failed-jobs:
    if: >-
      github.event.workflow_run.conclusion == 'failure' &&
      fromJSON(github.event.workflow_run.run_attempt) < 2
Copilot AI Apr 8, 2026

github.event.workflow_run.run_attempt is already a number in the workflow_run payload. Wrapping it in fromJSON(...) will fail expression evaluation (fromJSON expects a string) and can prevent the job from running. Compare the value directly (or use a numeric cast via toJSON/fromJSON only if you first convert to a string).

Suggested change
- fromJSON(github.event.workflow_run.run_attempt) < 2
+ github.event.workflow_run.run_attempt < 2

jobs:
  rerun-failed-jobs:
    if: >-
Copilot AI Apr 8, 2026

This controller will rerun failed runs for these workflows regardless of what triggered them (e.g., push to master/release-* as well as pull_request). If the intent is only to help flaky PR checks (as described in the issue/PR), consider adding a guard like github.event.workflow_run.event == 'pull_request' (and optionally limiting to the base repo) to avoid consuming extra CI minutes on branch builds.

Suggested change
-   if: >-
+   if: >-
+     github.event.workflow_run.event == 'pull_request' &&

  contents: read

jobs:
  rerun-failed-jobs:
Member

⚠️ This also matches manually dispatched HugeGraph-Commons CI runs, because workflow_run does not filter on the source event unless we do it explicitly. That means a maintainer debugging a manual run can unexpectedly burn a second run here. Please gate the auto-rerun to the PR/push paths we actually want to heal.

Suggested change
  rerun-failed-jobs:
    if: >-
      github.event.workflow_run.conclusion == 'failure' &&
      contains(fromJSON('["push","pull_request"]'), github.event.workflow_run.event) &&
      fromJSON(github.event.workflow_run.run_attempt) < 2

name: "Rerun CI"

on:
  workflow_run:
Member

⚠️ workflow_run is a privileged trigger, but this controller currently fires for failed runs on any target branch. Because the source workflows use bare pull_request, a PR opened against an ad-hoc maintenance/experiment branch would also execute this write-scoped workflow. Can we narrow the trigger surface to the supported branches here as well?

Suggested change
-   workflow_run:
+ on:
+   workflow_run:
+     workflows:
+       - "HugeGraph-Server CI"
+       - "HugeGraph-Commons CI"
+       - "HugeGraph-PD & Store & Hstore CI"
+       - "Cluster Test CI"
+     types:
+       - completed
+     branches:
+       - master
+       - 'release-*'
+       - 'test-*'

@imbajin
Member

imbajin commented Apr 8, 2026

This PR should target a simple first version of auto-rerun:

  • apply only to the 4 main CI workflows
  • rerun failed jobs when a workflow fails
  • keep retrying until success or the retry limit is reached
  • define K as the number of automatic reruns, with default K = 2
  • this means: 1 original run + up to 2 automatic reruns = up to 3 attempts total
  • keep only basic logging for now

Easy-to-read behavior:

1st failure -> wait a short delay -> rerun
2nd failure -> wait a short delay -> rerun
3rd failure -> stop and leave it for manual investigation
+----------------------+
| Main CI run finished |
+----------+-----------+
           |
           v
   conclusion == failure ?
           |
      +----+----+
      |         |
     no        yes
      |         |
      |   run_attempt <= K ?
      |         |
      |    +----+----+
      |    |         |
      |   no        yes
      |    |         |
      |    |   sleep(delay)
      |    |         |
      |    |   gh run rerun <id> --failed
      |    |
      v    v
     stop  wait for next completed event
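
The flow above can be sketched as a small shell function. This is only an illustration under assumptions: the name `decide` and its three positional arguments are hypothetical, and in a real workflow the values would come from `github.event.workflow_run.conclusion` and `github.event.workflow_run.run_attempt`.

```shell
#!/bin/sh
# Sketch of the decision flow above. `decide` is a hypothetical helper,
# not part of the PR.
# Usage: decide <conclusion> <run_attempt> <K>
decide() {
  # Only failed runs are candidates for an automatic rerun.
  if [ "$1" != "failure" ]; then
    echo "stop"
    return 0
  fi

  # Attempt N has already used N-1 reruns, so rerun while attempt <= K.
  if [ "$2" -le "$3" ]; then
    echo "rerun"
  else
    echo "stop"
  fi
}

# Walk through the K = 2 case: attempts 1 and 2 rerun, attempt 3 stops.
for attempt in 1 2 3; do
  echo "attempt $attempt failed -> $(decide failure "$attempt" 2)"
done
```

With K = 2 the loop prints `rerun` for attempts 1 and 2 and `stop` for attempt 3, matching the three-line summary above the diagram.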

5 ⭐️ Required: support K automatic reruns instead of only 1

The current run_attempt < 2 only gives one automatic retry. That is still short of the stated goal of healing flaky / network failures automatically.

If K means the number of automatic reruns, K = 2 is a good default and easy to understand:

  • K = 2
  • attempt 1 fails -> rerun -> attempt 2
  • attempt 2 fails -> rerun -> attempt 3
  • attempt 3 fails -> stop

Reference shape:

env:
  MAX_RERUNS: 2

if: >-
  github.event.workflow_run.conclusion == 'failure' &&
  fromJSON(github.event.workflow_run.run_attempt) <= fromJSON(env.MAX_RERUNS)

That also makes future tuning trivial.
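
One caveat worth checking against the GitHub Actions context-availability table: to my knowledge the `env` context cannot be read from a job-level `if`, so the limit may have to be compared inside a step (or supplied through `vars`). A minimal sketch of such a step script, assuming the step's `env:` block passes in `RUN_ATTEMPT` and `MAX_RERUNS` (the wiring and both helper names are hypothetical, not taken from the PR):

```shell
#!/bin/sh
# Hypothetical step script: RUN_ATTEMPT and MAX_RERUNS are assumed to be
# passed in through the step's `env:` block.

# Print the effective rerun limit: MAX_RERUNS if it is a non-negative
# integer, otherwise the default of 2.
effective_max_reruns() {
  case "${MAX_RERUNS:-}" in
    ''|*[!0-9]*) echo 2 ;;
    *) echo "$MAX_RERUNS" ;;
  esac
}

# Succeed (exit status 0) while the finished attempt is within the limit.
within_limit() {
  [ "${RUN_ATTEMPT:-1}" -le "$(effective_max_reruns)" ]
}

if within_limit; then
  echo "within limit: rerun allowed"
else
  echo "limit reached: leave for manual investigation"
fi
```

Falling back to the default when the value is missing or malformed keeps a typo in the tuning knob from turning into unbounded or zero retries.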

4 ⭐️ Strongly recommended for v1: add a small delay before each rerun

If the failure comes from transient Maven Central / Docker Hub / DNS issues, rerunning immediately often hits the same outage window.

A small delay is likely enough:

  • default 3 minutes, or
  • 5 minutes if you want to be more conservative

Simple reference:

env:
  RETRY_DELAY_SECONDS: 180

steps:
  - name: Wait before rerun
    run: sleep "$RETRY_DELAY_SECONDS"

  - name: Rerun failed jobs
    run: gh run rerun ${{ github.event.workflow_run.id }} --failed

This is not the most runner-minute-efficient approach, but it is simple and predictable for an initial version.
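
The two steps could also be folded into one script, which makes the delay and the rerun command easy to dry-run locally. A sketch under assumptions: `rerun_after_delay` and `DRY_RUN` are hypothetical names, and `gh` is assumed to be installed and authenticated in the real workflow.

```shell
#!/bin/sh
# Hypothetical wrapper around the two steps above. With DRY_RUN=1 it
# prints the command instead of calling gh, so it can be tried locally.
# Usage: rerun_after_delay <run_id>
rerun_after_delay() {
  # Give transient outages (registry, DNS) time to clear.
  sleep "${RETRY_DELAY_SECONDS:-180}"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "gh run rerun $1 --failed"
  else
    gh run rerun "$1" --failed
  fi
}

# Local dry run with no delay; 123456 is a placeholder run id.
DRY_RUN=1
RETRY_DELAY_SECONDS=0
rerun_after_delay 123456
```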

3 ⭐️ Should add: make the basic logging easier to read

The current echo lines are enough for debugging, but not very readable. Writing the decision into GITHUB_STEP_SUMMARY would make this much easier to inspect.

Suggested fields:

  • workflow name
  • event type
  • run id
  • current attempt
  • max reruns
  • delay seconds
  • action: rerun / skip
  • reason: below limit / exceeded limit / non-failure
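
A sketch of how those fields might be written to the job summary. `GITHUB_STEP_SUMMARY` is the file path GitHub Actions provides to each step; the function name `write_summary`, its argument order, and the sample values are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append the rerun decision as a Markdown table to
# the file named by GITHUB_STEP_SUMMARY.
# Usage: write_summary <workflow> <event> <run_id> <attempt> <max> <delay> <action> <reason>
write_summary() {
  {
    echo "### Auto-rerun decision"
    echo ""
    echo "| Field | Value |"
    echo "| --- | --- |"
    echo "| Workflow | $1 |"
    echo "| Event | $2 |"
    echo "| Run ID | $3 |"
    echo "| Current attempt | $4 |"
    echo "| Max reruns | $5 |"
    echo "| Delay (seconds) | $6 |"
    echo "| Action | $7 |"
    echo "| Reason | $8 |"
  } >> "$GITHUB_STEP_SUMMARY"
}

# Outside Actions the variable is unset, so fall back to a temp file.
GITHUB_STEP_SUMMARY=${GITHUB_STEP_SUMMARY:-$(mktemp)}
write_summary "HugeGraph-Server CI" pull_request 123456 1 2 180 rerun "below limit"
cat "$GITHUB_STEP_SUMMARY"
```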

2 ⭐️ Keep the trigger scope narrow

Since this workflow has actions: write, it is still worth keeping the trigger surface tight:

  • keep the current allowlist of the 4 main CI workflows
  • additionally gate source events to push / pull_request
  • optionally restrict branch scope if needed

That avoids auto-rerunning manual/debug runs unexpectedly.
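
If the event gate were checked in a step script rather than in the `if:` expression, it might look like the sketch below. `is_allowed_event` is a hypothetical helper; in the workflow its argument would come from `github.event.workflow_run.event`.

```shell
#!/bin/sh
# Hypothetical guard: succeed only for source events that should be
# healed automatically.
# Usage: is_allowed_event <event_name>
is_allowed_event() {
  case "$1" in
    push|pull_request) return 0 ;;
    *) return 1 ;;
  esac
}

# Manual and scheduled runs fall through to "skipped".
for event in push pull_request workflow_dispatch schedule; do
  if is_allowed_event "$event"; then
    echo "$event: eligible for auto-rerun"
  else
    echo "$event: skipped"
  fi
done
```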

1 ⭐️ Later improvements, not required for v1

These can wait:

  • different K values per workflow
  • only rerun specific failure categories
  • metrics for auto-rerun hit rate over time
  • posting a PR summary comment

These are useful, but they should not block the initial rollout.

If reduced to one sentence, the most practical v1 is:

default to at most 2 automatic reruns, add a 3-minute gap before each rerun, rerun failed jobs only, and write a basic decision summary.

Development

Successfully merging this pull request may close these issues.

[Feature] automatically rerun failed CI jobs once to mitigate flaky workflows

3 participants