[Improve][CI] Avoid repeated environment downloads in loader-ci #723

@imbajin

Observed while rerunning loader-ci during PR #716 review.

Problem

The Prepare env and service step in loader-ci appears to spend a large share of its time downloading or rebuilding external dependencies on every run, even when their versions have not changed.

From the current workflow:

  • .github/workflows/loader-ci.yml only caches ~/.m2
  • hugegraph-loader/assembly/travis/install-hadoop.sh always downloads hadoop-2.8.5.tar.gz from archive.apache.org
  • hugegraph-loader/assembly/travis/install-mysql.sh always runs docker pull mysql:5.7
  • hugegraph-loader/assembly/travis/install-hugegraph-from-source.sh always clones apache/hugegraph and rebuilds the server package from source

The screenshot from the failing/re-run workflow shows Prepare env and service taking about 19 minutes, with a large Hadoop tarball download dominating the step.

loader-ci
└─ Prepare env and service
   ├─ install-hadoop.sh
   │  └─ wget hadoop-2.8.5.tar.gz  (large tarball, repeated)
   ├─ install-mysql.sh
   │  └─ docker pull mysql:5.7     (repeated image pull)
   └─ install-hugegraph-from-source.sh
      └─ git clone + mvn package   (repeated source build)

Why this matters

  • CI duration is much longer than necessary
  • CI becomes more fragile because it depends on multiple external downloads during the test phase
  • Re-runs are expensive even when the code change is unrelated to loader integration environments
  • The only cache in place (~/.m2) likely does not cover the real bottlenecks: the Hadoop tarball, the MySQL image, and the HugeGraph server build

Suggested directions

Prefer official artifacts / containers over ad-hoc install scripts

  • Replace the MySQL setup script with a GitHub Actions services container or another pinned official image
  • Replace the Hadoop local install script with a pinned container/image or other official prebuilt artifact if possible
  • For HugeGraph server, prefer a reusable prebuilt tarball/artifact for the pinned commit/version instead of cloning and packaging from source on every CI run
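A minimal sketch of the "reuse an artifact for the pinned commit" idea, assuming the server package can be stored under a directory that the workflow also caches. The names `ARTIFACT_DIR`, `server_artifact_path`, `ensure_server_artifact`, and the artifact file name are hypothetical and do not come from the existing scripts. The build command is passed in as arguments, so the current clone + mvn package logic could be wrapped in a function and handed over unchanged.

```shell
# Hypothetical sketch: all names and paths below are assumptions.
set -eu

# Directory assumed to be restored/saved by the workflow cache.
ARTIFACT_DIR="${ARTIFACT_DIR:-$HOME/.cache/loader-ci/server}"

# server_artifact_path COMMIT
# Prints the path where the package for COMMIT is expected to live.
server_artifact_path() {
    echo "$ARTIFACT_DIR/hugegraph-server-$1.tar.gz"
}

# ensure_server_artifact COMMIT BUILD_CMD [ARGS...]
# Runs BUILD_CMD (with COMMIT and the target path appended) only when no
# non-empty package for COMMIT exists yet, then prints the package path.
ensure_server_artifact() {
    local commit artifact
    commit="$1"
    shift
    artifact="$(server_artifact_path "$commit")"
    mkdir -p "$ARTIFACT_DIR"
    if [ -s "$artifact" ]; then
        echo "reusing $artifact" >&2
    else
        echo "no artifact for $commit, building" >&2
        # Build output goes to stderr so stdout carries only the path.
        "$@" "$commit" "$artifact" >&2
    fi
    echo "$artifact"
}
```

A script could then call `ensure_server_artifact "$COMMIT_ID" build_server_from_source`, where `build_server_from_source` is a hypothetical wrapper around the existing clone and package steps that writes its result to the path it receives as its second argument. Keying the file name on the commit means a version bump naturally misses the cache and triggers one rebuild.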

If scripts must remain, make them cache-aware and idempotent

  • Add cache coverage for downloaded tarballs or extracted runtime directories if we still use script-based setup
  • Skip wget / docker pull / clone+build when the required artifact is already available
  • Make the scripts check for existing files/directories before re-downloading or rebuilding
  • Verify whether GitHub Actions cache is currently missing the relevant paths, or whether restore keys are ineffective for this use case
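The checks described above could look roughly like the following. This is a sketch under assumptions, not the current scripts: `CACHE_DIR`, `fetch_if_missing`, and `pull_image_if_missing` are hypothetical names, and the cache directory would also have to be added to the workflow's cache paths for the download check to help across runs.

```shell
# Hypothetical helpers: names and the cache location are assumptions.
set -eu

# Directory assumed to be listed in the workflow cache alongside ~/.m2.
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/loader-ci}"

# fetch_if_missing URL FILENAME
# Downloads URL to $CACHE_DIR/FILENAME only if that file is missing or
# empty, then prints the local path.
fetch_if_missing() {
    local url name target
    url="$1"
    name="$2"
    target="$CACHE_DIR/$name"
    mkdir -p "$CACHE_DIR"
    if [ -s "$target" ]; then
        echo "cache hit: $target" >&2
    else
        echo "cache miss: downloading $url" >&2
        # Write to a temporary name first so an interrupted download
        # never leaves a partial file that looks like a cache hit.
        wget -q -O "$target.part" "$url"
        mv "$target.part" "$target"
    fi
    echo "$target"
}

# pull_image_if_missing IMAGE
# Runs docker pull only when IMAGE is not already present locally.
pull_image_if_missing() {
    local image
    image="$1"
    if docker image inspect "$image" >/dev/null 2>&1; then
        echo "image present: $image" >&2
    else
        docker pull "$image"
    fi
}
```

For example, install-hadoop.sh could call `fetch_if_missing` with its existing download URL and `hadoop-2.8.5.tar.gz`, and install-mysql.sh could call `pull_image_if_missing mysql:5.7`. Note that the image check only saves time when the image is already on the runner, for instance after being restored from a cached `docker save` tarball; on a fresh hosted runner it will still pull.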

Possible scope

  • .github/workflows/loader-ci.yml
  • hugegraph-loader/assembly/travis/install-hadoop.sh
  • hugegraph-loader/assembly/travis/install-mysql.sh
  • hugegraph-loader/assembly/travis/install-hugegraph-from-source.sh

Expected outcome

  • Repeated loader-ci runs should not re-download the same Hadoop tarball every time
  • MySQL setup should rely on a pinned, reusable container (for example a services container) rather than always running docker pull inside the script
  • HugeGraph server setup should reuse a stable artifact or cacheable output where possible
  • Prepare env and service time should drop significantly and become more stable

Labels: ci, enhancement, loader
