Observed while rerunning loader-ci during PR #716 review.
Problem
The Prepare env and service step in loader-ci appears to spend a large amount of time repeatedly downloading or rebuilding external dependencies on each run, even when the versions do not change.
From the current workflow:
- .github/workflows/loader-ci.yml only caches ~/.m2
- hugegraph-loader/assembly/travis/install-hadoop.sh always downloads hadoop-2.8.5.tar.gz from archive.apache.org
- hugegraph-loader/assembly/travis/install-mysql.sh always runs docker pull mysql:5.7
- hugegraph-loader/assembly/travis/install-hugegraph-from-source.sh always clones apache/hugegraph and rebuilds the server package from source
The screenshot from the failing/re-run workflow shows Prepare env and service taking about 19 minutes, with a large Hadoop tarball download dominating the step.
loader-ci
└─ Prepare env and service
   ├─ install-hadoop.sh
   │  └─ wget hadoop-2.8.5.tar.gz (large tarball, repeated)
   ├─ install-mysql.sh
   │  └─ docker pull mysql:5.7 (repeated image pull)
   └─ install-hugegraph-from-source.sh
      └─ git clone + mvn package (repeated source build)
Why this matters
- CI duration is much longer than necessary
- CI becomes more fragile because it depends on multiple external downloads during the test phase
- Re-runs are expensive even when the code change is unrelated to loader integration environments
- Current cache coverage likely does not match the real bottlenecks
Suggested directions
Prefer official artifacts / containers over ad-hoc install scripts
- Replace the MySQL setup script with a GitHub Actions services container or another pinned official image
- Replace the Hadoop local install script with a pinned container/image or other official prebuilt artifact if possible
- For HugeGraph server, prefer a reusable prebuilt tarball/artifact for the pinned commit/version instead of cloning and packaging from source on every CI run
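As a rough illustration of reusing a server package per pinned commit, the sketch below builds only when no tarball exists for that commit. The directory `ARTIFACT_DIR`, the function name `server_tarball_for`, and the tarball naming scheme are all hypothetical and not part of the current scripts; the build command is passed in by the caller, so nothing here assumes how the server is actually packaged.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical directory holding prebuilt server packages, one per commit.
ARTIFACT_DIR="${ARTIFACT_DIR:-$HOME/.cache/hugegraph-server}"

# Print the path of the server tarball for commit $1.
# The build command (all remaining arguments) runs only when no tarball
# exists yet for that commit; it receives the output path as its last argument.
server_tarball_for() {
    local commit="$1"; shift
    local tarball="$ARTIFACT_DIR/hugegraph-server-$commit.tar.gz"
    mkdir -p "$ARTIFACT_DIR"
    if [ ! -s "$tarball" ]; then
        # Send build output to stderr so stdout carries only the path.
        "$@" "$tarball" >&2
    fi
    echo "$tarball"
}
```

The same keying idea would apply if the tarball came from a release download or a workflow artifact instead of a local build: the commit or version in the file name is what makes a stale package impossible to reuse by accident.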
If scripts must remain, make them cache-aware and idempotent
- Add cache coverage for downloaded tarballs or extracted runtime directories if we still use script-based setup
- Skip wget / docker pull / clone+build when the required artifact is already available
- Make the scripts check for existing files/directories before re-downloading or rebuilding
- Verify whether GitHub Actions cache is currently missing the relevant paths, or whether restore keys are ineffective for this use case
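If the scripts stay, the checks above could look roughly like the following. This is a minimal sketch, not the current script contents: `CACHE_DIR`, `fetch_once`, and `pull_once` are hypothetical names, and the cache directory would still have to be added to the workflow's cached paths for the skip to survive across runs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical cache location; it only helps across CI runs if the
# workflow also caches this directory.
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/loader-ci}"

# Download URL $1 to $CACHE_DIR/$2 unless that file already exists.
# Prints the local path on stdout.
fetch_once() {
    local url="$1" name="$2"
    local target="$CACHE_DIR/$name"
    mkdir -p "$CACHE_DIR"
    if [ ! -s "$target" ]; then
        # Download to a temporary name first so an interrupted transfer
        # never leaves a truncated file that later looks like a cache hit.
        wget --quiet -O "$target.part" "$url"
        mv "$target.part" "$target"
    fi
    echo "$target"
}

# Pull a Docker image only if it is not already present locally.
pull_once() {
    local image="$1"
    docker image inspect "$image" >/dev/null 2>&1 || docker pull "$image"
}
```

A script such as install-hadoop.sh could then call `fetch_once "$HADOOP_URL" hadoop-2.8.5.tar.gz` (with `HADOOP_URL` standing in for whatever URL the script uses today), and install-mysql.sh could call `pull_once mysql:5.7`. Note that `pull_once` only helps if the runner keeps its local image store between runs, which is not the case on fresh hosted runners.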
Possible scope
- .github/workflows/loader-ci.yml
- hugegraph-loader/assembly/travis/install-hadoop.sh
- hugegraph-loader/assembly/travis/install-mysql.sh
- hugegraph-loader/assembly/travis/install-hugegraph-from-source.sh
Expected outcome
- Repeated loader-ci runs should not re-download the same Hadoop tarball every time
- MySQL setup should rely on a reusable/pinned container path rather than always pulling inside the script
- HugeGraph server setup should reuse a stable artifact or cacheable output where possible
- Prepare env and service time should drop significantly and become more stable