Fix CI container name collision for parallel matrix jobs #212
Multiple matrix jobs on the same runner used a hardcoded container name
"atom_test", causing cleanup steps to kill sibling jobs' containers.
Use CONTAINER_NAME=atom_test_${strategy.job-index} so each job operates
on its own container without interference.
GPU isolation is already handled by gha-render-devices, so the container
name collision was the sole root cause of the intermittent GPU OOM errors.
@gyohuangxin do you think this change is relevant?
I think it makes sense, could you convert this PR to ready to let CI verify it?
Address review suggestion: quote all $CONTAINER_NAME references in docker commands for defensive shell hygiene.
updated, plz review again @ChuanLi1101
Pull request overview
This PR attempts to fix container name collisions when multiple matrix jobs run in parallel on the same GitHub Actions runner by introducing unique per-job container names. The approach uses strategy.job-index to differentiate containers, replacing 12 hardcoded references to atom_test with a dynamic $CONTAINER_NAME variable.
Changes:
- Added CONTAINER_NAME environment variable using strategy.job-index for unique container names
- Updated all container lifecycle commands (stop, rm, exec) to use $CONTAINER_NAME
- Modified cleanup filters to use exact-match patterns for container isolation
(docker ps -aq -f name="^${CONTAINER_NAME}$" | xargs -r docker stop) || true
(docker ps -aq -f name="^${CONTAINER_NAME}$" | xargs -r docker rm) || true
Docker's --filter name= does not support regex anchors like ^ and $. The filter performs a substring match by default. To prevent matching containers like atom_test_1_backup when looking for atom_test_1, you should rely on exact matching by ensuring unique enough names. However, since the base approach using strategy.job-index is not valid, this will need to be reconsidered once the container naming strategy is corrected.
Suggested change:
- (docker ps -aq -f name="^${CONTAINER_NAME}$" | xargs -r docker stop) || true
- (docker ps -aq -f name="^${CONTAINER_NAME}$" | xargs -r docker rm) || true
+ (docker ps -aq -f name="${CONTAINER_NAME}" | xargs -r docker stop) || true
+ (docker ps -aq -f name="${CONTAINER_NAME}" | xargs -r docker rm) || true
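Regardless of how the name filter treats anchors, the exact-match requirement discussed above can also be met by listing container names and comparing them in the shell. A minimal sketch, assuming CONTAINER_NAME is set as in this PR; filter_exact is a hypothetical helper and is not part of the workflow:

```shell
# Hypothetical helper: keep only the lines on stdin that equal "$1" exactly.
# grep -F treats the name as a fixed string; -x requires a whole-line match.
filter_exact() {
  grep -Fx -- "$1" || true   # "no match" should not fail a cleanup step
}

# Sketch of how a cleanup step might use it (needs a Docker daemon, so it is
# left as a comment here):
#   docker ps -a --format '{{.Names}}' \
#     | filter_exact "$CONTAINER_NAME" \
#     | xargs -r docker rm -f
```

With this approach a name such as atom_test_1_backup or atom_test_10 is never selected when cleaning up atom_test_1.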
runs-on: ${{ matrix.runner }}
env:
  ATOM_BASE_NIGTHLY_IMAGE: rocm/atom-dev:latest
  CONTAINER_NAME: atom_test_${{ strategy.job-index }}
strategy.job-index is not a valid GitHub Actions context variable. GitHub Actions does not provide a built-in job index for matrix jobs. Consider using a combination of matrix values to create unique container names instead. For example:
- Use github.run_id with a sanitized version of matrix.model_name
- Or use github.run_id with github.run_attempt and sanitized matrix values
Example: CONTAINER_NAME: atom_test_${{ github.run_id }}_${{ github.run_attempt }}_${{ matrix.model_name }}
Note: You'll need to sanitize matrix.model_name to remove special characters that are invalid in container names (e.g., replace spaces and special chars with underscores).
Suggested change:
- CONTAINER_NAME: atom_test_${{ strategy.job-index }}
+ CONTAINER_NAME: atom_test_${{ github.run_id }}_${{ github.run_attempt }}_${{ toLower(replace(replace(replace(replace(matrix.model_name, ' ', '_'), '/', '_'), ':', '_'), '.', '_')) }}
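The sanitization mentioned in the note above could also be done in a shell step rather than in a workflow expression. A minimal sketch, assuming the model name reaches the step through a hypothetical MODEL_NAME environment variable; sanitize_name is likewise a hypothetical helper. GITHUB_RUN_ID and GITHUB_RUN_ATTEMPT are the runner's default environment variables, with fallbacks here so the snippet also runs locally:

```shell
# Hypothetical helper: turn an arbitrary string into a Docker-safe name part.
# Lowercases the input, then replaces every character outside [a-z0-9_.-]
# with an underscore.
sanitize_name() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9_.-' '_'
}

# MODEL_NAME is an assumed variable; the default value is only an example.
MODEL_NAME="${MODEL_NAME:-Example Org/model:1.0}"
CONTAINER_NAME="atom_test_${GITHUB_RUN_ID:-0}_${GITHUB_RUN_ATTEMPT:-1}_$(sanitize_name "$MODEL_NAME")"
echo "$CONTAINER_NAME"
```

Dots and hyphens are kept because Docker accepts them in container names; replacing them as well, as the suggestion above does, would be an equally valid choice.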
ChuanLi1101 left a comment
Follow the Copilot suggestion. I also made a suggestion to avoid a low-probability collision. I think you can merge once it is fixed. Approved in advance for convenience.
runs-on: ${{ matrix.runner }}
env:
  ATOM_BASE_NIGTHLY_IMAGE: rocm/atom-dev:latest
  CONTAINER_NAME: atom_test_${{ strategy.job-index }}
Following Copilot: consider changing
CONTAINER_NAME: atom_test_${{ strategy.job-index }}
to:
CONTAINER_NAME: atom_test_${{ github.run_id }}_${{ strategy.job-index }}
The reason is that strategy.job-index is only unique within a single workflow run. If two PRs trigger CI at the same time and are scheduled on the same runner, both runs may have job-index=0, so both generate atom_test_0 and the names collide. With github.run_id added, two workflow runs that execute concurrently on the same runner will not collide.
@gyohuangxin maybe you can help resolve the conflict? so we can merge this one
Summary
- Multiple matrix jobs on the same runner used a hardcoded container name atom_test, causing cleanup steps to kill sibling jobs' containers and leading to intermittent GPU OOM errors.
- Use CONTAINER_NAME=atom_test_${{ strategy.job-index }} so each job operates on its own container without interference.
- GPU isolation is already handled by gha-render-devices; the container name collision was the sole root cause.
Changes
- Add CONTAINER_NAME env var using strategy.job-index for unique per-job container names
- Replace hardcoded atom_test container references with $CONTAINER_NAME
- Use name="^${CONTAINER_NAME}$" in cleanup to avoid interfering with sibling jobs
- atom_test:ci left unchanged (shared image is fine)
Test plan