Skip to content

Fix CI container name collision for parallel matrix jobs#212

Open
sunway513 wants to merge 4 commits intoROCm:mainfrom
sunway513:fix/ci-unique-container-names
Open

Fix CI container name collision for parallel matrix jobs#212
sunway513 wants to merge 4 commits intoROCm:mainfrom
sunway513:fix/ci-unique-container-names

Conversation

@sunway513
Copy link
Collaborator

Summary

  • Multiple matrix jobs sharing the same runner used a hardcoded container name atom_test, causing cleanup steps to kill sibling jobs' containers and leading to intermittent GPU OOM errors.
  • Use CONTAINER_NAME=atom_test_${{ strategy.job-index }} so each job operates on its own container without interference.
  • GPU isolation is already handled by gha-render-devices — the container name collision was the sole root cause.

Changes

  • Added CONTAINER_NAME env var using strategy.job-index for unique per-job container names
  • Replaced all 12 hardcoded atom_test container references with $CONTAINER_NAME
  • Used exact-match filter name="^${CONTAINER_NAME}$" in cleanup to avoid interfering with sibling jobs
  • Docker image tag atom_test:ci left unchanged (shared image is fine)

Test plan

  • Verify CI passes with parallel jobs on shared runners (linux-atom-mi355-1)
  • Confirm no container name collisions in job logs
  • Verify cleanup step only stops/removes its own container

sunway513 and others added 2 commits February 12, 2026 23:41
Multiple matrix jobs on the same runner used a hardcoded container name
"atom_test", causing cleanup steps to kill sibling jobs' containers.
Use CONTAINER_NAME=atom_test_${strategy.job-index} so each job operates
on its own container without interference.

GPU isolation is already handled by gha-render-devices, so the container
name collision was the sole root cause of the intermittent GPU OOM errors.
@sunway513
Copy link
Collaborator Author

@gyohuangxin do you think this change is relevant?

@gyohuangxin
Copy link
Member

@gyohuangxin do you think this change is relevant?

I think it makes sense, could you convert this PR to ready to let CI verify it?

ChuanLi1101
ChuanLi1101 previously approved these changes Feb 14, 2026
Copy link
Collaborator

@ChuanLi1101 ChuanLi1101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

Address review suggestion: quote all $CONTAINER_NAME references
in docker commands for defensive shell hygiene.
@sunway513 sunway513 marked this pull request as ready for review February 16, 2026 03:23
Copilot AI review requested due to automatic review settings February 16, 2026 03:23
@sunway513
Copy link
Collaborator Author

updated, plz review again @ChuanLi1101

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR attempts to fix container name collisions when multiple matrix jobs run in parallel on the same GitHub Actions runner by introducing unique per-job container names. The approach uses strategy.job-index to differentiate containers, replacing 12 hardcoded references to atom_test with a dynamic $CONTAINER_NAME variable.

Changes:

  • Added CONTAINER_NAME environment variable using strategy.job-index for unique container names
  • Updated all container lifecycle commands (stop, rm, exec) to use $CONTAINER_NAME
  • Modified cleanup filters to use exact-match patterns for container isolation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +155 to +156
(docker ps -aq -f name="^${CONTAINER_NAME}$" | xargs -r docker stop) || true
(docker ps -aq -f name="^${CONTAINER_NAME}$" | xargs -r docker rm) || true
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker's --filter name= does not support regex anchors like ^ and $. The filter performs a substring match by default. To prevent matching containers like atom_test_1_backup when looking for atom_test_1, you should rely on exact matching by ensuring unique enough names. However, since the base approach using strategy.job-index is not valid, this will need to be reconsidered once the container naming strategy is corrected.

Suggested change
(docker ps -aq -f name="^${CONTAINER_NAME}$" | xargs -r docker stop) || true
(docker ps -aq -f name="^${CONTAINER_NAME}$" | xargs -r docker rm) || true
(docker ps -aq -f name="${CONTAINER_NAME}" | xargs -r docker stop) || true
(docker ps -aq -f name="${CONTAINER_NAME}" | xargs -r docker rm) || true

Copilot uses AI. Check for mistakes.
runs-on: ${{ matrix.runner }}
env:
ATOM_BASE_NIGTHLY_IMAGE: rocm/atom-dev:latest
CONTAINER_NAME: atom_test_${{ strategy.job-index }}
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strategy.job-index is not a valid GitHub Actions context variable. GitHub Actions does not provide a built-in job index for matrix jobs. Consider using a combination of matrix values to create unique container names instead. For example:

  • Use github.run_id with a sanitized version of matrix.model_name
  • Or use github.run_id with github.run_attempt and sanitized matrix values

Example: CONTAINER_NAME: atom_test_${{ github.run_id }}_${{ github.run_attempt }}_${{ matrix.model_name }}

Note: You'll need to sanitize matrix.model_name to remove special characters that are invalid in container names (e.g., replace spaces and special chars with underscores).

Suggested change
CONTAINER_NAME: atom_test_${{ strategy.job-index }}
CONTAINER_NAME: atom_test_${{ github.run_id }}_${{ github.run_attempt }}_${{ toLower(replace(replace(replace(replace(matrix.model_name, ' ', '_'), '/', '_'), ':', '_'), '.', '_')) }}

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

@ChuanLi1101 ChuanLi1101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow the Co-Pilot suggestion. Made a suggestion to avoid a low probability collision. I think you can merge when fix it. Approved in advance for convenience.

runs-on: ${{ matrix.runner }}
env:
ATOM_BASE_NIGTHLY_IMAGE: rocm/atom-dev:latest
CONTAINER_NAME: atom_test_${{ strategy.job-index }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow the co-pilot: Consider change CONTAINER_NAME: atom_test_${{ strategy.job-index }}
to:
CONTAINER_NAME: atom_test_${{ github.run_id }}_${{ strategy.job-index }}

The reason is strategy.job-index is only unique within a single workflow run. If two PRs trigger CI at the same time and are scheduled on the same runner, both runs may have job-index=0, resulting in both generating atom_test_0 and causing a naming collision. Adding github.run_id. Even if two workflow runs execute concurrently on the same runner, they will not collide.

@sunway513 sunway513 marked this pull request as draft February 20, 2026 16:33
@sunway513 sunway513 marked this pull request as ready for review February 26, 2026 17:52
@valarLip
Copy link
Collaborator

@gyohuangxin maybe you can help resolve the conflict? so we can merge this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants