Fixed autoparallel integration tests on ROCm. #2321
This PR fixes the autoparallel integration tests on ROCm.
Background:
The HIP runtime relies on the file /opt/amdgpu/share/libdrm/amdgpu.ids to look up the product name of an AMD GPU, but this file is missing in the CI docker container. Only /usr/share/libdrm/amdgpu.ids exists in the container, and it is out of date: it does not include newer products like MI300. One way to address this is to pass -v /opt/amdgpu:/opt/amdgpu:ro when launching the docker container, which maps /opt/amdgpu/share/libdrm/amdgpu.ids on the host into the container. This, however, requires changes to the workflow files in pytorch/test_infra and pytorch/pytorch. This quick fix works around the issue by updating the amdgpu.ids file within the CI docker container.
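
To check which situation a given container is in, a small shell sketch along these lines may help. It is not part of this PR: the helper name ids_has_product is hypothetical, and the search string "MI300" is only an example taken from the description above. The script reports, for each of the two paths mentioned, whether the file exists and mentions the product.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): succeed if the given
# amdgpu.ids file exists and mentions the given product substring.
# Usage: ids_has_product FILE PRODUCT_SUBSTRING
ids_has_product() {
    # -q: no output, -i: case-insensitive, -F: match a fixed string.
    [ -f "$1" ] && grep -qiF -- "$2" "$1"
}

# Both paths come from the PR description; "MI300" is an example query.
for f in /opt/amdgpu/share/libdrm/amdgpu.ids /usr/share/libdrm/amdgpu.ids; do
    if ids_has_product "$f" "MI300"; then
        echo "$f: contains an MI300 entry"
    else
        echo "$f: missing, or has no MI300 entry"
    fi
done
```

Run inside the CI container, both lines reporting "missing, or has no MI300 entry" would match the situation described above; after the fix, at least one path should report a matching entry.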