Collaborator

@wenchenvincent wenchenvincent commented Feb 4, 2026

This PR fixes the autoparallel integration tests on ROCm.

Background:
HIP runtime relies on the file /opt/amdgpu/share/libdrm/amdgpu.ids to look up the product name of AMDGPU. But this file is missing in the CI docker container. Only /usr/share/libdrm/amdgpu.ids exists in the docker container, and it is out of date: it does not include newer products like MI300. One way to address this is to pass -v /opt/amdgpu:/opt/amdgpu:ro when launching the docker container, which maps /opt/amdgpu/share/libdrm/amdgpu.ids on the host into the container. This, however, requires changes to the workflow files in pytorch/test_infra and pytorch/pytorch. This quick fix works around the issue by updating the amdgpu.ids file within the CI docker container.
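
For anyone who wants to confirm the symptom inside a container, here is a rough diagnostic sketch. It is not part of this PR's change. It assumes amdgpu.ids uses comma-separated lines of the form device_id, revision_id, product_name, with '#' comment lines and a bare version line; the helper names (product_names, has_product, check_container) are hypothetical and exist only for illustration.

```python
from pathlib import Path

# The two locations mentioned in the PR description.
CANDIDATE_PATHS = (
    "/opt/amdgpu/share/libdrm/amdgpu.ids",
    "/usr/share/libdrm/amdgpu.ids",
)


def product_names(ids_text):
    """Return the product names found in amdgpu.ids-style text.

    Assumed line format (not verified against every libdrm release):
    device_id, revision_id, product_name
    """
    names = []
    for line in ids_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = [field.strip() for field in line.split(",", 2)]
        if len(fields) == 3:  # a bare version line has no commas and is skipped
            names.append(fields[2])
    return names


def has_product(ids_text, needle):
    """Case-insensitive check for a product whose name contains needle."""
    needle = needle.lower()
    return any(needle in name.lower() for name in product_names(ids_text))


def check_container(needle="MI300"):
    """Map each candidate path to True/False, or None if the file is missing."""
    report = {}
    for candidate in CANDIDATE_PATHS:
        path = Path(candidate)
        if path.is_file():
            report[candidate] = has_product(path.read_text(errors="replace"), needle)
        else:
            report[candidate] = None
    return report
```

Running `print(check_container())` inside the CI container would show, for each path, whether the file is missing (None) or whether it contains an entry matching "MI300".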

@meta-cla meta-cla bot added the CLA Signed label Feb 4, 2026
@wenchenvincent wenchenvincent marked this pull request as draft February 4, 2026 05:39
@wenchenvincent wenchenvincent marked this pull request as ready for review February 5, 2026 01:05
Contributor

@tianyu-l tianyu-l left a comment

> HIP runtime relies on the file /opt/amdgpu/share/libdrm/amdgpu.ids to look up the product name of AMDGPU. But this file is missing in the CI docker container.

I don't think enabling autoparallel CI on ROCm is urgent. I don't mind waiting until a proper fix (in docker, pytorch, or test_infra) is available, rather than landing this workaround.

Labels

CLA Signed, module: rocm


2 participants