add rocshmem support #359

Merged
msaroufim merged 5 commits into main from danie/rocshmem
Sep 24, 2025

@danielhua23
Collaborator

Description

Derived from #349.

@github-actions

Coverage report

This PR does not seem to contain any modification to coverable code.

@danielhua23
Collaborator Author

danielhua23 commented Sep 23, 2025

@msaroufim @saienduri
I published a new Docker image here: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17933366222/job/50994656849
but whenever I trigger a job, it reports an unexpected error: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17935505394
One of the CI checks also failed.

Could you please help take a look?

@msaroufim
Member

@danielhua23 this is not working quite yet, but I found an easier way to test your code. I triggered this action using a script: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17949472670. The workflow run is green, but run_result crashed with this error:

{"success": true, "error": "", "system": {"gpu": "AMD Instinct MI300X VF", "device_count": 1, "cpu": "INTEL(R) XEON(R) PLATINUM 8568Y+", "runtime": "ROCm", "platform": "Linux-6.1.0-35-amd64-x86_64-with-glibc2.35", "torch": "2.10.0.dev20250916+rocm6.3"}, "runs": {"test": {"start": "2025-09-23T14:29:37.617259", "end": "2025-09-23T14:29:46.220622", "compilation": null, "run": {"success": true, "passed": false, "command": "python rocshmem_test.py test /tmp/tmp3d5ku9j1", "stdout": "=== ROCshmem PyTorch Inline Test ===\n[1/2] clang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=rocshmem_test -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.10 -fPIC -std=c++17 -I/opt/rocm/include -I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c /home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp -o main.o -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC\nFAILED: [code=1] main.o \nclang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=rocshmem_test -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.10 -fPIC -std=c++17 -I/opt/rocm/include -I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c /home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp -o main.o -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC\n/home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp:3:14: fatal error: 'rocshmem.hpp' file not found\n    #include <rocshmem.hpp>\n             ^~~~~~~~~~~~~~\n1 error generated.\nninja: build stopped: subcommand failed.\nROCshmem test failed: Error building extension 'rocshmem_test'\n", "stderr": "", "exit_code": 0, "duration": 8.593523531220853, "result": {}}, "profile": null}}}
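
Note that the result above reports "success": true and "exit_code": 0 at the same time as "passed": false, so the build failure is only visible in the "passed" flag and in the stdout text. A minimal sketch of a helper that surfaces those fields, assuming the JSON layout shown in the output above (the name `summarize_run` is hypothetical and not part of the repository):

```python
import json


def summarize_run(result_json: str, mode: str = "test") -> dict:
    """Extract the fields that actually signal a failed run (hypothetical helper)."""
    result = json.loads(result_json)
    run = result["runs"][mode]["run"]
    # Collect stdout lines mentioning "error", e.g. the compiler's "fatal error".
    error_lines = [
        line.strip()
        for line in run["stdout"].splitlines()
        if "error" in line.lower()
    ]
    return {
        "launcher_ok": result["success"],  # the launcher itself finished
        "passed": run["passed"],           # whether the test actually passed
        "exit_code": run["exit_code"],
        "error_lines": error_lines,
    }


if __name__ == "__main__":
    # Abbreviated sample with the same shape as the result posted above.
    sample = json.dumps({
        "success": True,
        "error": "",
        "runs": {"test": {"run": {
            "success": True,
            "passed": False,
            "stdout": "main.cpp:3:14: fatal error: 'rocshmem.hpp' file not found\n",
            "exit_code": 0,
        }}},
    })
    summary = summarize_run(sample)
    print(summary["passed"], summary["error_lines"])
```

Checking "passed" rather than the top-level "success" avoids treating a green workflow as a passing test.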

Here's an easier script for you to test stuff out

#!/usr/bin/env python3
"""Test script to trigger AMD workflow with ROCshmem payload"""

import base64
import json
import subprocess
import sys
import uuid
import zlib

def main():
    # Load the test payload
    with open('scripts/rocshmem_test_payload.json', 'r') as f:
        payload_dict = json.load(f)

    # Compress and encode the payload (same as GitHub launcher does)
    payload_json = json.dumps(payload_dict)
    compressed = zlib.compress(payload_json.encode('utf-8'))
    encoded = base64.b64encode(compressed).decode('utf-8')

    print(f"Original payload size: {len(payload_json)} bytes")
    print(f"Compressed size: {len(compressed)} bytes")
    print(f"Encoded size: {len(encoded)} bytes")

    # Generate a run ID
    run_id = str(uuid.uuid4())
    
    # Trigger the workflow using gh CLI
    cmd = [
        'gh', 'workflow', 'run', 'amd_workflow.yml',
        '--ref', 'danie/rocshmem',  # Your current branch
        '-f', f'run_id={run_id}',
        '-f', f'payload={encoded}',
        '-f', 'runner=amdgpu-mi300-x86-64'
    ]
    
    print(f"Run ID: {run_id}")
    
    print("\nTriggering workflow with command:")
    print(' '.join(cmd))
    
    result = subprocess.run(cmd, capture_output=True, text=True)
    
    if result.returncode == 0:
        print("\n✓ Workflow triggered successfully!")
        print("\nTo view the run status:")
        print("gh run list --workflow=amd_workflow.yml -L 1")
        print("\nTo watch the run (gh prompts you to pick one):")
        print("gh run watch")
    else:
        print("\n✗ Failed to trigger workflow:")
        print(result.stderr)
        sys.exit(1)

if __name__ == '__main__':
    main()
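
The script above compresses the payload with zlib and then base64-encodes it before passing it as a workflow input. A sketch of the round trip, assuming the receiving side simply reverses those two steps (the helper names are hypothetical, and the runner's actual decoding code is not shown in this thread):

```python
import base64
import json
import zlib


def encode_payload(payload: dict) -> str:
    """JSON-serialize, zlib-compress, then base64-encode, as in the script above."""
    raw = json.dumps(payload).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("utf-8")


def decode_payload(encoded: str) -> dict:
    """Assumed inverse: base64-decode, zlib-decompress, then parse the JSON."""
    raw = zlib.decompress(base64.b64decode(encoded))
    return json.loads(raw.decode("utf-8"))


if __name__ == "__main__":
    # Placeholder payload; the real one lives in scripts/rocshmem_test_payload.json.
    payload = {"mode": "test", "note": "placeholder"}
    encoded = encode_payload(payload)
    assert decode_payload(encoded) == payload
    print(f"Encoded size: {len(encoded)} characters")
```

Running the round trip locally is a cheap way to rule out payload corruption before triggering a workflow run.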

@danielhua23
Collaborator Author

This error is expected if the Docker image hasn't been rebuilt from my new Dockerfile, so I have been asking how to rebuild the image from the new Dockerfile in this PR and get it working with the test code lol.
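
One way to confirm whether a given image actually contains the header is to look for it in the include directories passed to the compiler. A minimal sketch, assuming the `-I` paths from the failing clang++ command in the log above; `find_header` is a hypothetical helper, not part of the repository:

```python
from pathlib import Path
from typing import Optional, Sequence

# Include directories copied from the -I flags in the failing compile command.
INCLUDE_DIRS = (
    "/opt/rocm/include",
    "/home/runner/rocshmem/include/rocshmem",
    "/opt/openmpi/include",
)


def find_header(name: str, include_dirs: Sequence[str] = INCLUDE_DIRS) -> Optional[Path]:
    """Return the first include directory entry that contains `name`, or None."""
    for directory in include_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_header("rocshmem.hpp")
    if found is None:
        print("rocshmem.hpp not found; the image was likely built without rocSHMEM")
    else:
        print(f"rocshmem.hpp found at {found}")
```

Run inside the container, this reports a missing header before the extension build fails on it.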

@saienduri
Contributor

Can you try now @danielhua23? The test runner has to be manually updated whenever we switch the branch the Docker image is built from.

@msaroufim
Member

@saienduri I'm still seeing the same issue. Do you guys mind coordinating a fix synchronously? We're starting problem 3 soon and we still don't have support for this.

{"success": true, "error": "", "system": {"gpu": "AMD Instinct MI300X VF", "device_count": 1, "cpu": "INTEL(R) XEON(R) PLATINUM 8568Y+", "runtime": "ROCm", "platform": "Linux-6.1.0-35-amd64-x86_64-with-glibc2.35", "torch": "2.10.0.dev20250916+rocm6.3"}, "runs": {"test": {"start": "2025-09-23T16:55:33.158776", "end": "2025-09-23T16:55:41.875478", "compilation": null, "run": {"success": true, "passed": false, "command": "python rocshmem_test.py test /tmp/tmp6586x0u_", "stdout": "=== ROCshmem PyTorch Inline Test ===\n[1/2] clang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=rocshmem_test -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.10 -fPIC -std=c++17 -I/opt/rocm/include -I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c /home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp -o main.o -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC\nFAILED: [code=1] main.o \nclang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=rocshmem_test -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.10 -fPIC -std=c++17 -I/opt/rocm/include -I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c /home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp -o main.o -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC\n/home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp:3:14: fatal error: 'rocshmem.hpp' file not found\n    #include <rocshmem.hpp>\n             ^~~~~~~~~~~~~~\n1 error generated.\nninja: build stopped: subcommand failed.\nROCshmem test failed: Error building extension 'rocshmem_test'\n", "stderr": "", "exit_code": 0, "duration": 8.706833689007908, "result": {}}, "profile": null}}}

@saienduri
Contributor

@saienduri self-requested a review on September 24, 2025 05:20

@msaroufim
Member

Just kicked off CI again

@msaroufim merged commit c613971 into main on Sep 24, 2025
5 of 6 checks passed
@msaroufim mentioned this pull request on Sep 24, 2025