add rocshmem support by danielhua23 · Pull Request #359 · gpu-mode/kernelbot

danielhua23 · 2025-09-23T01:38:35Z

Description

Derived from #349.

github-actions · 2025-09-23T01:39:41Z

Coverage report

This PR does not seem to contain any modification to coverable code.

danielhua23 · 2025-09-23T05:32:07Z

@msaroufim @saienduri
I published a new Docker image here: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17933366222/job/50994656849
but whenever I trigger a job, it reports an unexpected error: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17935505394
One of the CI checks also failed.

Could you please take a look?

msaroufim · 2025-09-23T14:34:37Z

@danielhua23 this isn't quite working yet, but I found an easier way to test your code. I ran this action from a script: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17949472670. The workflow is green, but run_result crashed with this error:

{"success": true, "error": "", "system": {"gpu": "AMD Instinct MI300X VF", "device_count": 1, "cpu": "INTEL(R) XEON(R) PLATINUM 8568Y+", "runtime": "ROCm", "platform": "Linux-6.1.0-35-amd64-x86_64-with-glibc2.35", "torch": "2.10.0.dev20250916+rocm6.3"}, "runs": {"test": {"start": "2025-09-23T14:29:37.617259", "end": "2025-09-23T14:29:46.220622", "compilation": null, "run": {"success": true, "passed": false, "command": "python rocshmem_test.py test /tmp/tmp3d5ku9j1", "stdout": "=== ROCshmem PyTorch Inline Test ===\n[1/2] clang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=rocshmem_test -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.10 -fPIC -std=c++17 -I/opt/rocm/include -I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c /home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp -o main.o -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC\nFAILED: [code=1] main.o \nclang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=rocshmem_test -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.10 -fPIC -std=c++17 -I/opt/rocm/include -I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c /home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp -o main.o -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC\n/home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp:3:14: fatal error: 'rocshmem.hpp' file not found\n    #include <rocshmem.hpp>\n             ^~~~~~~~~~~~~~\n1 error generated.\nninja: build stopped: subcommand failed.\nROCshmem test failed: Error building extension 'rocshmem_test'\n", "stderr": "", "exit_code": 0, "duration": 8.593523531220853, "result": {}}, "profile": null}}}

Here's an easier script for you to test stuff out

#!/usr/bin/env python3
"""Test script to trigger AMD workflow with ROCshmem payload"""

import base64
import json
import shlex
import subprocess
import sys
import uuid
import zlib

def main():
    # Load the test payload
    with open('scripts/rocshmem_test_payload.json', 'r') as f:
        payload_dict = json.load(f)

    # Compress and encode the payload (same as GitHub launcher does)
    payload_json = json.dumps(payload_dict)
    compressed = zlib.compress(payload_json.encode('utf-8'))
    encoded = base64.b64encode(compressed).decode('utf-8')

    print(f"Original payload size: {len(payload_json)} bytes")
    print(f"Compressed size: {len(compressed)} bytes")
    print(f"Encoded size: {len(encoded)} bytes")

    # Generate a run ID
    run_id = str(uuid.uuid4())

    # Trigger the workflow using gh CLI
    cmd = [
        'gh', 'workflow', 'run', 'amd_workflow.yml',
        '--ref', 'danie/rocshmem',  # Your current branch
        '-f', f'run_id={run_id}',
        '-f', f'payload={encoded}',
        '-f', 'runner=amdgpu-mi300-x86-64'
    ]

    print(f"Run ID: {run_id}")

    # shlex.join quotes arguments so the printed command can be pasted into a shell
    print("\nTriggering workflow with command:")
    print(shlex.join(cmd))

    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode == 0:
        print("\n✓ Workflow triggered successfully!")
        print("\nTo view the run status:")
        print("gh run list --workflow=amd_workflow.yml -L 1")
        # gh run watch takes a run ID, or prompts for a run when given none
        print("\nTo watch the run:")
        print("gh run watch")
    else:
        print("\n✗ Failed to trigger workflow:")
        print(result.stderr)
        sys.exit(1)

if __name__ == '__main__':
    main()
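The script above packs the payload as JSON, then zlib, then base64. A minimal round-trip sketch of that encoding follows. The function names `encode_payload` and `decode_payload` and the `example` dict are hypothetical, and the decode side is only an assumption about how a receiver would reverse the steps, not a copy of the launcher's code.

```python
import base64
import json
import zlib


def encode_payload(payload: dict) -> str:
    """JSON-serialize, zlib-compress, then base64-encode (same order as the script)."""
    raw = json.dumps(payload).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("utf-8")


def decode_payload(encoded: str) -> dict:
    """Reverse the steps: base64-decode, zlib-decompress, then parse JSON."""
    raw = zlib.decompress(base64.b64decode(encoded))
    return json.loads(raw.decode("utf-8"))


# Hypothetical payload, used only to show that the round trip is lossless.
example = {"lang": "py", "mode": "test", "sources": {"rocshmem_test.py": "print('hi')"}}
encoded = encode_payload(example)
decoded = decode_payload(encoded)
print(decoded == example)
```

Running the round trip locally before triggering the workflow is a cheap way to rule out encoding problems when a job reports an unexpected error.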

danielhua23 · 2025-09-23T15:13:48Z

This error is expected if the Docker image hasn't been rebuilt from my new Dockerfile. That's why I have been asking how to rebuild the image and get the test code running against the new Dockerfile in this PR lol.
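One way to confirm whether an image already contains the header is to check the include directories from the failing compile command. The sketch below assumes the command is available as a single string, as in the stdout above; `include_dirs` and `find_header` are hypothetical helper names.

```python
import shlex
from pathlib import Path


def include_dirs(compile_cmd: str) -> list[str]:
    """Collect directories passed via -I<dir>, -I <dir>, or -isystem <dir>."""
    tokens = shlex.split(compile_cmd)
    dirs = []
    for i, tok in enumerate(tokens):
        if tok in ("-I", "-isystem") and i + 1 < len(tokens):
            dirs.append(tokens[i + 1])
        elif tok.startswith("-I") and len(tok) > 2:
            dirs.append(tok[2:])
    return dirs


def find_header(compile_cmd: str, header: str) -> list[str]:
    """Return the include directories that actually contain the header file."""
    return [d for d in include_dirs(compile_cmd) if (Path(d) / header).is_file()]


# Shortened version of the compile command from the log above.
cmd = (
    "clang++ -isystem /usr/include/python3.10 -I/opt/rocm/include "
    "-I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c main.cpp"
)
print(include_dirs(cmd))
print(find_header(cmd, "rocshmem.hpp"))
```

An empty list from `find_header` inside the container would match the "file not found" error and suggest the runner is still using an image built without rocshmem.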

saienduri · 2025-09-23T16:34:10Z

Can you try now @danielhua23? Test runner has to be manually updated when switching the branch we build the docker from.

msaroufim · 2025-09-23T16:58:33Z

@saienduri I'm seeing the same issue still, do you guys mind coordinating a fix synchronously? We're starting problem 3 soon and we still don't have support for this

{"success": true, "error": "", "system": {"gpu": "AMD Instinct MI300X VF", "device_count": 1, "cpu": "INTEL(R) XEON(R) PLATINUM 8568Y+", "runtime": "ROCm", "platform": "Linux-6.1.0-35-amd64-x86_64-with-glibc2.35", "torch": "2.10.0.dev20250916+rocm6.3"}, "runs": {"test": {"start": "2025-09-23T16:55:33.158776", "end": "2025-09-23T16:55:41.875478", "compilation": null, "run": {"success": true, "passed": false, "command": "python rocshmem_test.py test /tmp/tmp6586x0u_", "stdout": "=== ROCshmem PyTorch Inline Test ===\n[1/2] clang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=rocshmem_test -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.10 -fPIC -std=c++17 -I/opt/rocm/include -I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c /home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp -o main.o -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC\nFAILED: [code=1] main.o \nclang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=rocshmem_test -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.10 -fPIC -std=c++17 -I/opt/rocm/include -I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c /home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp -o main.o -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC\n/home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp:3:14: fatal error: 'rocshmem.hpp' file not found\n    #include <rocshmem.hpp>\n             ^~~~~~~~~~~~~~\n1 error generated.\nninja: build stopped: subcommand failed.\nROCshmem test failed: Error building extension 'rocshmem_test'\n", "stderr": "", "exit_code": 0, "duration": 8.706833689007908, "result": {}}, "profile": null}}}

saienduri · 2025-09-24T05:20:44Z

Latest run looks good: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17966165042
Can we merge @msaroufim?

msaroufim · 2025-09-24T15:30:11Z

Just kicked off CI again

add rocshmem support

b205fa1

danielhua23 added 3 commits September 23, 2025 01:40

pin torch version

8270b21

add back iris

84e3941

rm space

cbaf09f

saienduri self-requested a review September 24, 2025 05:20

saienduri approved these changes Sep 24, 2025

View reviewed changes

Trigger CI

57b5c50

msaroufim merged commit c613971 into main Sep 24, 2025
5 of 6 checks passed

msaroufim mentioned this pull request Sep 24, 2025

rocshmem dependencies #349

Closed

msaroufim merged 5 commits into main from danie/rocshmem
