Conversation
|
@msaroufim @saienduri could you pls help take a look? |
|
@danielhua23 this isn't quite working yet, but I found an easier way to test your code. I triggered this action from a script: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17949472670. The run is green, but run_result crashed with this error.
Here's an easier script for you to test stuff out:
#!/usr/bin/env python3
"""Test script to trigger AMD workflow with ROCshmem payload"""
import base64
import json
import subprocess
import sys
import uuid
import zlib


def main():
    # Load the test payload
    with open('scripts/rocshmem_test_payload.json', 'r') as f:
        payload_dict = json.load(f)

    # Compress and encode the payload (same as GitHub launcher does)
    payload_json = json.dumps(payload_dict)
    compressed = zlib.compress(payload_json.encode('utf-8'))
    encoded = base64.b64encode(compressed).decode('utf-8')
    print(f"Original payload size: {len(payload_json)} bytes")
    print(f"Compressed size: {len(compressed)} bytes")
    print(f"Encoded size: {len(encoded)} bytes")

    # Generate a run ID
    run_id = str(uuid.uuid4())

    # Trigger the workflow using gh CLI
    cmd = [
        'gh', 'workflow', 'run', 'amd_workflow.yml',
        '--ref', 'danie/rocshmem',  # Your current branch
        '-f', f'run_id={run_id}',
        '-f', f'payload={encoded}',
        '-f', 'runner=amdgpu-mi300-x86-64'
    ]
    print(f"Run ID: {run_id}")
    print("\nTriggering workflow with command:")
    print(' '.join(cmd))

    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        print("\n✓ Workflow triggered successfully!")
        print("\nTo view the run status:")
        print("gh run list --workflow=amd_workflow.yml -L 1")
        print("\nTo watch the run:")
        # gh run watch takes a run ID (or prompts for one); it has no --workflow flag
        print("gh run watch")
    else:
        print("\n✗ Failed to trigger workflow:")
        print(result.stderr)
        sys.exit(1)


if __name__ == '__main__':
    main() |
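A quick way to sanity-check the payload before triggering the workflow is to round-trip it locally. The sketch below mirrors the compress-then-base64 step from the script above and reverses it. It assumes the workflow side decodes in the opposite order (base64 decode, zlib decompress, JSON parse), which the script does not show. The helper names `encode_payload` and `decode_payload` and the sample payload are hypothetical.

```python
import base64
import json
import zlib


def encode_payload(payload: dict) -> str:
    """Serialize to JSON, zlib-compress, then base64-encode (as in the script above)."""
    raw = json.dumps(payload).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("utf-8")


def decode_payload(encoded: str) -> dict:
    """Assumed inverse: base64-decode, zlib-decompress, then parse the JSON."""
    compressed = base64.b64decode(encoded)
    return json.loads(zlib.decompress(compressed).decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical payload; the real one is scripts/rocshmem_test_payload.json.
    sample = {"lang": "py", "sources": {"submission.py": "print('hi')"}}
    assert decode_payload(encode_payload(sample)) == sample
    print("round trip OK")
```

If the round trip fails on the real payload file, the problem is in the payload itself rather than in the runner or the Docker image.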
|
This error is expected if the Docker image hasn't been rebuilt from my new Dockerfile. That's why I've been asking how to rebuild the image from the new Dockerfile in this PR and get it working with the test code lol. |
|
Can you try now @danielhua23? The test runner has to be updated manually whenever we switch the branch the Docker image is built from. |
|
@saienduri I'm still seeing the same issue. Do you guys mind coordinating a fix synchronously? We're starting problem 3 soon and we still don't have support for this. |
|
Latest run looks good: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17966165042 |
|
Just kicked off CI again |
Description
derivation of #349