Skip to content

Conversation

@SolarWindRider
Copy link

What does this PR do?

This PR fixes a device-check bug verl/utils/device.py that frequently occurs when running VeRL on Ascend NPUs in cloud server container , as well as an error in the launch command in docs/ascend_tutorial/ascend_quick_start.

Updated the npu-smi command to retrieve version information more broadly and modified version parsing logic. Previous method need higher permission (sudo), causing error in cloud server sometimes .
Because `override_model_config` in **fsdp_workers.py** changes the model’s `attn_implementation` to **Flash Attention**, which is not currently supported on **Ascend**, the NPU must use **SDPA** or **eager attention** here instead.
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a bug in device detection for Ascend NPUs and updates a launch command in the documentation. The change to verl/utils/device.py makes the npu-smi command more general, which is a good improvement for compatibility. However, the logic for parsing the software version has become fragile. I've provided a suggestion to make the version parsing more robust using a regular expression to prevent potential issues with incorrect version detection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant