Enhancement: Add GPU NUMA Topology Awareness to Scheduler #4998

@zjj2wry

What is the problem you're trying to solve

The Volcano scheduler currently supports CPU NUMA topology awareness through the numaaware plugin, but it does not consider GPU NUMA topology when scheduling GPU workloads.

Example scenario:

Consider a cluster with two GPU nodes and a task requesting 4 GPUs:

Node A:

  • 8 GPUs total (GPUs 0-7)
  • GPUs 0-3 are on NUMA node 0
  • GPUs 4-7 are on NUMA node 1
  • Available: GPUs 0-2 on NUMA 0, GPU 7 on NUMA 1

Node B:

  • 8 GPUs total (GPUs 0-7)
  • GPUs 0-3 are on NUMA node 0
  • GPUs 4-7 are on NUMA node 1
  • Available: GPUs 0-5 (all four GPUs on NUMA 0, plus GPUs 4-5 on NUMA 1)

Without GPU NUMA awareness, the scheduler might choose Node A and allocate GPUs 0, 1, 2, and 7, which spans both NUMA nodes. With GPU NUMA awareness, the scheduler should prefer Node B, where it can allocate GPUs 0-3 from a single NUMA node. Keeping the allocation within one NUMA node avoids cross-NUMA traffic and should give better performance.
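To make the comparison concrete, here is a minimal sketch of how the two nodes could be ranked. It is written in Python for brevity and is not Volcano code: the function name `min_numa_span`, the node names, and the data layout are all hypothetical. It assumes the scheduler knows which GPUs are free and which NUMA node each GPU belongs to, and it ranks nodes by the fewest NUMA nodes a request would have to span.

```python
from collections import Counter


def min_numa_span(free_gpus, gpu_to_numa, requested):
    """Return the fewest NUMA nodes needed to satisfy the request,
    or None if the node does not have enough free GPUs.

    Hypothetical helper for illustration only.
    """
    free_per_numa = Counter(gpu_to_numa[g] for g in free_gpus)
    if sum(free_per_numa.values()) < requested:
        return None
    remaining, span = requested, 0
    # Taking the NUMA nodes with the most free GPUs first minimizes the span.
    for count in sorted(free_per_numa.values(), reverse=True):
        if remaining <= 0:
            break
        remaining -= count
        span += 1
    return span


# Both nodes in the scenario share the same layout: GPUs 0-3 on NUMA 0, 4-7 on NUMA 1.
gpu_to_numa = {gpu: 0 if gpu < 4 else 1 for gpu in range(8)}
free = {
    "node-a": [0, 1, 2, 7],
    "node-b": [0, 1, 2, 3, 4, 5],
}

spans = {name: min_numa_span(gpus, gpu_to_numa, 4) for name, gpus in free.items()}
candidates = {name: span for name, span in spans.items() if span is not None}
best = min(candidates, key=candidates.get)

print(spans)  # {'node-a': 2, 'node-b': 1}
print(best)   # node-b
```

A real scoring function would likely fold this span into the plugin's existing node score rather than use it as the only criterion.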

Describe the solution you'd like

Extend the numaaware plugin to support GPU NUMA topology awareness:

  1. GPU Topology Information Collection
  2. GPU HintProvider Implementation
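As a rough illustration of what a GPU hint provider might compute, the sketch below enumerates the sets of NUMA nodes that could satisfy a GPU request and marks the narrowest ones as preferred. It is loosely modeled on how the Kubernetes Topology Manager treats topology hints and is written in Python for brevity. The function name `gpu_topology_hints` and the dictionary-based hint format are assumptions made for this example, not the numaaware plugin's actual interface.

```python
from itertools import combinations


def gpu_topology_hints(free_gpus_by_numa, requested):
    """Return one hint per set of NUMA nodes that can satisfy the request.

    free_gpus_by_numa maps a NUMA node ID to the list of free GPU IDs on it.
    A hint is preferred when it uses the smallest feasible number of NUMA nodes.
    Hypothetical helper and hint format, for illustration only.
    """
    numa_ids = sorted(free_gpus_by_numa)
    feasible = []
    for size in range(1, len(numa_ids) + 1):
        for subset in combinations(numa_ids, size):
            free_count = sum(len(free_gpus_by_numa[n]) for n in subset)
            if free_count >= requested:
                feasible.append(subset)
    if not feasible:
        return []
    narrowest = min(len(subset) for subset in feasible)
    return [
        {"numa_nodes": subset, "preferred": len(subset) == narrowest}
        for subset in feasible
    ]


# Node A from the scenario: the request can only be met by spanning both NUMA nodes.
hints_a = gpu_topology_hints({0: [0, 1, 2], 1: [7]}, requested=4)
# Node B from the scenario: NUMA 0 alone can satisfy the request.
hints_b = gpu_topology_hints({0: [0, 1, 2, 3], 1: [4, 5]}, requested=4)

print(hints_a)  # [{'numa_nodes': (0, 1), 'preferred': True}]
print(hints_b)  # [{'numa_nodes': (0,), 'preferred': True}, {'numa_nodes': (0, 1), 'preferred': False}]
```

In this sketch the hints only describe what is feasible on a single node. Merging them with the CPU hints and comparing nodes against each other would remain the job of the plugin's existing policy logic.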

Additional context

No response

Documentation Updates

  • This feature requires design or user documentation changes.
  • If documentation changes are required, I will ensure the relevant documents are updated and published to the Volcano official website (https://volcano.sh) via the volcano-sh/website repository.

Metadata

Labels

kind/feature: Categorizes issue or PR as related to a new feature.