Inqueue PodGroups occupy queue resources even when pods cannot be scheduled #5006

@zjj2wry

What is the problem you're trying to solve

When a PodGroup is moved to the Inqueue state, it immediately reserves resources from the queue's capacity. However, if the pods cannot be scheduled for reasons such as:

  • Node affinity/anti-affinity constraints
  • Insufficient resources per node (e.g., requesting 8 GPUs but max 4 GPUs/node)
  • ...

The PodGroup then stays in the Inqueue state indefinitely, blocking queue resources that other jobs could use (resource waste).
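
To make the 8-GPU example concrete, here is a minimal sketch of how an aggregate queue-level check can pass while no single node can host the pod. It is a simplified model, not Volcano's actual enqueue code: `queueAdmits` and `anyNodeFits` are hypothetical helpers, and the node sizes are made up for illustration.

```go
package main

import "fmt"

// queueAdmits is a hypothetical queue-level check: it only compares the
// request against the queue's aggregate free capacity.
func queueAdmits(requestGPUs, queueFreeGPUs int) bool {
	return requestGPUs <= queueFreeGPUs
}

// anyNodeFits is a hypothetical node-level check: a pod needs all of its
// GPUs on one node, so each node is tested individually.
func anyNodeFits(requestGPUs int, nodeFreeGPUs []int) bool {
	for _, free := range nodeFreeGPUs {
		if requestGPUs <= free {
			return true
		}
	}
	return false
}

func main() {
	nodes := []int{4, 4, 4, 4} // assumed cluster: four nodes, 4 free GPUs each
	total := 0
	for _, free := range nodes {
		total += free
	}

	// A pod requesting 8 GPUs fits the queue's aggregate capacity (16)...
	fmt.Println("queue admits:", queueAdmits(8, total))
	// ...but no single node has 8 GPUs, so it can never be placed.
	fmt.Println("some node fits:", anyNodeFits(8, nodes))
}
```

In this model the PodGroup would be admitted and hold 8 GPUs of queue capacity even though scheduling can never succeed.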

Describe the solution you'd like

Implement an Inqueue timeout mechanism with the following behavior:

  1. Queue-level default timeout (e.g., spec.inqueueTimeout: 10m)
  2. Record timestamp when PodGroup enters Inqueue state
  3. Timeout action: When timeout is reached and no pods have been successfully scheduled, transition PodGroup back to Pending state
  4. Release reserved queue resources

Additional context

No response

Documentation Updates

  • This feature requires design or user documentation changes.
  • If documentation changes are required, I will ensure the relevant documents are updated and published to the Volcano official website (https://volcano.sh) via the volcano-sh/website repository.

Metadata

Labels

kind/feature: Categorizes issue or PR as related to a new feature.
