Open
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
Description
What is the problem you're trying to solve
When a PodGroup moves to the Inqueue state, it immediately reserves resources from the queue's capacity. However, its pods may be unschedulable for reasons such as:
- Node affinity/anti-affinity constraints
- Insufficient resources per node (e.g., requesting 8 GPUs but max 4 GPUs/node)
- ...
The PodGroup then stays in the Inqueue state indefinitely, holding queue resources that other jobs could use (resource waste).
Describe the solution you'd like
Implement an Inqueue timeout mechanism with the following behavior:
- Queue-level default timeout (e.g., spec.inqueueTimeout: 10m)
- Record timestamp when PodGroup enters Inqueue state
- Timeout action: When timeout is reached and no pods have been successfully scheduled, transition PodGroup back to Pending state
- Release reserved queue resources
Additional context
No response
Documentation Updates
- This feature requires design or user documentation changes.
- If documentation changes are required, I will ensure the relevant documents are updated and published to the Volcano official website (https://volcano.sh) via the volcano-sh/website repository.