Dra support #359

wangqianqianjun · 2025-09-14T14:07:39Z

Early Filtering Optimization
Pre-filters GPUs by phase and model before expensive CEL evaluation to reduce processing overhead.
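As a rough illustration of the idea, the sketch below runs cheap field comparisons first so that the costly predicate only sees GPUs that could match. The `GPU` struct, `preFilter`, and `filterGPUs` are hypothetical names invented for this example, and the `expensive` callback merely stands in for a compiled CEL program; none of this is taken from the PR's code.

```go
package main

import "fmt"

// GPU is a hypothetical, trimmed-down stand-in for the real GPU resource.
type GPU struct {
	Name  string
	Phase string
	Model string
}

// preFilter keeps only GPUs in the wanted phase and, when wantModel is
// non-empty, of the wanted model. These are plain string comparisons.
func preFilter(gpus []GPU, wantPhase, wantModel string) []GPU {
	out := make([]GPU, 0, len(gpus))
	for _, g := range gpus {
		if g.Phase != wantPhase {
			continue
		}
		if wantModel != "" && g.Model != wantModel {
			continue
		}
		out = append(out, g)
	}
	return out
}

// filterGPUs applies the cheap pre-filter first and the expensive predicate
// (a placeholder for CEL evaluation) only to the survivors.
func filterGPUs(gpus []GPU, wantPhase, wantModel string, expensive func(GPU) bool) []GPU {
	var out []GPU
	for _, g := range preFilter(gpus, wantPhase, wantModel) {
		if expensive(g) {
			out = append(out, g)
		}
	}
	return out
}

func main() {
	gpus := []GPU{
		{Name: "gpu-0", Phase: "Running", Model: "A100"},
		{Name: "gpu-1", Phase: "Pending", Model: "A100"},
		{Name: "gpu-2", Phase: "Running", Model: "H100"},
	}
	calls := 0
	got := filterGPUs(gpus, "Running", "A100", func(GPU) bool { calls++; return true })
	fmt.Println(len(got), calls) // the expensive predicate ran once, not three times
}
```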
CEL Parallel Processing
Automatically enables parallel evaluation for large datasets (≥2000 GPUs) using worker goroutines with dynamic chunking.
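One way such a scheme could look is sketched below. Only the 2000-GPU threshold comes from the description above; the function name `filterParallel`, the use of `runtime.NumCPU()` as the worker count, and the ceiling-division chunk size are assumptions made for illustration. The `eval` callback stands in for CEL evaluation and must be safe to call concurrently.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelThreshold mirrors the ">= 2000 GPUs" figure from the description.
const parallelThreshold = 2000

// filterParallel returns, in ascending order, the indices i in [0, n) for
// which eval(i) is true. Small inputs are evaluated sequentially; large ones
// are split into one contiguous chunk per worker goroutine.
func filterParallel(n int, eval func(i int) bool) []int {
	keep := make([]bool, n)
	workers := runtime.NumCPU() // assumed worker count
	if n < parallelThreshold || workers < 2 {
		for i := 0; i < n; i++ {
			keep[i] = eval(i)
		}
	} else {
		chunk := (n + workers - 1) / workers // chunk size scales with input size
		var wg sync.WaitGroup
		for start := 0; start < n; start += chunk {
			end := start + chunk
			if end > n {
				end = n
			}
			wg.Add(1)
			go func(start, end int) {
				defer wg.Done()
				// Each goroutine writes a disjoint index range, so no lock is needed.
				for i := start; i < end; i++ {
					keep[i] = eval(i)
				}
			}(start, end)
		}
		wg.Wait()
	}
	out := make([]int, 0, n)
	for i, ok := range keep {
		if ok {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	even := func(i int) bool { return i%2 == 0 }
	fmt.Println(len(filterParallel(10, even)), len(filterParallel(5000, even)))
}
```

Writing results into a pre-sized slice indexed by position keeps the output order stable regardless of which goroutine finishes first.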
Zero Memory Allocation
Eliminates map allocations through ZeroAllocActivation and lazy caching of GPU field values to minimize GC pressure.
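The following is a minimal sketch of the general technique, not the PR's `ZeroAllocActivation`. It resolves variable names with a `switch` over struct fields instead of building a `map[string]any` per GPU, reuses one activation object across GPUs, and caches a derived value on first access. The `GPU` struct, the variable names, and `gpuActivation` are hypothetical; the `ResolveName(name string) (any, bool)` method shape is modelled on the lookup method of cel-go's `interpreter.Activation` interface, which a real implementation would also need to satisfy in full.

```go
package main

import "fmt"

// GPU is a hypothetical, trimmed-down stand-in for the real GPU resource.
type GPU struct {
	Name          string
	Model         string
	AvailableVRAM int64 // bytes
}

// gpuActivation resolves variables directly from a GPU without a per-GPU map.
type gpuActivation struct {
	gpu      *GPU
	vramGiB  float64 // lazily derived value
	vramDone bool    // whether vramGiB is valid for the current GPU
	computed int     // counts derivations; for demonstration only
}

// Reset points the activation at another GPU and invalidates cached values,
// so a single activation can be reused for every GPU in a batch.
func (a *gpuActivation) Reset(g *GPU) {
	a.gpu = g
	a.vramDone = false
}

// ResolveName returns the value bound to name, deriving it at most once per GPU.
func (a *gpuActivation) ResolveName(name string) (any, bool) {
	switch name {
	case "name":
		return a.gpu.Name, true
	case "model":
		return a.gpu.Model, true
	case "available_vram_gib":
		if !a.vramDone {
			a.vramGiB = float64(a.gpu.AvailableVRAM) / (1 << 30)
			a.vramDone = true
			a.computed++
		}
		return a.vramGiB, true
	}
	return nil, false
}

func main() {
	act := &gpuActivation{}
	act.Reset(&GPU{Name: "gpu-0", Model: "A100", AvailableVRAM: 40 << 30})
	v1, _ := act.ResolveName("available_vram_gib")
	v2, _ := act.ResolveName("available_vram_gib")
	fmt.Println(v1, v2, act.computed)
}
```

Returning values through `any` can still box them, so this sketch only removes the per-GPU map; it makes no claim of being allocation-free.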

goos: linux
goarch: amd64
pkg: github.com/NexusGPU/tensor-fusion/internal/gpuallocator/filter/cel_filter
cpu: 13th Gen Intel(R) Core(TM) i7-13700KF
BenchmarkFilterPerformance/OriginalFilters-24         39    30419082 ns/op   15205268 B/op        20 allocs/op
BenchmarkFilterPerformance/CELFilter_Basic-24         51    23382077 ns/op    8003896 B/op         8 allocs/op
BenchmarkFilterPerformance/CELFilter_Complex-24       12    94239518 ns/op   57643526 B/op   2471372 allocs/op
BenchmarkFilterPerformance/CELFilter_CacheMiss-24     10   112866842 ns/op   82081142 B/op   3528066 allocs/op

* chore: lint issue
* fix: kubernetes upgrade, fix scheduler deps issue
* fix: upgrade k8s version to 1.34, use fixed operator version in helm chart
* fix: update shm path
* chore: comment & wording
* fix: connection naming
* fix: upgrade github action
* fix: add test for dedicated gpu allocation mode

…k domain name, virtual cap calculation (NexusGPU#357)
* fix: virtual tflops/vram not calculated bug
* fix: extract GPU map update logic into separate method and fix webhook domain name
* fix: nvidia device plugin compatible mode state consistent issue
* fix: nvidia device plugin compatible mode issue

… calculations
- Added DRA device attribute constants for improved resource management
- Updated ResourceSlice controller to calculate node-level virtual capacities based on oversubscription configurations
- Enhanced device generation logic to include virtual capacities and additional attributes for better GPU tracking
- Implemented field indexing for ResourceSlice by nodeName to optimize queries

- Bump version of github.com/google/cel-go to v0.26.1
- Update Kubernetes dependencies to v0.34.1 for api, apiserver, component-base, dynamic-resource-allocation, kubelet, and component-helpers
- Add replace directives for local Kubernetes vendor sources

- Removed unnecessary GPUPool retrieval and oversubscription calculations
- Updated virtual capacity calculations to use equal distribution for GPUs
- Cleaned up code by eliminating the calculateNodeVirtualCapacity function

… updates
- Added logic to update ResourceClaim's capacity requests and device count based on Pod annotations
- Implemented idempotency checks to avoid unnecessary updates
- Improved error handling and logging for better traceability
- Introduced new tests to validate the updates and ensure correct behavior

- Introduced a comprehensive user guide for Tensor Fusion DRA, detailing setup, configuration, and usage for various stakeholders.
- Enhanced ResourceSlice controller to retrieve and utilize default QoS from GPUPool for improved resource allocation.
- Updated device generation logic to include QoS attributes, ensuring better tracking and management of GPU resources.
- Added constants for QoS attributes to facilitate future development and integration.

wangqianqianjun and others added 24 commits August 25, 2025 06:46

support cel filter

c345426

Merge branch 'NexusGPU:main' into main

10251e4

covert allocator request to cel filter

7be8e25

support annotaion cel

fc26511

remove deperate config

6978807

remove docs

d3c112a

chore(deps): bump golang from 1.24 to 1.25 in /dockerfile (NexusGPU#325)

d466cda

chore(deps): bump cycjimmy/semantic-release-action from 4 to 5 (Nexus…

8bd5e89

…GPU#338)

fix: helm chart issue (NexusGPU#346)

67b1c64

Merge branch 'NexusGPU:main' into main

5dc9c79

chore(deps): bump k8s.io/kubernetes (NexusGPU#347)

dbc088c

fix: Potential fix for code scanning alert no. 36: Workflow does not …

865bdf5

…contain permissions (NexusGPU#349) Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

support dedicated-gpu (NexusGPU#345)

9006e96

* support dedicated gpus
* support dedicated GPU
* support dedicated GPU
* fix test issue

fix: skip gpu limiter not working issue, observability optimize (Nexu…

0389852

…sGPU#350)
* fix: skip gpu limiter not working issue
* fix: avoid k8s QoS side effect for inject lib init container
* fix: potential panic issues
* fix: remove unused event

fix: init pricing overwrite vram to 0 (NexusGPU#351)

c0a3500

* support dedicated gpus
* support dedicated GPU
* support dedicated GPU
* fix test issue
* fix init pricing override vran
* Revert "fix init pricing override vran" This reverts commit d0bea18.
* fix init pricing override vram

fix: add node hash for gpu k8s node, owner ref for hypervisor, isolat…

f25c65d

…e shm (NexusGPU#352)

cel fliter enhancement

52d4fd2

fix: dedicated gpu annotation causing webhook failure issue (NexusGPU…

e55e53d

…#356)

cel fix phase filter

52dc0a4

disable predicate fast path

cd1d7dd

fix lint issue

f700eac

Merge branch 'main' into dra

8503585

wangqianqianjun changed the title ~~dra-cel filter Performance Optimization~~ dra: cel filter Performance Optimization Sep 14, 2025

dependabot bot added 5 commits September 15, 2025 12:14

chore(deps): bump github.com/aws/aws-sdk-go-v2 from 1.38.3 to 1.39.0 (N…

de5b0c1

…exusGPU#362)

chore(deps): bump gorm.io/gorm from 1.30.3 to 1.31.0 (NexusGPU#361)

3d9b2c4

chore(deps): bump k8s.io/client-go from 0.34.0 to 0.34.1 (NexusGPU#364)

ec36d4a

chore(deps): bump k8s.io/component-helpers from 0.34.0 to 0.34.1 (Nex…

40b98a8

…usGPU#360)

chore(deps): bump sigs.k8s.io/controller-runtime from 0.22.0 to 0.22.1 (

a45ba60

NexusGPU#363)

Code2Life and others added 5 commits September 17, 2025 22:13

feat: preempt support for GPU workers (NexusGPU#366)

5867f3c

* fix: gpu info update
* feat: preempt scheduling, fix metrics scheduling bugs, add evict protection
* fix: unit test issue
* fix: preempt unit testing
* fix: lint issue, add qos to priorityClassName converting

fix: add resource validation in Bind to prevent GPU over-allocation (N…

4fc9dc9

…exusGPU#365)
- Add double-check for TFLOPs and VRAM availability before allocation

webhook & gpu resource fit dra support

5f25794

resource template support

4959c61

support resource claim cel builder

ff9efd2

wangqianqianjun force-pushed the dra branch from ff9efd2 to 4fc9dc9 September 24, 2025 14:53

wangqianqianjun added 8 commits September 24, 2025 08:01

fix conflict

f48f00a

fix conflict for gpuresources.go

1afc62d

1. support resource slice build and destory 2. make resource slice bu…

efbce3f

…ild and dra request build in the same logic

feat: Added DRA CEL filter support

7d95fef

- Implemented DRA CEL filters in GPU allocation requests
- Added benchmarks for basic and complex expressions
- Updated the resource slice controller to support Kubernetes hostname labels

wangqianqianjun changed the title ~~dra: cel filter Performance Optimization~~ Dra support Oct 14, 2025
