wangqianqianjun
Collaborator

@wangqianqianjun wangqianqianjun commented Sep 14, 2025

  1. Early Filtering Optimization
    Pre-filters GPUs by phase and model before expensive CEL evaluation to reduce processing overhead.

  2. CEL Parallel Processing
    Automatically enables parallel evaluation for large datasets (≥2000 GPUs), using worker goroutines with dynamic chunking.

  3. Zero Memory Allocation
    Eliminates map allocations through ZeroAllocActivation and lazy caching of GPU field values to minimize GC pressure.
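
To illustrate items 1 and 2, here is a minimal sketch of cheap pre-filtering followed by chunked parallel evaluation. It is not the code from this PR: the `GPU` struct, `preFilter`, `filterGPUs`, and `filterChunk` are hypothetical names, and the `match` callback stands in for the compiled CEL program. Only the 2000-GPU threshold comes from the description above.

```go
package main

import (
	"runtime"
	"sync"
)

// GPU is a hypothetical, trimmed-down stand-in for the real GPU resource.
type GPU struct {
	Name             string
	Phase            string
	Model            string
	AvailableVRAMGiB int
}

// parallelThreshold mirrors the ≥2000 GPU cutoff from the PR description.
const parallelThreshold = 2000

// preFilter drops GPUs using cheap field comparisons so the expensive
// predicate only sees plausible candidates. An empty phase or model
// means "do not filter on this field".
func preFilter(gpus []GPU, phase, model string) []GPU {
	out := make([]GPU, 0, len(gpus))
	for _, g := range gpus {
		if phase != "" && g.Phase != phase {
			continue
		}
		if model != "" && g.Model != model {
			continue
		}
		out = append(out, g)
	}
	return out
}

// filterChunk applies match sequentially to one slice of GPUs.
func filterChunk(gpus []GPU, match func(GPU) bool) []GPU {
	var out []GPU
	for _, g := range gpus {
		if match(g) {
			out = append(out, g)
		}
	}
	return out
}

// filterGPUs runs match sequentially for small inputs and in parallel
// chunks for large ones. The chunk size is derived from the input length
// and GOMAXPROCS; input order is preserved in the result.
// match must be safe to call from multiple goroutines.
func filterGPUs(gpus []GPU, match func(GPU) bool) []GPU {
	if len(gpus) < parallelThreshold {
		return filterChunk(gpus, match)
	}
	workers := runtime.GOMAXPROCS(0)
	chunk := (len(gpus) + workers - 1) / workers
	numChunks := (len(gpus) + chunk - 1) / chunk
	results := make([][]GPU, numChunks)

	var wg sync.WaitGroup
	for i := 0; i < numChunks; i++ {
		start := i * chunk
		end := start + chunk
		if end > len(gpus) {
			end = len(gpus)
		}
		wg.Add(1)
		go func(i, start, end int) {
			defer wg.Done()
			// Each goroutine writes only its own slot, so no lock is needed.
			results[i] = filterChunk(gpus[start:end], match)
		}(i, start, end)
	}
	wg.Wait()

	var out []GPU
	for _, r := range results {
		out = append(out, r...)
	}
	return out
}
```

A caller would run `preFilter(gpus, "Running", "A100")` first and pass the result to `filterGPUs` together with the expensive predicate. Below the threshold the goroutine setup would likely cost more than it saves, which is presumably why the PR gates parallelism on dataset size.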

goos: linux
goarch: amd64
pkg: github.com/NexusGPU/tensor-fusion/internal/gpuallocator/filter/cel_filter
cpu: 13th Gen Intel(R) Core(TM) i7-13700KF
BenchmarkFilterPerformance/OriginalFilters-24         39    30419082 ns/op   15205268 B/op        20 allocs/op
BenchmarkFilterPerformance/CELFilter_Basic-24         51    23382077 ns/op    8003896 B/op         8 allocs/op
BenchmarkFilterPerformance/CELFilter_Complex-24       12    94239518 ns/op   57643526 B/op   2471372 allocs/op
BenchmarkFilterPerformance/CELFilter_CacheMiss-24     10   112866842 ns/op   82081142 B/op   3528066 allocs/op
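
The allocs/op column is what the lazy-caching activation from item 3 is meant to reduce. The sketch below shows one way such an activation could look; it is an assumption, not the PR's `ZeroAllocActivation`. The local `Activation` interface mirrors, to my understanding, the two-method shape of cel-go's `interpreter.Activation` and is declared here so the example compiles without the cel-go module. `GPUInfo`, `gpuActivation`, and the variable names `phase`, `model`, `name`, and `available_vram_gib` are hypothetical.

```go
package main

// Activation mirrors the shape of cel-go's interpreter.Activation
// (ResolveName and Parent). Declared locally so the sketch is standalone.
type Activation interface {
	ResolveName(name string) (any, bool)
	Parent() Activation
}

// GPUInfo is a hypothetical, trimmed-down view of a GPU.
type GPUInfo struct {
	Name             string
	Phase            string
	Model            string
	AvailableVRAMGiB int64
}

// gpuActivation resolves variables straight from struct fields instead of
// building a map[string]any per GPU. Each value is converted to an
// interface at most once, on first use, and then served from the cache.
type gpuActivation struct {
	gpu                      *GPUInfo
	name, phase, model, vram any // lazily filled caches; nil means "not read yet"
}

// Reset points the activation at another GPU and clears the caches, so a
// single activation can be reused for every GPU a worker evaluates.
func (a *gpuActivation) Reset(g *GPUInfo) {
	*a = gpuActivation{gpu: g}
}

// ResolveName returns the value bound to name, or false if it is unknown.
func (a *gpuActivation) ResolveName(name string) (any, bool) {
	switch name {
	case "name":
		if a.name == nil {
			a.name = a.gpu.Name
		}
		return a.name, true
	case "phase":
		if a.phase == nil {
			a.phase = a.gpu.Phase
		}
		return a.phase, true
	case "model":
		if a.model == nil {
			a.model = a.gpu.Model
		}
		return a.model, true
	case "available_vram_gib":
		if a.vram == nil {
			a.vram = a.gpu.AvailableVRAMGiB
		}
		return a.vram, true
	}
	return nil, false
}

// Parent returns nil because this activation has no enclosing scope.
func (a *gpuActivation) Parent() Activation { return nil }
```

With this shape, fields that an expression never references are never boxed, and a worker can call `Reset` between GPUs instead of allocating a new activation each time.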

wangqianqianjun and others added 24 commits August 25, 2025 06:46
…contain permissions (NexusGPU#349)

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* support dedicated gpus

* support dedicated GPU

* support dedicated GPU

* fix test issue
…sGPU#350)

* fix: skip gpu limiter not working issue

* fix: avoid k8s QoS side effect for inject lib init container

* fix: potential panic issues

* fix: remove unused event
* support dedicated gpus

* support dedicated GPU

* support dedicated GPU

* fix test issue

* fix init pricing override vran

* Revert "fix init pricing override vran"

This reverts commit d0bea18.

* fix init pricing override vram
* chore: lint issue

* fix: kubernetes upgrade, fix scheduler deps issue

* fix: upgrade k8s version to 1.34, use fixed operator version in helm chart

* fix: update shm path

* chore: comment & wording

* fix: connection naming

* fix: upgrade github action

* fix: add test for dedicated gpu allocation mode
…k domain name, virtual cap calculation (NexusGPU#357)

* fix: virtual tflops/vram not calculated bug

* fix: extract GPU map update logic into separate method and fix webhook domain name

* fix: nvidia device plugin compatible mode state consistent issue

* fix: nvidia device plugin compatible mode issue
@wangqianqianjun wangqianqianjun changed the title from "dra-cel filter Performance Optimization" to "dra: cel filter Performance Optimization" Sep 14, 2025
Code2Life and others added 5 commits September 17, 2025 22:13
* fix: gpu info update

* feat: preempt scheduling, fix metrics scheduling bugs, add evict protection

* fix: unit test issue

* fix: preempt unit testing

* fix: lint issue, add qos to priorityClassName converting
…exusGPU#365)

- Add double-check for TFLOPs and VRAM availability before allocation
- Implemented DRA CEL filters in GPU allocation requests
- Added benchmarks for basic and complex expressions
- Updated the resource slice controller to support Kubernetes hostname labels
… calculations

- Added DRA device attribute constants for improved resource management
- Updated ResourceSlice controller to calculate node-level virtual capacities based on oversubscription configurations
- Enhanced device generation logic to include virtual capacities and additional attributes for better GPU tracking
- Implemented field indexing for ResourceSlice by nodeName to optimize queries
- Bump version of github.com/google/cel-go to v0.26.1
- Update Kubernetes dependencies to v0.34.1 for api, apiserver, component-base, dynamic-resource-allocation, kubelet, and component-helpers
- Add replace directives for local Kubernetes vendor sources
- Removed unnecessary GPUPool retrieval and oversubscription calculations
- Updated virtual capacity calculations to use equal distribution for GPUs
- Cleaned up code by eliminating the calculateNodeVirtualCapacity function
… updates

- Added logic to update ResourceClaim's capacity requests and device count based on Pod annotations
- Implemented idempotency checks to avoid unnecessary updates
- Improved error handling and logging for better traceability
- Introduced new tests to validate the updates and ensure correct behavior
@wangqianqianjun wangqianqianjun changed the title from "dra: cel filter Performance Optimization" to "Dra support" Oct 14, 2025
- Introduced a comprehensive user guide for Tensor Fusion DRA, detailing setup, configuration, and usage for various stakeholders.
- Enhanced ResourceSlice controller to retrieve and utilize default QoS from GPUPool for improved resource allocation.
- Updated device generation logic to include QoS attributes, ensuring better tracking and management of GPU resources.
- Added constants for QoS attributes to facilitate future development and integration.

3 participants