
Conversation

zhenyu-02 (Contributor)

Pull Request Title

[Misc] Add Integration Test Utilities for PodAutoscaler Controller

Pull Request Description

This PR introduces test utilities to support integration testing for the PodAutoscaler controller. The changes include three new helper libraries that provide a clean, reusable API for:

  • Validating PodAutoscaler behavior: Comprehensive validation functions for spec, status, conditions, and scaling decisions
  • Testing HPA interactions: Utilities to verify HPA creation, updates, and synchronization with PodAutoscaler
  • Creating test fixtures: Fluent API wrapper for building PodAutoscaler test objects with minimal boilerplate

These utilities enable 31 comprehensive integration tests covering HPA strategy, spec validation, conflict detection, status management, boundary enforcement, StormService scaling, annotation-based configuration, and advanced scenarios.

Related Issues

#1650

Changes

Files Added

  1. test/utils/validation/hpa.go

    • Validation functions for HorizontalPodAutoscaler resources
    • Verifies HPA spec, status, owner references, and scale target references
    • Provides wait utilities for HPA creation and updates
  2. test/utils/validation/podautoscaler.go

    • Comprehensive validation functions for PodAutoscaler resources
    • Validates spec fields (min/max replicas, scaling strategy)
    • Validates status conditions (Ready, ValidSpec, AbleToScale, etc.)
    • Validates scaling history and decision tracking
    • Provides Eventually-based wait helpers for async operations
  3. test/utils/wrapper/podautoscaler.go

    • Fluent API wrapper for creating PodAutoscaler test fixtures
    • Simplifies test setup with chained method calls
    • Includes builders for various metric source types (POD, RESOURCE, EXTERNAL, CUSTOM)
  • Supports all PodAutoscaler configuration options (annotations, labels, sub-target selectors); a combined usage sketch follows this list
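
To make the intended workflow concrete, here is a hedged usage sketch combining the wrapper and validation helpers. The names NewPodAutoscalerWrapper, ScaleTargetRef, WaitForHPACreation, and the import paths are assumptions inferred from the file descriptions above, not the verified API of this PR (ValidatePodAutoscalerScaling does appear later in the review):

    package controller_test

    import (
        "context"

        "github.com/onsi/gomega"
        "sigs.k8s.io/controller-runtime/pkg/client"

        "github.com/vllm-project/aibrix/test/utils/validation" // assumed path
        "github.com/vllm-project/aibrix/test/utils/wrapper"    // assumed path
    )

    // exampleHPALifecycle sketches how a spec might exercise the HPA strategy.
    // Builder and helper names are assumptions, not the verified API surface.
    func exampleHPALifecycle(ctx context.Context, k8sClient client.Client) {
        pa := wrapper.NewPodAutoscalerWrapper("pa-demo", "default"). // hypothetical constructor
            ScaleTargetRef("apps/v1", "Deployment", "demo-deployment").
            MinReplicas(1).
            MaxReplicas(5).
            Obj()

        gomega.Expect(k8sClient.Create(ctx, pa)).To(gomega.Succeed())

        // Block until the controller materializes the backing HPA, then
        // assert on the reported desired/actual scale.
        validation.WaitForHPACreation(ctx, k8sClient, pa) // hypothetical helper
        validation.ValidatePodAutoscalerScaling(ctx, k8sClient, pa, 1, 1)
    }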

Test Coverage

The integration test suite (podautoscaler_test.go) covers 31 test scenarios across the following categories:

1. HPA Strategy - Resource Lifecycle (3 tests)

  • ✅ HPA creation when PodAutoscaler is created
  • ✅ HPA synchronization when PodAutoscaler spec is updated
  • ✅ HPA cascade deletion when PodAutoscaler is deleted

2. Spec Validation Logic (3 tests)

  • ✅ Invalid ScaleTargetRef detection (empty name)
  • ✅ Invalid replica bounds detection (min > max)
  • ✅ Valid spec acceptance

3. Conflict Detection Mechanism (2 tests)

  • ✅ Conflict detection when two PodAutoscalers target the same resource
  • ✅ Conflict resolution after deleting conflicting PodAutoscaler

4. Status and Condition Management (3 tests)

  • ✅ DesiredScale and ActualScale status updates
  • ✅ AbleToScale condition management
  • ✅ Ready condition state transitions

5. Scale Target Management (2 tests)

  • ✅ Deployment scaling support
  • ✅ Graceful handling of non-existent target resources

6. Boundary Enforcement (3 tests)

  • ✅ maxReplicas enforcement in HPA
  • ✅ minReplicas enforcement in HPA
  • ✅ minReplicas=0 special case handling

7. Scaling History Management (1 test)

  • ✅ Scaling history tracking with size limits

8. StormService Scaling (3 tests)

  • ✅ Replica-mode scaling (entire StormService)
  • ✅ Role-level scaling with SubTargetSelector (see the fixture sketch after this list)
  • ✅ Role-level conflict detection
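
As referenced above, a hedged sketch of a role-scoped fixture (imports as in the earlier sketch, plus the autoscaling v1alpha1 API package). The StormService apiVersion, the SubTargetSelector method, and its argument shape are assumptions drawn from the wrapper description, not the exact API:

    // buildRoleScopedPA sketches a fixture that scales a single StormService
    // role rather than the whole service. All names here are assumptions.
    func buildRoleScopedPA() *autoscalingv1alpha1.PodAutoscaler {
        return wrapper.NewPodAutoscalerWrapper("pa-ss-role", "default").
            ScaleTargetRef("orchestration.aibrix.ai/v1alpha1", "StormService", "demo-storm").
            SubTargetSelector("prefill"). // hypothetical: restrict scaling to one role
            MinReplicas(1).
            MaxReplicas(4).
            Obj()
    }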

9. Annotation-Based Configuration (3 tests)

  • ✅ Scale-up cooldown annotation support
  • ✅ Scale-down delay annotation support
  • ✅ Multiple KPA annotations (panic-threshold, panic-window, stable-window, tolerance); an annotation sketch follows this list
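
As noted in the last item, a sketch of attaching KPA tuning through annotations. Every annotation key below is a placeholder — the controller's real keys are not shown in this PR description — and the Annotations builder method is likewise assumed:

    // buildAnnotatedPA sketches annotation-based KPA configuration.
    // All annotation keys are placeholders, not verified controller keys.
    func buildAnnotatedPA() *autoscalingv1alpha1.PodAutoscaler {
        return wrapper.NewPodAutoscalerWrapper("pa-kpa-annotations", "default").
            Annotations(map[string]string{
                "autoscaling.aibrix.ai/scale-up-cooldown": "30s", // placeholder key
                "autoscaling.aibrix.ai/scale-down-delay":  "60s", // placeholder key
                "autoscaling.aibrix.ai/panic-threshold":   "2.0", // placeholder key
                "autoscaling.aibrix.ai/stable-window":     "60s", // placeholder key
            }).
            Obj()
    }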

10. Advanced Scenarios (2 tests)

  • ✅ Spec update reconciliation
  • ✅ Multiple rapid updates handling

Additional Tests (6 tests)

  • ✅ StormService, RoleSet, and PodSet controller integration tests

Test Results

Note: transient “the object has been modified” warnings are expected when the test and the controller race to update the same object during concurrent reconciles; all 31 specs passed. A standard retry pattern for such conflicts is sketched after the log.

W1019 21:53:21.831912   73858 util.go:89] environment variable AIBRIX_POD_DEPLOYMENT_LABEL is not set, using default value: app.kubernetes.io/name
W1019 21:53:21.832303   73858 util.go:89] environment variable AIBRIX_POD_RAYCLUSTERFLEET_LABEL is not set, using default value: orchestration.aibrix.ai/raycluster-fleet-name
I1019 21:53:21.885714   73858 util.go:106] set AIBRIX_SYNC_MAX_CONTEXTS: 1000, using default value
I1019 21:53:21.885730   73858 util.go:106] set AIBRIX_SYNC_MAX_PREFIXES_PER_CONTEXT: 10000, using default value
I1019 21:53:21.885731   73858 util.go:106] set AIBRIX_SYNC_EVICTION_INTERVAL_SECONDS: 60, using default value
I1019 21:53:21.885733   73858 util.go:106] set AIBRIX_SYNC_EVICTION_DURATION_MINUTES: 20, using default value
I1019 21:53:21.885734   73858 util.go:106] set AIBRIX_PREFIX_CACHE_BLOCK_SIZE: 16, using default value
I1019 21:53:21.885757   73858 util.go:106] set AIBRIX_POD_METRIC_REFRESH_INTERVAL_MS: 50, using default value
W1019 21:53:21.885838   73858 util.go:89] environment variable AIBRIX_MODEL_GPU_PROFILE_CACHING_FLAG is not set, using default value: true
W1019 21:53:21.885910   73858 util.go:89] environment variable AIBRIX_GPU_OPTIMIZER_TRACING_FLAG is not set, using default value: false
=== RUN   TestAPIs
Running Suite: Controller Suite - /Users/ts-zhenyu.b.wang/self-aibrix/aibrix/test/integration/controller
========================================================================================================
Random Seed: 1760882001

Will run 31 of 31 specs
••••••E1019 21:53:41.053270   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-hpa-create\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:53:41.319505   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-hpa-update\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:53:47.590559   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-hpa-delete\": the object has been modified; please apply your changes to the latest version and try again"
•••E1019 21:53:54.879103   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-valid-spec\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:53:56.896152   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-conflict-1\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:00.912228   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-resolve-1\": the object has been modified; please apply your changes to the latest version and try again"
E1019 21:54:04.929875   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-resolve-2\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:07.944875   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-status-scale\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:09.954884   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-condition-able\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:11.973389   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-condition-ready\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:13.988440   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-scale-deployment\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:16.004412   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-target-notfound\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:18.020858   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-boundary-max\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:20.031714   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-boundary-min\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:22.048919   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-boundary-zero\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:24.064851   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-history\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:27.081931   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-ss-replica\": the object has been modified; please apply your changes to the latest version and try again"
•••••••E1019 21:54:48.157126   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-rapid-updates\": the object has been modified; please apply your changes to the latest version and try again"
•

Ran 31 of 31 Specs in 94.920 seconds
SUCCESS! -- 31 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestAPIs (94.92s)
PASS
ok      github.com/vllm-project/aibrix/test/integration/controller      95.651s
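
For context on the transient conflicts in the log: they are ordinary optimistic-concurrency errors, and client-go ships a standard helper for absorbing them. The sketch below is general client-go usage, not code from this PR; the API import path is an assumption:

    package controller_test

    import (
        "context"

        "k8s.io/client-go/util/retry"
        "sigs.k8s.io/controller-runtime/pkg/client"

        autoscalingv1alpha1 "github.com/vllm-project/aibrix/api/autoscaling/v1alpha1" // assumed path
    )

    // updateStatusWithRetry re-reads the object and reapplies the mutation on
    // every "the object has been modified" conflict until the update lands.
    func updateStatusWithRetry(ctx context.Context, c client.Client,
        pa *autoscalingv1alpha1.PodAutoscaler, desired int32) error {
        return retry.RetryOnConflict(retry.DefaultRetry, func() error {
            fetched := &autoscalingv1alpha1.PodAutoscaler{}
            if err := c.Get(ctx, client.ObjectKeyFromObject(pa), fetched); err != nil {
                return err
            }
            fetched.Status.DesiredScale = desired
            return c.Status().Update(ctx, fetched)
        })
    }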

Additional Information

  • Related Controller: pkg/controller/autoscaling/podautoscaler_controller.go

gemini-code-assist bot (Contributor)

Summary of Changes

Hello @zhenyu-02, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the testing infrastructure for the PodAutoscaler controller by introducing a suite of dedicated integration test utilities. These utilities streamline the process of validating the controller's behavior, ensuring its robustness and correctness across various scenarios, including interactions with HorizontalPodAutoscalers and custom resource scaling for StormServices. The new framework allows for more efficient and comprehensive testing, ultimately improving the reliability of the autoscaling mechanisms.

Highlights

  • Comprehensive Validation Utilities: New helper libraries provide extensive validation functions for PodAutoscaler specifications, status, conditions, and scaling decisions.
  • HPA Interaction Testing: Utilities are introduced to verify the creation, updates, and synchronization of HorizontalPodAutoscalers (HPAs) with the PodAutoscaler controller.
  • Simplified Test Fixture Creation: A fluent API wrapper simplifies the creation of PodAutoscaler test objects, reducing boilerplate and improving test setup efficiency.
  • Extensive Integration Test Coverage: The changes enable 31 new integration tests covering critical aspects such as HPA lifecycle management, spec validation, conflict detection, status updates, boundary enforcement, and StormService scaling.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a comprehensive set of integration test utilities and tests for the PodAutoscaler controller. The use of helper libraries for validation and a fluent wrapper for test object creation is excellent practice and greatly improves the readability and maintainability of the tests. The test coverage is extensive. I've identified a few areas for improvement, including a minor typo, the replacement of time.Sleep with more robust Eventually blocks to prevent flaky tests, and the removal of a duplicated helper function. Overall, this is a high-quality contribution that significantly improves the testing infrastructure.

const (
    ConditionReady     = "Ready"
    ConditionValidSpec = "ValidSpec"
    ConditionConflict  = "MutilPodAutoscalerConflict"

Severity: medium

There's a typo in the constant name ConditionConflict. "Mutil" should be "Multi". This should be corrected to MultiPodAutoscalerConflict for clarity and consistency. This constant appears to be copied from the controller logic, so it should be updated there as well for consistency across the codebase.

Suggested change
ConditionConflict = "MutilPodAutoscalerConflict"
ConditionConflict = "MultiPodAutoscalerConflict"

Comment on lines 201 to 207
time.Sleep(time.Second * 3) // Wait for initial reconcile
fetched := validation.GetPodAutoscaler(ctx, k8sClient, pa)
minReplicas := int32(2)
fetched.Spec.MinReplicas = &minReplicas
fetched.Spec.MaxReplicas = 10
gomega.Expect(k8sClient.Update(ctx, fetched)).To(gomega.Succeed())
time.Sleep(time.Second * 3) // Wait for update to propagate

Severity: medium

Using time.Sleep in tests can make them flaky and slow. Instead of waiting for a fixed duration, it's better to use gomega.Eventually to retry operations until they succeed or a timeout is reached. This is especially useful for handling "the object has been modified" errors that can occur when the test and the controller race to update the same object. The current implementation with time.Sleep is not a guaranteed fix for these race conditions and makes the test suite slower. This pattern of replacing time.Sleep with Eventually should be applied throughout the test file where applicable.

Suggested change
time.Sleep(time.Second * 3) // Wait for initial reconcile
fetched := validation.GetPodAutoscaler(ctx, k8sClient, pa)
minReplicas := int32(2)
fetched.Spec.MinReplicas = &minReplicas
fetched.Spec.MaxReplicas = 10
gomega.Expect(k8sClient.Update(ctx, fetched)).To(gomega.Succeed())
time.Sleep(time.Second * 3) // Wait for update to propagate
gomega.Eventually(func() error {
    fetched := validation.GetPodAutoscaler(ctx, k8sClient, pa)
    minReplicas := int32(2)
    fetched.Spec.MinReplicas = &minReplicas
    fetched.Spec.MaxReplicas = 10
    return k8sClient.Update(ctx, fetched)
}, time.Second*10, time.Millisecond*250).Should(gomega.Succeed())

Comment on lines 115 to 131
// ValidatePodAutoscalerScalingEventually validates scaling status and waits for eventual consistency.
func ValidatePodAutoscalerScalingEventually(ctx context.Context,
    k8sClient client.Client,
    pa *autoscalingv1alpha1.PodAutoscaler,
    expectedDesired, expectedActual int32) {

    gomega.Eventually(func(g gomega.Gomega) {
        fetched := &autoscalingv1alpha1.PodAutoscaler{}
        err := k8sClient.Get(ctx, client.ObjectKeyFromObject(pa), fetched)
        g.Expect(err).ToNot(gomega.HaveOccurred())

        g.Expect(fetched.Status.DesiredScale).To(gomega.Equal(expectedDesired),
            "DesiredScale should be %d", expectedDesired)
        g.Expect(fetched.Status.ActualScale).To(gomega.Equal(expectedActual),
            "ActualScale should be %d", expectedActual)
    }, time.Second*10, time.Millisecond*250).Should(gomega.Succeed())
}

Severity: medium

The function ValidatePodAutoscalerScalingEventually is an exact duplicate of ValidatePodAutoscalerScaling. This duplication should be removed to improve maintainability. Since ValidatePodAutoscalerScaling already uses gomega.Eventually, it serves the purpose of waiting for eventual consistency, and the ...Eventually suffix is redundant.

@googs1025 self-assigned this on Oct 20, 2025
@googs1025 (Collaborator)

@zhenyu-02 thanks 😄
can you also fix DCO?

Comment on lines +254 to +376
// Note: In envtest, HPA cascade deletion via OwnerReference doesn't work
// because garbage collector controller is not running. In real K8s,
// the HPA would be automatically deleted due to OwnerReference.
// We already verified OwnerReference is set correctly in the creation test.
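
A minimal sketch of the assertion style this note describes (names assumed; the HPA is presumed to share the PodAutoscaler's name and namespace, and to carry a controller-type OwnerReference):

    // Instead of waiting for cascade deletion (no GC controller in envtest),
    // assert directly that the HPA is owned by the PodAutoscaler.
    // autoscalingv2 = k8s.io/api/autoscaling/v2, metav1 = k8s.io/apimachinery/pkg/apis/meta/v1.
    hpa := &autoscalingv2.HorizontalPodAutoscaler{}
    gomega.Expect(k8sClient.Get(ctx, client.ObjectKeyFromObject(pa), hpa)).To(gomega.Succeed())

    owner := metav1.GetControllerOf(hpa)
    gomega.Expect(owner).ToNot(gomega.BeNil())
    gomega.Expect(owner.Kind).To(gomega.Equal("PodAutoscaler"))
    gomega.Expect(owner.Name).To(gomega.Equal(pa.Name))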
Collaborator

sgtm

@zhenyu-02 force-pushed the feat-add-integrate-test-pa branch from 4285d0e to 6809163 on October 20, 2025 01:51
@zhenyu-02 (Contributor, Author)

@zhenyu-02 thanks 😄 can you also fix DCO?

fixed

@googs1025 (Collaborator)

@zhenyu-02 please fix the ci-lint

@zhenyu-02 (Contributor, Author)

@zhenyu-02 please fix the ci-lint

fixed

Signed-off-by: Wang Zhenyu <ts-zhenyu.b.wang@rakuten.com>
@googs1025 force-pushed the feat-add-integrate-test-pa branch from 4b3bff1 to 7eba636 on October 21, 2025 02:32
@googs1025 merged commit bda162d into vllm-project:main on Oct 21, 2025
14 checks passed
@googs1025 (Collaborator) left a comment

LGTM thanks!!

@zhenyu-02 deleted the feat-add-integrate-test-pa branch on October 21, 2025 06:31
@zhenyu-02 (Contributor, Author)

@googs1025 If there are any other issues that need help, I am very willing to contribute.
