
Conversation

zhenyu-02 (Contributor)

Pull Request Title

[Misc] Add Integration Test Utilities for PodAutoscaler Controller

Pull Request Description

This PR introduces test utilities to support integration testing for the PodAutoscaler controller. The changes include three new helper libraries that provide a clean, reusable API for:

  • Validating PodAutoscaler behavior: Comprehensive validation functions for spec, status, conditions, and scaling decisions
  • Testing HPA interactions: Utilities to verify HPA creation, updates, and synchronization with PodAutoscaler
  • Creating test fixtures: Fluent API wrapper for building PodAutoscaler test objects with minimal boilerplate

These utilities enable 31 comprehensive integration tests covering HPA strategy, spec validation, conflict detection, status management, boundary enforcement, StormService scaling, annotation-based configuration, and advanced scenarios.

Related Issues

#1650

Changes

Files Added

  1. test/utils/validation/hpa.go

    • Validation functions for HorizontalPodAutoscaler resources
    • Verifies HPA spec, status, owner references, and scale target references
    • Provides wait utilities for HPA creation and updates
  2. test/utils/validation/podautoscaler.go

    • Comprehensive validation functions for PodAutoscaler resources
    • Validates spec fields (min/max replicas, scaling strategy)
    • Validates status conditions (Ready, ValidSpec, AbleToScale, etc.)
    • Validates scaling history and decision tracking
    • Provides Eventually-based wait helpers for async operations
  3. test/utils/wrapper/podautoscaler.go

    • Fluent API wrapper for creating PodAutoscaler test fixtures
    • Simplifies test setup with chained method calls
    • Includes builders for various metric source types (POD, RESOURCE, EXTERNAL, CUSTOM)
  • Supports all PodAutoscaler configuration options (annotations, labels, sub-target selectors); a combined usage sketch follows this list
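
To make the intended workflow concrete, here is a hedged usage sketch combining the wrapper and validation helpers. The names NewPodAutoscalerWrapper, ScaleTargetRef, WaitForHPACreation, and the import paths are assumptions inferred from the file descriptions above, not the verified API of this PR (ValidatePodAutoscalerScaling does appear later in the review):

    package controller_test

    import (
        "context"

        "github.com/onsi/gomega"
        "sigs.k8s.io/controller-runtime/pkg/client"

        "github.com/vllm-project/aibrix/test/utils/validation" // assumed path
        "github.com/vllm-project/aibrix/test/utils/wrapper"    // assumed path
    )

    // exampleHPALifecycle sketches how a spec might exercise the HPA strategy.
    // Builder and helper names are assumptions, not the verified API surface.
    func exampleHPALifecycle(ctx context.Context, k8sClient client.Client) {
        pa := wrapper.NewPodAutoscalerWrapper("pa-demo", "default"). // hypothetical constructor
            ScaleTargetRef("apps/v1", "Deployment", "demo-deployment").
            MinReplicas(1).
            MaxReplicas(5).
            Obj()

        gomega.Expect(k8sClient.Create(ctx, pa)).To(gomega.Succeed())

        // Block until the controller materializes the backing HPA, then
        // assert on the reported desired/actual scale.
        validation.WaitForHPACreation(ctx, k8sClient, pa) // hypothetical helper
        validation.ValidatePodAutoscalerScaling(ctx, k8sClient, pa, 1, 1)
    }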

Test Coverage

The integration test suite (podautoscaler_test.go) covers 31 test scenarios across the following categories:

1. HPA Strategy - Resource Lifecycle (3 tests)

  • ✅ HPA creation when PodAutoscaler is created
  • ✅ HPA synchronization when PodAutoscaler spec is updated
  • ✅ HPA cascade deletion when PodAutoscaler is deleted

2. Spec Validation Logic (3 tests)

  • ✅ Invalid ScaleTargetRef detection (empty name)
  • ✅ Invalid replica bounds detection (min > max)
  • ✅ Valid spec acceptance

3. Conflict Detection Mechanism (2 tests)

  • ✅ Conflict detection when two PodAutoscalers target the same resource
  • ✅ Conflict resolution after deleting conflicting PodAutoscaler

4. Status and Condition Management (3 tests)

  • ✅ DesiredScale and ActualScale status updates
  • ✅ AbleToScale condition management
  • ✅ Ready condition state transitions

5. Scale Target Management (2 tests)

  • ✅ Deployment scaling support
  • ✅ Graceful handling of non-existent target resources

6. Boundary Enforcement (3 tests)

  • ✅ maxReplicas enforcement in HPA
  • ✅ minReplicas enforcement in HPA
  • ✅ minReplicas=0 special case handling

7. Scaling History Management (1 test)

  • ✅ Scaling history tracking with size limits

8. StormService Scaling (3 tests)

  • ✅ Replica-mode scaling (entire StormService)
  • ✅ Role-level scaling with SubTargetSelector (see the fixture sketch after this list)
  • ✅ Role-level conflict detection
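
As referenced above, a hedged sketch of a role-scoped fixture (imports as in the earlier sketch, plus the autoscaling v1alpha1 API package). The StormService apiVersion, the SubTargetSelector method, and its argument shape are assumptions drawn from the wrapper description, not the exact API:

    // buildRoleScopedPA sketches a fixture that scales a single StormService
    // role rather than the whole service. All names here are assumptions.
    func buildRoleScopedPA() *autoscalingv1alpha1.PodAutoscaler {
        return wrapper.NewPodAutoscalerWrapper("pa-ss-role", "default").
            ScaleTargetRef("orchestration.aibrix.ai/v1alpha1", "StormService", "demo-storm").
            SubTargetSelector("prefill"). // hypothetical: restrict scaling to one role
            MinReplicas(1).
            MaxReplicas(4).
            Obj()
    }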

9. Annotation-Based Configuration (3 tests)

  • ✅ Scale-up cooldown annotation support
  • ✅ Scale-down delay annotation support
  • ✅ Multiple KPA annotations (panic-threshold, panic-window, stable-window, tolerance); an annotation sketch follows this list
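
As noted in the last item, a sketch of attaching KPA tuning through annotations. Every annotation key below is a placeholder — the controller's real keys are not shown in this PR description — and the Annotations builder method is likewise assumed:

    // buildAnnotatedPA sketches annotation-based KPA configuration.
    // All annotation keys are placeholders, not verified controller keys.
    func buildAnnotatedPA() *autoscalingv1alpha1.PodAutoscaler {
        return wrapper.NewPodAutoscalerWrapper("pa-kpa-annotations", "default").
            Annotations(map[string]string{
                "autoscaling.aibrix.ai/scale-up-cooldown": "30s", // placeholder key
                "autoscaling.aibrix.ai/scale-down-delay":  "60s", // placeholder key
                "autoscaling.aibrix.ai/panic-threshold":   "2.0", // placeholder key
                "autoscaling.aibrix.ai/stable-window":     "60s", // placeholder key
            }).
            Obj()
    }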

10. Advanced Scenarios (2 tests)

  • ✅ Spec update reconciliation
  • ✅ Multiple rapid updates handling

Additional Tests (6 tests)

  • ✅ StormService, RoleSet, and PodSet controller integration tests

Test Results

Note: transient “the object has been modified” warnings are expected when the test and the controller race to update the same object during concurrent reconciles; all 31 specs passed. A standard retry pattern for such conflicts is sketched after the log.

W1019 21:53:21.831912   73858 util.go:89] environment variable AIBRIX_POD_DEPLOYMENT_LABEL is not set, using default value: app.kubernetes.io/name
W1019 21:53:21.832303   73858 util.go:89] environment variable AIBRIX_POD_RAYCLUSTERFLEET_LABEL is not set, using default value: orchestration.aibrix.ai/raycluster-fleet-name
I1019 21:53:21.885714   73858 util.go:106] set AIBRIX_SYNC_MAX_CONTEXTS: 1000, using default value
I1019 21:53:21.885730   73858 util.go:106] set AIBRIX_SYNC_MAX_PREFIXES_PER_CONTEXT: 10000, using default value
I1019 21:53:21.885731   73858 util.go:106] set AIBRIX_SYNC_EVICTION_INTERVAL_SECONDS: 60, using default value
I1019 21:53:21.885733   73858 util.go:106] set AIBRIX_SYNC_EVICTION_DURATION_MINUTES: 20, using default value
I1019 21:53:21.885734   73858 util.go:106] set AIBRIX_PREFIX_CACHE_BLOCK_SIZE: 16, using default value
I1019 21:53:21.885757   73858 util.go:106] set AIBRIX_POD_METRIC_REFRESH_INTERVAL_MS: 50, using default value
W1019 21:53:21.885838   73858 util.go:89] environment variable AIBRIX_MODEL_GPU_PROFILE_CACHING_FLAG is not set, using default value: true
W1019 21:53:21.885910   73858 util.go:89] environment variable AIBRIX_GPU_OPTIMIZER_TRACING_FLAG is not set, using default value: false
=== RUN   TestAPIs
Running Suite: Controller Suite - /Users/ts-zhenyu.b.wang/self-aibrix/aibrix/test/integration/controller
========================================================================================================
Random Seed: 1760882001

Will run 31 of 31 specs
••••••E1019 21:53:41.053270   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-hpa-create\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:53:41.319505   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-hpa-update\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:53:47.590559   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-hpa-delete\": the object has been modified; please apply your changes to the latest version and try again"
•••E1019 21:53:54.879103   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-valid-spec\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:53:56.896152   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-conflict-1\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:00.912228   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-resolve-1\": the object has been modified; please apply your changes to the latest version and try again"
E1019 21:54:04.929875   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-resolve-2\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:07.944875   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-status-scale\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:09.954884   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-condition-able\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:11.973389   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-condition-ready\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:13.988440   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-scale-deployment\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:16.004412   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-target-notfound\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:18.020858   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-boundary-max\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:20.031714   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-boundary-min\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:22.048919   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-boundary-zero\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:24.064851   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-history\": the object has been modified; please apply your changes to the latest version and try again"
•E1019 21:54:27.081931   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-ss-replica\": the object has been modified; please apply your changes to the latest version and try again"
•••••••E1019 21:54:48.157126   73858 podautoscaler_controller.go:679] "Failed to update PodAutoscaler status" err="Operation cannot be fulfilled on podautoscalers.autoscaling.aibrix.ai \"pa-rapid-updates\": the object has been modified; please apply your changes to the latest version and try again"
•

Ran 31 of 31 Specs in 94.920 seconds
SUCCESS! -- 31 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestAPIs (94.92s)
PASS
ok      github.com/vllm-project/aibrix/test/integration/controller      95.651s
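
For context on the transient conflicts in the log: they are ordinary optimistic-concurrency errors, and client-go ships a standard helper for absorbing them. The sketch below is general client-go usage, not code from this PR; the API import path is an assumption:

    package controller_test

    import (
        "context"

        "k8s.io/client-go/util/retry"
        "sigs.k8s.io/controller-runtime/pkg/client"

        autoscalingv1alpha1 "github.com/vllm-project/aibrix/api/autoscaling/v1alpha1" // assumed path
    )

    // updateStatusWithRetry re-reads the object and reapplies the mutation on
    // every "the object has been modified" conflict until the update lands.
    func updateStatusWithRetry(ctx context.Context, c client.Client,
        pa *autoscalingv1alpha1.PodAutoscaler, desired int32) error {
        return retry.RetryOnConflict(retry.DefaultRetry, func() error {
            fetched := &autoscalingv1alpha1.PodAutoscaler{}
            if err := c.Get(ctx, client.ObjectKeyFromObject(pa), fetched); err != nil {
                return err
            }
            fetched.Status.DesiredScale = desired
            return c.Status().Update(ctx, fetched)
        })
    }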

Additional Information

  • Related Controller: pkg/controller/autoscaling/podautoscaler_controller.go

gemini-code-assist bot (Contributor)

Summary of Changes

Hello @zhenyu-02, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the testing infrastructure for the PodAutoscaler controller by introducing a suite of dedicated integration test utilities. These utilities streamline the process of validating the controller's behavior, ensuring its robustness and correctness across various scenarios, including interactions with HorizontalPodAutoscalers and custom resource scaling for StormServices. The new framework allows for more efficient and comprehensive testing, ultimately improving the reliability of the autoscaling mechanisms.

Highlights

  • Comprehensive Validation Utilities: New helper libraries provide extensive validation functions for PodAutoscaler specifications, status, conditions, and scaling decisions.
  • HPA Interaction Testing: Utilities are introduced to verify the creation, updates, and synchronization of HorizontalPodAutoscalers (HPAs) with the PodAutoscaler controller.
  • Simplified Test Fixture Creation: A fluent API wrapper simplifies the creation of PodAutoscaler test objects, reducing boilerplate and improving test setup efficiency.
  • Extensive Integration Test Coverage: The changes enable 31 new integration tests covering critical aspects such as HPA lifecycle management, spec validation, conflict detection, status updates, boundary enforcement, and StormService scaling.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a comprehensive set of integration test utilities and tests for the PodAutoscaler controller. The use of helper libraries for validation and a fluent wrapper for test object creation is excellent practice and greatly improves the readability and maintainability of the tests. The test coverage is extensive. I've identified a few areas for improvement, including a minor typo, the replacement of time.Sleep with more robust Eventually blocks to prevent flaky tests, and the removal of a duplicated helper function. Overall, this is a high-quality contribution that significantly improves the testing infrastructure.

const (
    ConditionReady     = "Ready"
    ConditionValidSpec = "ValidSpec"
    ConditionConflict  = "MutilPodAutoscalerConflict"

Severity: medium

There's a typo in the constant name ConditionConflict. "Mutil" should be "Multi". This should be corrected to MultiPodAutoscalerConflict for clarity and consistency. This constant appears to be copied from the controller logic, so it should be updated there as well for consistency across the codebase.

Suggested change
ConditionConflict = "MutilPodAutoscalerConflict"
ConditionConflict = "MultiPodAutoscalerConflict"

Comment on lines 201 to 207
time.Sleep(time.Second * 3) // Wait for initial reconcile
fetched := validation.GetPodAutoscaler(ctx, k8sClient, pa)
minReplicas := int32(2)
fetched.Spec.MinReplicas = &minReplicas
fetched.Spec.MaxReplicas = 10
gomega.Expect(k8sClient.Update(ctx, fetched)).To(gomega.Succeed())
time.Sleep(time.Second * 3) // Wait for update to propagate

Severity: medium

Using time.Sleep in tests can make them flaky and slow. Instead of waiting for a fixed duration, it's better to use gomega.Eventually to retry operations until they succeed or a timeout is reached. This is especially useful for handling "the object has been modified" errors that can occur when the test and the controller race to update the same object. The current implementation with time.Sleep is not a guaranteed fix for these race conditions and makes the test suite slower. This pattern of replacing time.Sleep with Eventually should be applied throughout the test file where applicable.

Suggested change
time.Sleep(time.Second * 3) // Wait for initial reconcile
fetched := validation.GetPodAutoscaler(ctx, k8sClient, pa)
minReplicas := int32(2)
fetched.Spec.MinReplicas = &minReplicas
fetched.Spec.MaxReplicas = 10
gomega.Expect(k8sClient.Update(ctx, fetched)).To(gomega.Succeed())
time.Sleep(time.Second * 3) // Wait for update to propagate
gomega.Eventually(func() error {
    fetched := validation.GetPodAutoscaler(ctx, k8sClient, pa)
    minReplicas := int32(2)
    fetched.Spec.MinReplicas = &minReplicas
    fetched.Spec.MaxReplicas = 10
    return k8sClient.Update(ctx, fetched)
}, time.Second*10, time.Millisecond*250).Should(gomega.Succeed())

Comment on lines 115 to 131
// ValidatePodAutoscalerScalingEventually validates scaling status and waits for eventual consistency.
func ValidatePodAutoscalerScalingEventually(ctx context.Context,
    k8sClient client.Client,
    pa *autoscalingv1alpha1.PodAutoscaler,
    expectedDesired, expectedActual int32) {

    gomega.Eventually(func(g gomega.Gomega) {
        fetched := &autoscalingv1alpha1.PodAutoscaler{}
        err := k8sClient.Get(ctx, client.ObjectKeyFromObject(pa), fetched)
        g.Expect(err).ToNot(gomega.HaveOccurred())

        g.Expect(fetched.Status.DesiredScale).To(gomega.Equal(expectedDesired),
            "DesiredScale should be %d", expectedDesired)
        g.Expect(fetched.Status.ActualScale).To(gomega.Equal(expectedActual),
            "ActualScale should be %d", expectedActual)
    }, time.Second*10, time.Millisecond*250).Should(gomega.Succeed())
}

Severity: medium

The function ValidatePodAutoscalerScalingEventually is an exact duplicate of ValidatePodAutoscalerScaling. This duplication should be removed to improve maintainability. Since ValidatePodAutoscalerScaling already uses gomega.Eventually, it serves the purpose of waiting for eventual consistency, and the ...Eventually suffix is redundant.

@googs1025 self-assigned this on Oct 20, 2025
@googs1025 (Collaborator)

@zhenyu-02 thanks 😄
can you also fix DCO?

Comment on lines +254 to +376
// Note: In envtest, HPA cascade deletion via OwnerReference doesn't work
// because garbage collector controller is not running. In real K8s,
// the HPA would be automatically deleted due to OwnerReference.
// We already verified OwnerReference is set correctly in the creation test.
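
A minimal sketch of the assertion style this note describes (names assumed; the HPA is presumed to share the PodAutoscaler's name and namespace, and to carry a controller-type OwnerReference):

    // Instead of waiting for cascade deletion (no GC controller in envtest),
    // assert directly that the HPA is owned by the PodAutoscaler.
    // autoscalingv2 = k8s.io/api/autoscaling/v2, metav1 = k8s.io/apimachinery/pkg/apis/meta/v1.
    hpa := &autoscalingv2.HorizontalPodAutoscaler{}
    gomega.Expect(k8sClient.Get(ctx, client.ObjectKeyFromObject(pa), hpa)).To(gomega.Succeed())

    owner := metav1.GetControllerOf(hpa)
    gomega.Expect(owner).ToNot(gomega.BeNil())
    gomega.Expect(owner.Kind).To(gomega.Equal("PodAutoscaler"))
    gomega.Expect(owner.Name).To(gomega.Equal(pa.Name))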
Collaborator

sgtm

@zhenyu-02 force-pushed the feat-add-integrate-test-pa branch from 4285d0e to 6809163 on October 20, 2025 01:51
@zhenyu-02 (Contributor, Author)

@zhenyu-02 thanks 😄 can you also fix DCO?

fixed

@googs1025 (Collaborator)

@zhenyu-02 please fix the ci-lint

@zhenyu-02 (Contributor, Author)

@zhenyu-02 please fix the ci-lint

fixed

Signed-off-by: Wang Zhenyu <ts-zhenyu.b.wang@rakuten.com>
@googs1025 force-pushed the feat-add-integrate-test-pa branch from 4b3bff1 to 7eba636 on October 21, 2025 02:32
@googs1025 merged commit bda162d into vllm-project:main on Oct 21, 2025
14 checks passed
@googs1025 (Collaborator) left a comment

LGTM thanks!!

@zhenyu-02 deleted the feat-add-integrate-test-pa branch on October 21, 2025 06:31
@zhenyu-02 (Contributor, Author)

@googs1025 If there are any other issues that need help, I am very willing to contribute.
