Skip to content

Conversation

@JacksonMei
Copy link
Collaborator

No description provided.

JacksonMei and others added 7 commits January 29, 2026 12:41
## Changes
- Reduce QPS from 1000 to 5, Burst from 1000 to 10
- Implement lazy REST mapper to avoid expensive CRD discovery
- Use shared clientset across all handlers
- Optimize pod cache with async initialization
- Add namespace scoping to manager

## Enhanced Logging
- Added 🔧 emoji marker for rate limiting config confirmation
- Added 🚀 emoji marker for lazy REST mapper creation
- Added ✅ emoji marker for successful initialization
- Added 🔗 emoji marker for shared clientset creation
- Added 🎯 emoji marker for optimized ListWatcher usage

These logs make it easy to verify the fix is deployed and active.

## Root Cause
In large clusters with 300+ CRDs, aggressive QPS (1000) caused
'too many requests' errors from K8s API server, breaking
'aenv service list' and other operations.

## Verification
Look for these log markers on startup:
- 🔧 API Rate Limiting configured: QPS=5, Burst=10
- 🚀 Creating lazy REST mapper
- 🔗 Creating shared Kubernetes clientset
- 🎯 Using optimized ListWatcher

Fixes: aenv service list 500 error

Co-Authored-By: Claude (claude-sonnet-4-5) <noreply@anthropic.com>
## Problem
With QPS=5 and Burst=10, the shared rate limiter was too restrictive:
- Pod reflector continuously retried list operations
- Service list requests competed for the same QPS quota
- Both operations failed with "too many requests"

## Solution
Increase to QPS=20, Burst=40 - a more balanced approach that:
- Allows background cache sync to proceed
- Leaves headroom for user-initiated requests
- Still conservative enough for large clusters

## Testing
The eu126-sqa cluster has very high API server load. Previous
QPS=5 was too low for even basic operations to succeed.

Co-Authored-By: Claude (claude-sonnet-4-5) <noreply@anthropic.com>
## Problem
API server may apply stricter rate limits to custom UserAgent strings.
The "aenv-controller" UserAgent might be treated as a batch client.

## Solution
Change UserAgent from "aenv-controller" to kubectl-compatible format:
"kubectl/v1.26.0 (aenv-controller) kubernetes/compatible"

This makes the controller appear as a standard kubectl client while
maintaining identifiability via the parenthetical annotation.

## Hypothesis
K8s API server may have per-UserAgent rate limiting policies where:
- Standard kubectl clients get more lenient limits
- Custom clients get stricter limits to prevent abuse

## Verification
Look for updated UserAgent in logs:
🔧 API Rate Limiting configured: ... UserAgent=kubectl/v1.26.0...

Co-Authored-By: Claude (claude-sonnet-4-5) <noreply@anthropic.com>
Revert UserAgent changes for analysis purposes.
UserAgent change was proven to bypass APF rate limiting,
but keeping original value to investigate CLI issues.

Co-Authored-By: Claude (claude-sonnet-4-5) <noreply@anthropic.com>
## Bug
When API returns empty service list:
{"success": true, "code": 0, "data": []}

The condition 'api_response.success and api_response.data' evaluates
to False because empty list [] is falsy in Python.

This causes EnvironmentError with "Unknown error" message.

## Fix
Change condition from:
  if api_response.success and api_response.data:

To:
  if api_response.success:

Now empty list is treated as valid successful response.

## Impact
- aenv service list now works correctly when no services exist
- Returns "No running services found" instead of "Unknown error"

Fixes: CLI returning "Unknown error" for empty service list

Co-Authored-By: Claude (claude-sonnet-4-5) <noreply@anthropic.com>
- UserAgent rate limiting analysis
- CLI empty list bug analysis and fix
- Complete troubleshooting guides

Co-Authored-By: Claude (claude-sonnet-4-5) <noreply@anthropic.com>
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @JacksonMei, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the stability and efficiency of the aenv controller and its associated CLI. It introduces critical optimizations to how the controller interacts with the Kubernetes API, including the adoption of shared clientsets, refined API request rates, and a more efficient pod caching mechanism. Concurrently, a client-side bug that incorrectly processed empty API responses has been resolved. The changes are complemented by extensive documentation, providing a new frontend design plan and detailed analyses of the technical challenges addressed.

Highlights

  • Kubernetes API Client Optimization: Implemented a shared Kubernetes clientset, adjusted default QPS/Burst rates, and introduced a lazy REST mapper to significantly reduce API server load and avoid rate limiting, especially in high-load clusters.
  • Efficient Pod Caching: Refactored the Pod cache mechanism to use a more efficient ListWatchFromClient instead of SharedInformerFactory, reducing unnecessary resource watching and improving informer efficiency.
  • CLI Bug Resolution: Fixed a Python client bug that caused an "Unknown error" when the API returned an empty list of services, improving the robustness and user experience of the aenv CLI.
  • Comprehensive Documentation: Added detailed design and troubleshooting documents covering a new frontend design, an in-depth analysis of Kubernetes API rate limiting, and the specifics of the CLI bug fix.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant improvements to the controller's interaction with the Kubernetes API server, addressing performance and rate-limiting issues. Key changes include implementing a shared clientset for handlers, introducing a lazy REST mapper to optimize startup, and refining the pod cache mechanism. These changes should make the controller more resilient and performant, especially in large clusters. The PR also includes a crucial bug fix in the Python client for handling empty API responses and adds valuable troubleshooting documentation. The overall changes are excellent. I have a few minor suggestions for improvement.

Comment on lines +86 to +94
// Start async sync watcher
go func() {
klog.Infof("Waiting for pod cache sync (namespace: %s)...", namespace)
if !cache.WaitForCacheSync(stopCh, informer.HasSynced) {
klog.Errorf("failed to wait for pod cache sync in namespace %s", namespace)
return
}
klog.Infof("Pod cache sync completed (namespace: %s), number of pods: %d", namespace, len(podCache.cache.ListKeys()))
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Changing the cache synchronization to be asynchronous is a great improvement for startup performance and resilience. However, this introduces a time window where the cache is not yet synced, and list operations might return incomplete data. It would be beneficial to expose the sync status, for example by adding an IsSynced() bool method to AEnvPodCache, so that callers like listPod can handle this state gracefully (e.g., by returning a '503 Service Unavailable' if the cache is not ready).

Comment on lines 204 to 206
// GET /env-instance/:id/list (id can be * for all)
list: (envName?: string) =>
apiClient.get<EnvInstance[]>(`/env-instance/${envName || '*'}/list`),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment for the list method in instanceApi is a bit misleading. It says // GET /env-instance/:id/list, but the implementation uses envName as the path parameter, not id. To improve clarity and prevent confusion during implementation, it would be best to update the comment to reflect the use of envName.

Suggested change
// GET /env-instance/:id/list (id can be * for all)
list: (envName?: string) =>
apiClient.get<EnvInstance[]>(`/env-instance/${envName || '*'}/list`),
// GET /env-instance/:envName/list (envName can be * for all)
list: (envName?: string) =>
apiClient.get<EnvInstance[]>(`/env-instance/${envName || '*'}/list`),

Comment on lines 264 to 266
// GET /env-service/:id/list (id can be * for all)
list: (envName?: string) =>
apiClient.get<EnvService[]>(`/env-service/${envName || '*'}/list`),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the instanceApi, the comment for the list method in serviceApi is inconsistent with the implementation. The comment mentions :id while the code uses :envName. Updating the comment will ensure the design document is accurate and clear for developers.

Suggested change
// GET /env-service/:id/list (id can be * for all)
list: (envName?: string) =>
apiClient.get<EnvService[]>(`/env-service/${envName || '*'}/list`),
// GET /env-service/:envName/list (envName can be * for all)
list: (envName?: string) =>
apiClient.get<EnvService[]>(`/env-service/${envName || '*'}/list`),

@lanmaoxinqing lanmaoxinqing self-assigned this Jan 29, 2026
@lanmaoxinqing lanmaoxinqing self-requested a review January 29, 2026 08:06
@lanmaoxinqing lanmaoxinqing removed their assignment Jan 29, 2026
@lanmaoxinqing
Copy link
Collaborator

LGTM

@lanmaoxinqing lanmaoxinqing merged commit ae5839d into main Jan 29, 2026
1 check passed
@lanmaoxinqing lanmaoxinqing deleted the fix/controller branch January 29, 2026 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants