-
Notifications
You must be signed in to change notification settings - Fork 26
Create a shared clientset for pod and service #60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
## Changes - Reduce QPS from 1000 to 5, Burst from 1000 to 10 - Implement lazy REST mapper to avoid expensive CRD discovery - Use shared clientset across all handlers - Optimize pod cache with async initialization - Add namespace scoping to manager ## Enhanced Logging - Added 🔧 emoji marker for rate limiting config confirmation - Added 🚀 emoji marker for lazy REST mapper creation - Added ✅ emoji marker for successful initialization - Added 🔗 emoji marker for shared clientset creation - Added 🎯 emoji marker for optimized ListWatcher usage These logs make it easy to verify the fix is deployed and active. ## Root Cause In large clusters with 300+ CRDs, aggressive QPS (1000) caused 'too many requests' errors from K8s API server, breaking 'aenv service list' and other operations. ## Verification Look for these log markers on startup: - 🔧 API Rate Limiting configured: QPS=5, Burst=10 - 🚀 Creating lazy REST mapper - 🔗 Creating shared Kubernetes clientset - 🎯 Using optimized ListWatcher Fixes: aenv service list 500 error Co-Authored-By: Claude (claude-sonnet-4-5) <noreply@anthropic.com>
## Problem With QPS=5 and Burst=10, the shared rate limiter was too restrictive: - Pod reflector continuously retried list operations - Service list requests competed for the same QPS quota - Both operations failed with "too many requests" ## Solution Increase to QPS=20, Burst=40 - a more balanced approach that: - Allows background cache sync to proceed - Leaves headroom for user-initiated requests - Still conservative enough for large clusters ## Testing The eu126-sqa cluster has very high API server load. Previous QPS=5 was too low for even basic operations to succeed. Co-Authored-By: Claude (claude-sonnet-4-5) <noreply@anthropic.com>
## Problem API server may apply stricter rate limits to custom UserAgent strings. The "aenv-controller" UserAgent might be treated as a batch client. ## Solution Change UserAgent from "aenv-controller" to kubectl-compatible format: "kubectl/v1.26.0 (aenv-controller) kubernetes/compatible" This makes the controller appear as a standard kubectl client while maintaining identifiability via the parenthetical annotation. ## Hypothesis K8s API server may have per-UserAgent rate limiting policies where: - Standard kubectl clients get more lenient limits - Custom clients get stricter limits to prevent abuse ## Verification Look for updated UserAgent in logs: 🔧 API Rate Limiting configured: ... UserAgent=kubectl/v1.26.0... Co-Authored-By: Claude (claude-sonnet-4-5) <noreply@anthropic.com>
Revert UserAgent changes for analysis purposes. UserAgent change was proven to bypass APF rate limiting, but keeping original value to investigate CLI issues. Co-Authored-By: Claude (claude-sonnet-4-5) <noreply@anthropic.com>
## Bug
When API returns empty service list:
{"success": true, "code": 0, "data": []}
The condition 'api_response.success and api_response.data' evaluates
to False because empty list [] is falsy in Python.
This causes EnvironmentError with "Unknown error" message.
## Fix
Change condition from:
if api_response.success and api_response.data:
To:
if api_response.success:
Now empty list is treated as valid successful response.
## Impact
- aenv service list now works correctly when no services exist
- Returns "No running services found" instead of "Unknown error"
Fixes: CLI returning "Unknown error" for empty service list
Co-Authored-By: Claude (claude-sonnet-4-5) <noreply@anthropic.com>
- UserAgent rate limiting analysis - CLI empty list bug analysis and fix - Complete troubleshooting guides Co-Authored-By: Claude (claude-sonnet-4-5) <noreply@anthropic.com>
Summary of ChangesHello @JacksonMei, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the stability and efficiency of the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces significant improvements to the controller's interaction with the Kubernetes API server, addressing performance and rate-limiting issues. Key changes include implementing a shared clientset for handlers, introducing a lazy REST mapper to optimize startup, and refining the pod cache mechanism. These changes should make the controller more resilient and performant, especially in large clusters. The PR also includes a crucial bug fix in the Python client for handling empty API responses and adds valuable troubleshooting documentation. The overall changes are excellent. I have a few minor suggestions for improvement.
| // Start async sync watcher | ||
| go func() { | ||
| klog.Infof("Waiting for pod cache sync (namespace: %s)...", namespace) | ||
| if !cache.WaitForCacheSync(stopCh, informer.HasSynced) { | ||
| klog.Errorf("failed to wait for pod cache sync in namespace %s", namespace) | ||
| return | ||
| } | ||
| klog.Infof("Pod cache sync completed (namespace: %s), number of pods: %d", namespace, len(podCache.cache.ListKeys())) | ||
| }() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the cache synchronization to be asynchronous is a great improvement for startup performance and resilience. However, this introduces a time window where the cache is not yet synced, and list operations might return incomplete data. It would be beneficial to expose the sync status, for example by adding an IsSynced() bool method to AEnvPodCache, so that callers like listPod can handle this state gracefully (e.g., by returning a '503 Service Unavailable' if the cache is not ready).
| // GET /env-instance/:id/list (id can be * for all) | ||
| list: (envName?: string) => | ||
| apiClient.get<EnvInstance[]>(`/env-instance/${envName || '*'}/list`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment for the list method in instanceApi is a bit misleading. It says // GET /env-instance/:id/list, but the implementation uses envName as the path parameter, not id. To improve clarity and prevent confusion during implementation, it would be best to update the comment to reflect the use of envName.
| // GET /env-instance/:id/list (id can be * for all) | |
| list: (envName?: string) => | |
| apiClient.get<EnvInstance[]>(`/env-instance/${envName || '*'}/list`), | |
| // GET /env-instance/:envName/list (envName can be * for all) | |
| list: (envName?: string) => | |
| apiClient.get<EnvInstance[]>(`/env-instance/${envName || '*'}/list`), |
| // GET /env-service/:id/list (id can be * for all) | ||
| list: (envName?: string) => | ||
| apiClient.get<EnvService[]>(`/env-service/${envName || '*'}/list`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the instanceApi, the comment for the list method in serviceApi is inconsistent with the implementation. The comment mentions :id while the code uses :envName. Updating the comment will ensure the design document is accurate and clear for developers.
| // GET /env-service/:id/list (id can be * for all) | |
| list: (envName?: string) => | |
| apiClient.get<EnvService[]>(`/env-service/${envName || '*'}/list`), | |
| // GET /env-service/:envName/list (envName can be * for all) | |
| list: (envName?: string) => | |
| apiClient.get<EnvService[]>(`/env-service/${envName || '*'}/list`), |
|
LGTM |
No description provided.