e2e test for metrics #667

base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.

Welcome @blenbot! It looks like this is your first PR to volcano-sh/kthena 🎉
Summary of Changes

Hello @blenbot, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the testing infrastructure by adding robust end-to-end tests for the router's metrics collection and rate limiting functionality. The new tests ensure that critical operational metrics, such as request counts, latencies, error rates, and rate limit exceedances, are accurately captured and reported, providing better observability and reliability for the routing layer.
Code Review
This PR introduces e2e tests for router metrics and rate limit metrics. The implementation uses shared test functions for both standard and Gateway API setups, which is a good pattern. My review focuses on improving the robustness and maintainability of the new test code. I've suggested replacing fixed time.Sleep calls with polling mechanisms to avoid flaky tests, removing a redundant sleep, and refactoring duplicated code into a helper function.
```go
fetchMetrics := func() string {
	resp, err := http.Get("http://127.0.0.1:9090/metrics")
	require.NoError(t, err, "Failed to fetch metrics")
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	require.NoError(t, err, "Failed to read metrics response")
	return string(body)
}
```
The fetchMetrics function is duplicated in TestMetricsShared and TestRateLimitMetricsShared (lines 964-971). The only difference is the port number. To improve maintainability and reduce code duplication, consider extracting this into a file-level helper function that accepts the port as a parameter. Also, http.Get does not have a timeout, which can make tests hang. Using an http.Client with a timeout is more robust.
Example:

```go
func fetchMetricsFromPort(t *testing.T, port string) string {
	client := http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://127.0.0.1:%s/metrics", port))
	require.NoError(t, err, "Failed to fetch metrics from port %s", port)
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	require.NoError(t, err, "Failed to read metrics response")
	return string(body)
}
```

```go
	time.Sleep(2 * time.Second)

	metricsBody := fetchMetrics()

	assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
	assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
	assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
```
Using time.Sleep to wait for metrics to become available can lead to flaky tests. If the system is slow, the test might fail. If the system is fast, the test waits longer than necessary. A better approach is to use a polling mechanism like require.Eventually to wait for the desired metrics to appear. This makes the test more robust and efficient.
This advice applies to other time.Sleep calls in this file that wait for metrics (e.g., lines 903, 916, and 997).
```diff
-	time.Sleep(2 * time.Second)
-	metricsBody := fetchMetrics()
-	assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
-	assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
-	assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
+	require.Eventually(t, func() bool {
+		metricsBody := fetchMetrics()
+		return strings.Contains(metricsBody, "kthena_router_requests_total") &&
+			strings.Contains(metricsBody, fmt.Sprintf("model=\"%s\"", modelRoute.Spec.ModelName)) &&
+			strings.Contains(metricsBody, "status_code=\"200\"")
+	}, 15*time.Second, time.Second, "Expected metrics for successful requests were not found in time")
```
@blenbot This is a good suggestion; you should fix it.

Also, there are too many sleeps; we should use the pattern the AI suggested above.
Alright imma add that
```go
	})

	// Wait for rate limiter to be configured
	time.Sleep(3 * time.Second)
```
This time.Sleep appears to be redundant. The subsequent call to utils.CheckChatCompletions at line 979 likely has a retry mechanism to ensure the route is ready (as suggested by the comment on line 978), which makes this fixed sleep unnecessary. Removing it will make the test faster and more reliable by waiting for the actual condition instead of a fixed duration.
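If an explicit wait turns out to be needed after all, a bounded poll would still beat a fixed sleep. A minimal sketch, assuming the router's OpenAI-compatible endpoint is reachable on the default port-forward; the URL below is illustrative, not from this PR:

```go
// Poll until the route answers instead of sleeping a fixed 3 seconds.
// Reuse whatever address utils.CheckChatCompletions targets in this suite.
client := http.Client{Timeout: 2 * time.Second}
require.Eventually(t, func() bool {
	resp, err := client.Get("http://127.0.0.1:8080/v1/models")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}, 30*time.Second, time.Second, "route never became ready")
```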
Pull request overview
This PR adds end-to-end tests for metrics collection in the kthena router ecosystem. The tests verify routing metrics (request count, latency, error rate) and rate limiting metrics accuracy.
Changes:
- Added two shared test functions: `TestMetricsShared` and `TestRateLimitMetricsShared` in `test/e2e/router/shared.go`
- Added corresponding test wrappers in both router and gateway-api test suites to execute the shared tests with and without Gateway API configuration
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/e2e/router/shared.go | Implements shared test functions for metrics verification: TestMetricsShared validates request count, latency, and error metrics; TestRateLimitMetricsShared validates rate limiting metrics |
| test/e2e/router/e2e_test.go | Adds TestMetrics and TestRateLimitMetrics wrappers that execute shared tests without Gateway API (no ParentRefs) |
| test/e2e/router/gateway-api/e2e_test.go | Adds TestMetrics and TestRateLimitMetrics wrappers that execute shared tests with Gateway API enabled (with ParentRefs) |
```go
	assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
	assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
	assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
```
Copilot AI (Jan 21, 2026)
The metric verification only checks if the metric name exists in the response string. This is a weak assertion that could pass even if the metric has a value of 0 or is not properly tracking the requests. Consider parsing the metrics response and verifying that the counter value is at least 3 (since 3 requests were made in the loop above).
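One way to do this is to parse the Prometheus exposition format rather than substring-matching. A minimal sketch using the `expfmt` text parser; the `counterValue` helper name and label filter are illustrative, not part of this PR:

```go
import (
	"strings"
	"testing"

	"github.com/prometheus/common/expfmt"
	"github.com/stretchr/testify/require"
)

// counterValue sums a counter across all series of the named metric family
// whose labels include the given key/value pair.
func counterValue(t *testing.T, metricsBody, name, labelKey, labelValue string) float64 {
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(strings.NewReader(metricsBody))
	require.NoError(t, err, "Failed to parse metrics response")
	total := 0.0
	for _, m := range families[name].GetMetric() {
		for _, lp := range m.GetLabel() {
			if lp.GetName() == labelKey && lp.GetValue() == labelValue {
				total += m.GetCounter().GetValue()
			}
		}
	}
	return total
}

// Usage inside the test, assuming 3 requests were made in the loop above:
//   got := counterValue(t, metricsBody, "kthena_router_requests_total", "model", modelRoute.Spec.ModelName)
//   require.GreaterOrEqual(t, got, 3.0, "Request counter should be at least 3")
```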
```go
	assert.Contains(t, metricsBody, "kthena_router_rate_limit_exceeded_total", "Rate limit exceeded metric should exist")
	assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
```
Copilot AI (Jan 21, 2026)
The metric verification only checks if the metric name and model label exist in the response. This doesn't verify that rate limiting actually occurred or that the metric value increased. Consider parsing the metrics response to verify that the kthena_router_rate_limit_exceeded_total counter value is greater than 0, since rate limiting should have been triggered based on the requests made above.
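With the same illustrative `counterValue` helper sketched above, this check might become:

```go
exceeded := counterValue(t, metricsBody, "kthena_router_rate_limit_exceeded_total", "model", modelRoute.Spec.ModelName)
require.Greater(t, exceeded, 0.0, "Rate limit exceeded counter should be greater than zero")
```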
force-pushed from 073b3bc to 0952862

Signed-off-by: blenbot <harshitiszz@gmail.com>
Signed-off-by: blenbot <harshitiszz@gmail.com>

force-pushed from 0952862 to 81beff3
```go
func TestMetricsShared(t *testing.T, testCtx *routercontext.RouterTestContext, testNamespace string, useGatewayAPI bool, kthenaNamespace string) {
	ctx := context.Background()

	routerNamespace := kthenaNamespace
```
Don't need to check and reset namespace.
Be consistent with other test cases.
```go
		routerNamespace = "kthena-system"
	}

	// Setup port-forward to router pod for metrics endpoint
```
No need to set up a port-forward either; we already set up a port-forward for 8080 by default.
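If the default 8080 port-forward also serves the metrics endpoint, as this comment implies (an assumption worth verifying), the helper suggested earlier could simply be called with that port:

```go
metricsBody := fetchMetricsFromPort(t, "8080")
```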
```go
func TestRateLimitMetricsShared(t *testing.T, testCtx *routercontext.RouterTestContext, testNamespace string, useGatewayAPI bool, kthenaNamespace string) {
	ctx := context.Background()

	routerNamespace := kthenaNamespace
```
ditto
```go
		routerNamespace = "kthena-system"
	}

	// Setup port-forward to router pod for metrics endpoint
```
ditto
```go
	}

	t.Run("VerifyRequestCountMetrics", func(t *testing.T) {
		for i := 0; i < 3; i++ {
```
I would not suggest using hardcoded values.
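A small sketch of the suggested change, lifting the magic number into a named constant so the request loop and any later counter assertion stay in sync (the constant name is illustrative):

```go
const expectedRequestCount = 3 // single source of truth for the loop and the metric assertion

for i := 0; i < expectedRequestCount; i++ {
	// send a chat-completions request as before
}
```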
```go
		assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
	})

	t.Run("VerifyLatencyMetrics", func(t *testing.T) {
```
/cc @blenbot
@YaoZengzeng Sir, I would really appreciate your guidance on how this test is actually checking or verifying the latency metrics. To me, this test and the one above it look mostly the same. Could you please point out what I might be missing?
What type of PR is this?
Implementing the TestMetrics e2e test in the kthena ecosystem.
What this PR does / why we need it:
Verify routing metrics collection (request count, latency, error rate)
Verify rate limiting metrics accuracy
Which issue(s) this PR fixes:
Part of #556
Special notes for your reviewer:
not for now