
Conversation


@blenbot blenbot commented Jan 21, 2026

What type of PR is this?
Implementing the TestMetrics e2e test in the kthena ecosystem.

What this PR does / why we need it:
  • Verify routing metrics collection (request count, latency, error rate)
  • Verify rate limiting metrics accuracy

Which issue(s) this PR fixes:
Part of #556

Special notes for your reviewer:
None for now.

Copilot AI review requested due to automatic review settings January 21, 2026 21:05
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot
Contributor

Welcome @blenbot! It looks like this is your first PR to volcano-sh/kthena 🎉

@gemini-code-assist

Summary of Changes

Hello @blenbot, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the testing infrastructure by adding robust end-to-end tests for the router's metrics collection and rate limiting functionality. The new tests ensure that critical operational metrics, such as request counts, latencies, error rates, and rate limit exceedances, are accurately captured and reported, providing better observability and reliability for the routing layer.

Highlights

  • New E2E Metrics Tests: Introduced a new shared end-to-end test function, TestMetricsShared, to verify the collection of router metrics including request count, latency, and error rates. This test deploys a simple ModelRoute and asserts the presence and correctness of relevant Prometheus metrics.
  • New E2E Rate Limiting Metrics Tests: Added another shared end-to-end test function, TestRateLimitMetricsShared, specifically designed to validate the accuracy of rate limiting metrics. This test deploys a ModelRoute with rate limiting configured and verifies that the kthena_router_rate_limit_exceeded_total metric is correctly incremented for rate-limited requests.
  • Integration into E2E Suites: Both new shared test functions (TestMetricsShared and TestRateLimitMetricsShared) have been integrated into both the standard router e2e test suite (test/e2e/router/e2e_test.go) and the Gateway API enabled e2e test suite (test/e2e/router/gateway-api/e2e_test.go), ensuring comprehensive coverage for different deployment scenarios.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces e2e tests for router metrics and rate limit metrics. The implementation uses shared test functions for both standard and Gateway API setups, which is a good pattern. My review focuses on improving the robustness and maintainability of the new test code. I've suggested replacing fixed time.Sleep calls with polling mechanisms to avoid flaky tests, removing a redundant sleep, and refactoring duplicated code into a helper function.

Comment on lines +871 to +878
fetchMetrics := func() string {
    resp, err := http.Get("http://127.0.0.1:9090/metrics")
    require.NoError(t, err, "Failed to fetch metrics")
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    require.NoError(t, err, "Failed to read metrics response")
    return string(body)
}


medium

The fetchMetrics function is duplicated in TestMetricsShared and TestRateLimitMetricsShared (lines 964-971). The only difference is the port number. To improve maintainability and reduce code duplication, consider extracting this into a file-level helper function that accepts the port as a parameter. Also, http.Get does not have a timeout, which can make tests hang. Using an http.Client with a timeout is more robust.

Example:

func fetchMetricsFromPort(t *testing.T, port string) string {
    client := http.Client{Timeout: 5 * time.Second}
    resp, err := client.Get(fmt.Sprintf("http://127.0.0.1:%s/metrics", port))
    require.NoError(t, err, "Failed to fetch metrics from port %s", port)
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    require.NoError(t, err, "Failed to read metrics response")
    return string(body)
}
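
A call site would then shrink to a single line, with each test passing its own port (the "9090" value comes from the snippet above):

metricsBody := fetchMetricsFromPort(t, "9090")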

Comment on lines +890 to +896
time.Sleep(2 * time.Second)

metricsBody := fetchMetrics()

assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")


medium

Using time.Sleep to wait for metrics to become available can lead to flaky tests. If the system is slow, the test might fail. If the system is fast, the test waits longer than necessary. A better approach is to use a polling mechanism like require.Eventually to wait for the desired metrics to appear. This makes the test more robust and efficient.

This advice applies to other time.Sleep calls in this file that wait for metrics (e.g., lines 903, 916, and 997).

Suggested change

- time.Sleep(2 * time.Second)
- metricsBody := fetchMetrics()
- assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
- assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
- assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
+ require.Eventually(t, func() bool {
+     metricsBody := fetchMetrics()
+     return strings.Contains(metricsBody, "kthena_router_requests_total") &&
+         strings.Contains(metricsBody, fmt.Sprintf("model=\"%s\"", modelRoute.Spec.ModelName)) &&
+         strings.Contains(metricsBody, "status_code=\"200\"")
+ }, 15*time.Second, time.Second, "Expected metrics for successful requests were not found in time")
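
Note that the polling variant relies on strings.Contains, so the strings package would also need to be imported in shared.go if it is not already.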

Member


@blenbot This is a good suggestion. You should fix this.

Also, there are too many sleeps; we should use the polling pattern the AI suggested above.

Author

@blenbot blenbot Jan 22, 2026


Alright, I'll add that.

})

// Wait for rate limiter to be configured
time.Sleep(3 * time.Second)


medium

This time.Sleep appears to be redundant. The subsequent call to utils.CheckChatCompletions at line 979 likely has a retry mechanism to ensure the route is ready (as suggested by the comment on line 978), which makes this fixed sleep unnecessary. Removing it will make the test faster and more reliable by waiting for the actual condition instead of a fixed duration.

Contributor

Copilot AI left a comment


Pull request overview

This PR adds end-to-end tests for metrics collection in the kthena router ecosystem. The tests verify routing metrics (request count, latency, error rate) and rate limiting metrics accuracy.

Changes:

  • Added two shared test functions: TestMetricsShared and TestRateLimitMetricsShared in test/e2e/router/shared.go
  • Added corresponding test wrappers in both router and gateway-api test suites to execute the shared tests with and without Gateway API configuration

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • test/e2e/router/shared.go: Implements the shared test functions for metrics verification; TestMetricsShared validates request count, latency, and error metrics, and TestRateLimitMetricsShared validates rate limiting metrics
  • test/e2e/router/e2e_test.go: Adds TestMetrics and TestRateLimitMetrics wrappers that execute the shared tests without Gateway API (no ParentRefs)
  • test/e2e/router/gateway-api/e2e_test.go: Adds TestMetrics and TestRateLimitMetrics wrappers that execute the shared tests with Gateway API enabled (with ParentRefs)


Comment on lines +894 to +896
assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")

Copilot AI Jan 21, 2026


The metric verification only checks if the metric name exists in the response string. This is a weak assertion that could pass even if the metric has a value of 0 or is not properly tracking the requests. Consider parsing the metrics response and verifying that the counter value is at least 3 (since 3 requests were made in the loop above).
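
A minimal sketch of such a value check, assuming the test pulls in the prometheus/common expfmt text parser; the counterValue helper name is illustrative, not part of the PR:

import (
    "strings"
    "testing"

    "github.com/prometheus/common/expfmt"
    "github.com/stretchr/testify/require"
)

// counterValue parses a Prometheus text-format metrics body and returns the
// summed value of the named counter across all of its label sets.
func counterValue(t *testing.T, metricsBody, name string) float64 {
    var parser expfmt.TextParser
    families, err := parser.TextToMetricFamilies(strings.NewReader(metricsBody))
    require.NoError(t, err, "Failed to parse metrics body")

    total := 0.0
    if mf, ok := families[name]; ok {
        for _, m := range mf.GetMetric() {
            total += m.GetCounter().GetValue()
        }
    }
    return total
}

The assertion could then check the actual count after the three requests above, e.g. require.GreaterOrEqual(t, counterValue(t, metricsBody, "kthena_router_requests_total"), 3.0, "Request counter should reflect the requests made").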

Comment on lines +1001 to +1002
assert.Contains(t, metricsBody, "kthena_router_rate_limit_exceeded_total", "Rate limit exceeded metric should exist")
assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")

Copilot AI Jan 21, 2026


The metric verification only checks if the metric name and model label exist in the response. This doesn't verify that rate limiting actually occurred or that the metric value increased. Consider parsing the metrics response to verify that the kthena_router_rate_limit_exceeded_total counter value is greater than 0, since rate limiting should have been triggered based on the requests made above.
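
Reusing the hypothetical counterValue helper sketched in the previous comment, the assertion could become:

// Assumes the counterValue helper from the earlier sketch; any value above 0
// means at least one request was actually rate limited.
require.Greater(t, counterValue(t, metricsBody, "kthena_router_rate_limit_exceeded_total"), 0.0,
    "Rate limit exceeded counter should have incremented")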

blenbot added 2 commits January 22, 2026 02:54
Signed-off-by: blenbot <harshitiszz@gmail.com>
Signed-off-by: blenbot <harshitiszz@gmail.com>
func TestMetricsShared(t *testing.T, testCtx *routercontext.RouterTestContext, testNamespace string, useGatewayAPI bool, kthenaNamespace string) {
    ctx := context.Background()

    routerNamespace := kthenaNamespace
Member


No need to check and reset the namespace here.

Be consistent with the other test cases.

        routerNamespace = "kthena-system"
    }

    // Setup port-forward to router pod for metrics endpoint
Member


No need to set up port-forwarding here either; port-forwarding for 8080 is already set up by default.

func TestRateLimitMetricsShared(t *testing.T, testCtx *routercontext.RouterTestContext, testNamespace string, useGatewayAPI bool, kthenaNamespace string) {
    ctx := context.Background()

    routerNamespace := kthenaNamespace
Member


ditto

        routerNamespace = "kthena-system"
    }

    // Setup port-forward to router pod for metrics endpoint
Member


ditto

    }

    t.Run("VerifyRequestCountMetrics", func(t *testing.T) {
        for i := 0; i < 3; i++ {
Member


I would not suggest using hardcoded values here.
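
One way to avoid the magic number, as a sketch with an illustrative constant name (not from the PR):

// metricsRequestCount is the number of requests the metrics tests issue
// before asserting on the counters.
const metricsRequestCount = 3

for i := 0; i < metricsRequestCount; i++ {
    // send one chat-completion request per iteration ...
}

The same constant can then feed the counter assertion, keeping the request count and the expected metric value in sync.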

        assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
    })

    t.Run("VerifyLatencyMetrics", func(t *testing.T) {
Member


/cc @blenbot
@YaoZengzeng Sir, I would really appreciate your guidance on how this test is actually checking or verifying the latency metrics. To me, this test and the one above it look mostly the same. Could you please point out what I might be missing?
