e2e test for metrics #667

base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.

Welcome @blenbot! It looks like this is your first PR to volcano-sh/kthena 🎉
Summary of Changes

Hello @blenbot, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the testing infrastructure by adding robust end-to-end tests for the router's metrics collection and rate limiting functionality. The new tests ensure that critical operational metrics, such as request counts, latencies, error rates, and rate limit exceedances, are accurately captured and reported, providing better observability and reliability for the routing layer.
Code Review
This PR introduces e2e tests for router metrics and rate limit metrics. The implementation uses shared test functions for both standard and Gateway API setups, which is a good pattern. My review focuses on improving the robustness and maintainability of the new test code. I've suggested replacing fixed time.Sleep calls with polling mechanisms to avoid flaky tests, removing a redundant sleep, and refactoring duplicated code into a helper function.
```go
fetchMetrics := func() string {
	resp, err := http.Get("http://127.0.0.1:9090/metrics")
	require.NoError(t, err, "Failed to fetch metrics")
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	require.NoError(t, err, "Failed to read metrics response")
	return string(body)
}
```
The fetchMetrics function is duplicated in TestMetricsShared and TestRateLimitMetricsShared (lines 964-971). The only difference is the port number. To improve maintainability and reduce code duplication, consider extracting this into a file-level helper function that accepts the port as a parameter. Also, http.Get does not have a timeout, which can make tests hang. Using an http.Client with a timeout is more robust.
Example:

```go
func fetchMetricsFromPort(t *testing.T, port string) string {
	client := http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://127.0.0.1:%s/metrics", port))
	require.NoError(t, err, "Failed to fetch metrics from port %s", port)
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	require.NoError(t, err, "Failed to read metrics response")
	return string(body)
}
```

```go
	time.Sleep(2 * time.Second)

	metricsBody := fetchMetrics()

	assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
	assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
	assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
```
Using time.Sleep to wait for metrics to become available can lead to flaky tests. If the system is slow, the test might fail. If the system is fast, the test waits longer than necessary. A better approach is to use a polling mechanism like require.Eventually to wait for the desired metrics to appear. This makes the test more robust and efficient.
This advice applies to other time.Sleep calls in this file that wait for metrics (e.g., lines 903, 916, and 997).
```diff
-	time.Sleep(2 * time.Second)
-	metricsBody := fetchMetrics()
-	assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
-	assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
-	assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
+	require.Eventually(t, func() bool {
+		metricsBody := fetchMetrics()
+		return strings.Contains(metricsBody, "kthena_router_requests_total") &&
+			strings.Contains(metricsBody, fmt.Sprintf("model=\"%s\"", modelRoute.Spec.ModelName)) &&
+			strings.Contains(metricsBody, "status_code=\"200\"")
+	}, 15*time.Second, time.Second, "Expected metrics for successful requests were not found in time")
```
@blenbot This is a good suggestion; you should fix it.

Also, there are too many sleeps; we should use the pattern the AI suggested above.
Alright imma add that
```go
	})

	// Wait for rate limiter to be configured
	time.Sleep(3 * time.Second)
```
This time.Sleep appears to be redundant. The subsequent call to utils.CheckChatCompletions at line 979 likely has a retry mechanism to ensure the route is ready (as suggested by the comment on line 978), which makes this fixed sleep unnecessary. Removing it will make the test faster and more reliable by waiting for the actual condition instead of a fixed duration.
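If an explicit wait turns out to be needed after all, a bounded poll would still beat a fixed sleep. A minimal sketch, assuming the router's OpenAI-compatible endpoint is reachable on the default port-forward; the URL below is illustrative, not from this PR:

```go
// Poll until the route answers instead of sleeping a fixed 3 seconds.
// Reuse whatever address utils.CheckChatCompletions targets in this suite.
client := http.Client{Timeout: 2 * time.Second}
require.Eventually(t, func() bool {
	resp, err := client.Get("http://127.0.0.1:8080/v1/models")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}, 30*time.Second, time.Second, "route never became ready")
```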
Pull request overview
This PR adds end-to-end tests for metrics collection in the kthena router ecosystem. The tests verify routing metrics (request count, latency, error rate) and rate limiting metrics accuracy.
Changes:
- Added two shared test functions: `TestMetricsShared` and `TestRateLimitMetricsShared` in `test/e2e/router/shared.go`
- Added corresponding test wrappers in both router and gateway-api test suites to execute the shared tests with and without Gateway API configuration
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/e2e/router/shared.go | Implements shared test functions for metrics verification: TestMetricsShared validates request count, latency, and error metrics; TestRateLimitMetricsShared validates rate limiting metrics |
| test/e2e/router/e2e_test.go | Adds TestMetrics and TestRateLimitMetrics wrappers that execute shared tests without Gateway API (no ParentRefs) |
| test/e2e/router/gateway-api/e2e_test.go | Adds TestMetrics and TestRateLimitMetrics wrappers that execute shared tests with Gateway API enabled (with ParentRefs) |
```go
	assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
	assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
	assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
```
Copilot AI (Jan 21, 2026)
The metric verification only checks if the metric name exists in the response string. This is a weak assertion that could pass even if the metric has a value of 0 or is not properly tracking the requests. Consider parsing the metrics response and verifying that the counter value is at least 3 (since 3 requests were made in the loop above).
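One way to do this is to parse the Prometheus exposition format rather than substring-matching. A minimal sketch using the `expfmt` text parser; the `counterValue` helper name and label filter are illustrative, not part of this PR:

```go
import (
	"strings"
	"testing"

	"github.com/prometheus/common/expfmt"
	"github.com/stretchr/testify/require"
)

// counterValue sums a counter across all series of the named metric family
// whose labels include the given key/value pair.
func counterValue(t *testing.T, metricsBody, name, labelKey, labelValue string) float64 {
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(strings.NewReader(metricsBody))
	require.NoError(t, err, "Failed to parse metrics response")
	total := 0.0
	for _, m := range families[name].GetMetric() {
		for _, lp := range m.GetLabel() {
			if lp.GetName() == labelKey && lp.GetValue() == labelValue {
				total += m.GetCounter().GetValue()
			}
		}
	}
	return total
}

// Usage inside the test, assuming 3 requests were made in the loop above:
//   got := counterValue(t, metricsBody, "kthena_router_requests_total", "model", modelRoute.Spec.ModelName)
//   require.GreaterOrEqual(t, got, 3.0, "Request counter should be at least 3")
```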
```go
	assert.Contains(t, metricsBody, "kthena_router_rate_limit_exceeded_total", "Rate limit exceeded metric should exist")
	assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
```
Copilot AI (Jan 21, 2026)
The metric verification only checks if the metric name and model label exist in the response. This doesn't verify that rate limiting actually occurred or that the metric value increased. Consider parsing the metrics response to verify that the kthena_router_rate_limit_exceeded_total counter value is greater than 0, since rate limiting should have been triggered based on the requests made above.
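With the same illustrative `counterValue` helper sketched above, this check might become:

```go
exceeded := counterValue(t, metricsBody, "kthena_router_rate_limit_exceeded_total", "model", modelRoute.Spec.ModelName)
require.Greater(t, exceeded, 0.0, "Rate limit exceeded counter should be greater than zero")
```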
force-pushed from 073b3bc to 0952862

Signed-off-by: blenbot <harshitiszz@gmail.com>
Signed-off-by: blenbot <harshitiszz@gmail.com>

force-pushed from 0952862 to 81beff3
```go
func TestMetricsShared(t *testing.T, testCtx *routercontext.RouterTestContext, testNamespace string, useGatewayAPI bool, kthenaNamespace string) {
	ctx := context.Background()

	routerNamespace := kthenaNamespace
```
Don't need to check and reset namespace.
Be consistent with other test cases.
```go
		routerNamespace = "kthena-system"
	}

	// Setup port-forward to router pod for metrics endpoint
```
No need to set up a port-forward either; we already set up a port-forward for 8080 by default.
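If the default 8080 port-forward also serves the metrics endpoint, as this comment implies (an assumption worth verifying), the helper suggested earlier could simply be called with that port:

```go
metricsBody := fetchMetricsFromPort(t, "8080")
```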
```go
func TestRateLimitMetricsShared(t *testing.T, testCtx *routercontext.RouterTestContext, testNamespace string, useGatewayAPI bool, kthenaNamespace string) {
	ctx := context.Background()

	routerNamespace := kthenaNamespace
```
ditto
```go
		routerNamespace = "kthena-system"
	}

	// Setup port-forward to router pod for metrics endpoint
```
ditto
```go
	}

	t.Run("VerifyRequestCountMetrics", func(t *testing.T) {
		for i := 0; i < 3; i++ {
```
I would not suggest using hardcoded values.
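A small sketch of the suggested change, lifting the magic number into a named constant so the request loop and any later counter assertion stay in sync (the constant name is illustrative):

```go
const expectedRequestCount = 3 // single source of truth for the loop and the metric assertion

for i := 0; i < expectedRequestCount; i++ {
	// send a chat-completions request as before
}
```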
```go
		assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
	})

	t.Run("VerifyLatencyMetrics", func(t *testing.T) {
```
/cc @blenbot
@YaoZengzeng Sir, I would really appreciate your guidance on how this test is actually checking or verifying the latency metrics. To me, this test and the one above it look mostly the same. Could you please point out what I might be missing?
What type of PR is this?
Implementing the TestMetrics e2e test in the kthena ecosystem.
What this PR does / why we need it:
Verify routing metrics collection (request count, latency, error rate)
Verify rate limiting metrics accuracy
Which issue(s) this PR fixes:
Part of #556
Special notes for your reviewer:
not for now