approval controller, metric collector controllers #6

Arvindthiru · 2025-12-06T01:07:25Z

This PR introduces a complete solution for automating approval decisions in KubeFleet staged rollouts based on workload health metrics from Prometheus.

What's Added:
Two Standalone Controllers:

Approval-Request-Controller (hub cluster): Watches ApprovalRequests/ClusterApprovalRequests, deploys MetricCollectors to member clusters, and auto-approves based on workload health
Metric-Collector (member clusters): Queries Prometheus for workload health metrics and reports back to the hub

Custom Resources:

MetricCollector: Defines what metrics to collect and where to report
MetricCollectorReport: Contains collected health metrics from member clusters
ClusterStagedWorkloadTracker: Specifies which workloads must be healthy before approval for ClusterStagedUpdateRun
StagedWorkloadTracker: Specifies which workloads must be healthy before approval for StagedUpdateRun
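
As a rough illustration of the approval gate described above, the following sketch checks that every tracked workload has reported healthy before approving. The type names, the `Healthy` field, and the rule that an empty tracker never auto-approves are assumptions made for the example, not the types or logic in this PR.

```go
package main

import "fmt"

// workloadRef identifies a tracked workload by namespace and name
// (hypothetical stand-in for the tracker's workload reference).
type workloadRef struct {
	Namespace string
	Name      string
}

// workloadMetric is a hypothetical, trimmed-down view of one reported sample.
type workloadMetric struct {
	Namespace    string
	WorkloadName string
	Healthy      bool
}

// allTrackedHealthy reports whether every tracked workload has at least one
// sample and no unhealthy sample. A workload without samples blocks approval,
// and so does an empty tracker list (an assumption for this sketch).
func allTrackedHealthy(tracked []workloadRef, metrics []workloadMetric) bool {
	if len(tracked) == 0 {
		return false
	}
	healthy := make(map[workloadRef]bool, len(metrics))
	for _, m := range metrics {
		key := workloadRef{Namespace: m.Namespace, Name: m.WorkloadName}
		prev, seen := healthy[key]
		// Once any sample for a workload is unhealthy, it stays unhealthy.
		healthy[key] = m.Healthy && (!seen || prev)
	}
	for _, ref := range tracked {
		if !healthy[ref] {
			return false
		}
	}
	return true
}

func main() {
	tracked := []workloadRef{{Namespace: "test-ns", Name: "web"}}
	metrics := []workloadMetric{{Namespace: "test-ns", WorkloadName: "web", Healthy: true}}
	fmt.Println("approve:", allTrackedHealthy(tracked, metrics))
}
```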

Documentation:

Main tutorial with complete end-to-end setup guide
Controller-specific READMEs
Example configurations for Prometheus, staged updates, and workload tracking
Automated installation scripts for both hub and member clusters

Testing
Tested with KubeFleet v0.1.2 on kind clusters (1 hub + 3 members)

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

Copilot

Pull request overview

This PR introduces a comprehensive solution for automating approval decisions in KubeFleet staged rollouts based on workload health metrics from Prometheus. The implementation adds two standalone controllers (approval-request-controller on hub, metric-collector on members) and four custom resources to enable automated staged rollout approvals.

Key Changes:

Two standalone Kubernetes controllers for metric-based approval automation
Four new CRDs for metric collection and workload tracking
Complete documentation and installation scripts for both controllers
Integration with KubeFleet v0.1.2 for staged update orchestration

Reviewed changes

Copilot reviewed 64 out of 67 changed files in this pull request and generated 3 comments.

File	Description
approval-request-controller/go.mod	Module definition with invalid Go version 1.24.9
approval-request-controller/pkg/controller/controller.go	Main approval logic that watches ApprovalRequests and auto-approves based on metrics
approval-request-controller/apis/metric/v1alpha1/*.go	Custom resource type definitions for MetricCollector, Reports, and WorkloadTrackers
metric-collector/go.mod	Module definition with invalid Go version 1.24.9
metric-collector/pkg/controller/*.go	Member cluster controller for collecting Prometheus metrics
/docker/.Dockerfile	Container build files using invalid Go 1.24 base images
/install-on-.sh	Installation scripts for hub and member cluster deployments
/charts/	Helm charts for deploying both controllers
/examples/	Example configurations for Prometheus, CRPs, and workload trackers
README.md	Comprehensive tutorial covering setup, architecture, and usage

...troller-metric-collector/approval-request-controller/examples/prometheus/prometheus-crp.yaml

approval-controller-metric-collector/metric-collector/install-on-member.sh

approval-controller-metric-collector/README.md

britaniar · 2025-12-10T18:46:43Z

approval-controller-metric-collector/README.md

+
+# Set up clusters (creates 1 hub + 3 member clusters)
+export MEMBER_CLUSTER_COUNT=3
+make setup-clusters


Why set up clusters this way instead of following our guide (https://kubefleet.dev/docs/getting-started/kind/)? Users who approach this might already have their own fleet set up, right? If they don't, they can use our guide to set one up.

approval-controller-metric-collector/README.md

britaniar · 2025-12-10T19:21:39Z

...r-metric-collector/approval-request-controller/apis/metric/v1alpha1/metriccollector_types.go

+	// ReportNamespace is the namespace in the hub cluster where the MetricCollectorReport will be created.
+	// This should be the fleet-member-{clusterName} namespace.
+	// Example: fleet-member-cluster-1
+	// +required


Should we add validation to make sure it starts with fleet-member-, similar to what you have for the URL?

MetricCollector and MetricCollectorReport are internal APIs
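
For reference, a minimal sketch of the kind of check being discussed, written as a plain Go function. The function and constant names are hypothetical; the only detail taken from the diff is the fleet-member-{clusterName} format. A CRD-level pattern marker would be an alternative way to express the same rule.

```go
package main

import (
	"fmt"
	"strings"
)

// memberNamespacePrefix follows the fleet-member-{clusterName} convention
// quoted in the field comment.
const memberNamespacePrefix = "fleet-member-"

// validateReportNamespace (hypothetical helper) rejects namespaces that do
// not start with the prefix or that carry no cluster name after it.
func validateReportNamespace(ns string) error {
	if !strings.HasPrefix(ns, memberNamespacePrefix) {
		return fmt.Errorf("report namespace %q must start with %q", ns, memberNamespacePrefix)
	}
	if strings.TrimPrefix(ns, memberNamespacePrefix) == "" {
		return fmt.Errorf("report namespace %q is missing a cluster name", ns)
	}
	return nil
}

func main() {
	for _, ns := range []string{"fleet-member-cluster-1", "default", "fleet-member-"} {
		fmt.Printf("%-24s valid=%v\n", ns, validateReportNamespace(ns) == nil)
	}
}
```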

...r-metric-collector/approval-request-controller/apis/metric/v1alpha1/metriccollector_types.go

britaniar · 2025-12-10T19:25:05Z

...r-metric-collector/approval-request-controller/apis/metric/v1alpha1/metriccollector_types.go

+	CollectedMetrics []WorkloadMetrics `json:"collectedMetrics,omitempty"`
+}
+
+// WorkloadMetrics represents metrics collected from a single workload pod.


Could you give an example of the metric with its labels so we can see how it is directly applied to each of the fields below?
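
To make the question concrete, here is one possible shape of such a sample and its mapping. It assumes a gauge like `workload_health{namespace="test-ns", app="web", cluster_name="cluster-1"} 1`. The `namespace` and `app` labels appear in the collector code in this PR; the `cluster_name` label and the `Healthy` field are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"strconv"
)

// workloadMetrics is a hypothetical, trimmed-down version of the API type.
type workloadMetrics struct {
	Namespace    string
	WorkloadName string
	ClusterName  string
	Healthy      bool
}

// fromSample maps one Prometheus sample (its labels plus its string value)
// onto the struct fields. A value of 1 is treated as healthy.
func fromSample(labels map[string]string, value string) (workloadMetrics, error) {
	v, err := strconv.ParseFloat(value, 64)
	if err != nil {
		return workloadMetrics{}, fmt.Errorf("parsing sample value %q: %w", value, err)
	}
	return workloadMetrics{
		Namespace:    labels["namespace"],
		WorkloadName: labels["app"],
		ClusterName:  labels["cluster_name"], // assumed label name
		Healthy:      v == 1,
	}, nil
}

func main() {
	labels := map[string]string{
		"namespace":    "test-ns",
		"app":          "web",
		"cluster_name": "cluster-1",
	}
	m, err := fromSample(labels, "1")
	fmt.Printf("%+v err=%v\n", m, err)
}
```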

weng271190436 · 2025-12-10T18:36:09Z

approval-controller-metric-collector/metric-collector/pkg/controller/collector.go

+	workloadMetrics := make([]localv1alpha1.WorkloadMetrics, 0, len(data.Result))
+	for _, res := range data.Result {
+		namespace := res.Metric["namespace"]
+		workloadName := res.Metric["app"]


If the user wants to use a label key other than "app", it won't work? Wondering whether it is a valid assumption that everyone will use "app".
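
One way to relax that assumption is to make the label key configurable with "app" as the default. The sketch below is hypothetical: `workloadNameFrom` and its `labelKey` parameter are not part of this PR, and `app_kubernetes_io_name` is only an example of a relabeled Kubernetes label.

```go
package main

import "fmt"

// defaultWorkloadLabel is the key the collector currently reads.
const defaultWorkloadLabel = "app"

// workloadNameFrom (hypothetical helper) looks up the workload name under a
// caller-supplied label key and falls back to "app" when no key is given.
// The boolean is false when the label is absent or empty.
func workloadNameFrom(labels map[string]string, labelKey string) (string, bool) {
	if labelKey == "" {
		labelKey = defaultWorkloadLabel
	}
	name := labels[labelKey]
	return name, name != ""
}

func main() {
	labels := map[string]string{"namespace": "test-ns", "app_kubernetes_io_name": "web"}

	name, ok := workloadNameFrom(labels, "")
	fmt.Printf("default key: name=%q found=%v\n", name, ok)

	name, ok = workloadNameFrom(labels, "app_kubernetes_io_name")
	fmt.Printf("custom key:  name=%q found=%v\n", name, ok)
}
```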

weng271190436 · 2025-12-10T18:42:28Z

...ic-collector/approval-request-controller/apis/metric/v1alpha1/metriccollectorreport_types.go

+// Namespace: fleet-member-{clusterName} (extracted from CollectedMetrics[0].ClusterName)
+// Name: Same as MetricCollector name
+// All metrics in CollectedMetrics are guaranteed to have the same ClusterName.
+type MetricCollectorReport struct {


why doesn't this CR have spec and status? Feel like Conditions should be part of the Status and WorkloadsMonitored should be part of the spec

MetricCollectorReport is just an information source in the current implementation, hence there is no desired state (spec) and no corresponding status

weng271190436 · 2025-12-10T18:43:00Z

...r-metric-collector/approval-request-controller/apis/metric/v1alpha1/workloadtracker_types.go

+// The name of this resource should match the name of the ClusterStagedUpdateRun it is used for.
+// For example, if the ClusterStagedUpdateRun is named "example-cluster-staged-run", the
+// ClusterStagedWorkloadTracker should also be named "example-cluster-staged-run".
+type ClusterStagedWorkloadTracker struct {


why doesn't this have any status?

Both WorkloadTracker objects are information sources for the approval controller that let users specify which workloads to track, hence no spec/status

weng271190436 · 2025-12-10T18:43:06Z

...r-metric-collector/approval-request-controller/apis/metric/v1alpha1/workloadtracker_types.go

+// The name and namespace of this resource should match the name and namespace of the StagedUpdateRun it is used for.
+// For example, if the StagedUpdateRun is named "example-staged-run" in namespace "test-ns", the
+// StagedWorkloadTracker should also be named "example-staged-run" in namespace "test-ns".
+type StagedWorkloadTracker struct {


why doesn't this have any status?

#6 (comment)

weng271190436 · 2025-12-10T18:46:44Z

...st-controller/config/crd/bases/metric.kubernetes-fleet.io_clusterstagedworkloadtrackers.yaml

+            items:
+              description: WorkloadReference represents a workload to be tracked
+              properties:
+                name:


Should the workload names be part of the additional printer columns to make it easier for kubectl users to check this tracker?

weng271190436 · 2025-12-10T20:01:43Z

approval-controller-metric-collector/metric-collector/pkg/controller/collector.go

+	fullURL := fmt.Sprintf("%s?%s", queryURL, params.Encode())
+
+	// Create request
+	req, err := http.NewRequestWithContext(ctx, "GET", fullURL, nil)


This new request with context might not respect the httpClient 30-second timeout?
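
On the timeout question: in Go's net/http, `http.Client.Timeout` still applies to a request built with `http.NewRequestWithContext`, so the request ends at whichever limit is reached first. The sketch below demonstrates this against a local test server; the helper names and the millisecond values are made up for the demo, with short timeouts standing in for the 30-second one.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// get issues a GET bounded by both the client's Timeout and the context.
func get(ctx context.Context, client *http.Client, url string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// timedOut starts a local server that answers after serverDelayMs and reports
// whether a request failed under the given client and context timeouts.
func timedOut(serverDelayMs, clientTimeoutMs, ctxTimeoutMs int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(time.Duration(serverDelayMs) * time.Millisecond):
		case <-r.Context().Done(): // the client gave up
		}
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(ctxTimeoutMs)*time.Millisecond)
	defer cancel()

	client := &http.Client{Timeout: time.Duration(clientTimeoutMs) * time.Millisecond}
	return get(ctx, client, srv.URL) != nil
}

func main() {
	fmt.Println("short client timeout fails:", timedOut(300, 50, 5000))
	fmt.Println("short context deadline fails:", timedOut(300, 5000, 50))
	fmt.Println("generous limits fail:", timedOut(10, 5000, 5000))
}
```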

weng271190436 · 2025-12-10T20:03:46Z

approval-controller-metric-collector/metric-collector/pkg/controller/collector.go

+		return nil, fmt.Errorf("Prometheus query failed: %s", result.Error)
+	}
+
+	return result.Data, nil


Is this already PrometheusData? And you don't need to return interface{} and type cast?
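
A sketch of what a typed return could look like, so callers never need a type assertion. The struct names here are hypothetical stand-ins for the PR's types; the JSON shape follows the Prometheus HTTP API instant-query response (status, data.resultType, data.result).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// promSample is one element of an instant-query vector result.
type promSample struct {
	Metric map[string]string `json:"metric"`
	Value  []any             `json:"value"` // [unix timestamp, "value as string"]
}

// promData is the typed "data" payload of a query response.
type promData struct {
	ResultType string       `json:"resultType"`
	Result     []promSample `json:"result"`
}

// promResponse is the response envelope.
type promResponse struct {
	Status string   `json:"status"`
	Error  string   `json:"error,omitempty"`
	Data   promData `json:"data"`
}

// sampleBody is a hand-written example response used only for this demo.
const sampleBody = `{"status":"success","data":{"resultType":"vector","result":[
  {"metric":{"__name__":"workload_health","namespace":"test-ns","app":"web"},
   "value":[1733856000,"1"]}]}}`

// parseQueryResponse decodes the body and returns the typed data directly.
func parseQueryResponse(body []byte) (*promData, error) {
	var resp promResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decoding Prometheus response: %w", err)
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("prometheus query failed: %s", resp.Error)
	}
	return &resp.Data, nil
}

func main() {
	data, err := parseQueryResponse([]byte(sampleBody))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range data.Result {
		fmt.Println(s.Metric["namespace"], s.Metric["app"], s.Value[1])
	}
}
```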

...ontroller-metric-collector/approval-request-controller/cmd/approvalrequestcontroller/main.go

weng271190436 · 2025-12-10T20:11:12Z

approval-controller-metric-collector/metric-collector/pkg/controller/controller.go

+	}
+
+	// Update status with reporting condition
+	if err := r.MemberClient.Status().Update(ctx, mc); err != nil {


Why don't we update the mc status one time near the end instead of updating twice?

The two status updates help us identify two different states:

Successful metric collection -> update status

Successful creation of the MetricCollectorReport on the hub -> update status
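
For comparison, a sketch of the single-update alternative raised in the question: both states are still recorded as separate conditions, but the status is written once on every exit path. Everything here is a hypothetical stand-in (the `condition` type, `setCondition`, the condition names), not the controller code in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// condition is a trimmed-down, hypothetical stand-in for a status condition.
type condition struct {
	Type    string
	Status  bool
	Message string
}

// errDemo is a placeholder failure used by the demo.
var errDemo = errors.New("demo failure")

// setCondition replaces a condition of the same type or appends a new one.
func setCondition(conds []condition, c condition) []condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// reconcileOnce records the outcome of each step as its own condition and
// calls writeStatus exactly once, whichever step fails or succeeds.
func reconcileOnce(collect, report func() error, writeStatus func([]condition)) {
	var conds []condition
	defer func() { writeStatus(conds) }()

	if err := collect(); err != nil {
		conds = setCondition(conds, condition{Type: "Collected", Message: err.Error()})
		return
	}
	conds = setCondition(conds, condition{Type: "Collected", Status: true})

	if err := report(); err != nil {
		conds = setCondition(conds, condition{Type: "Reported", Message: err.Error()})
		return
	}
	conds = setCondition(conds, condition{Type: "Reported", Status: true})
}

func main() {
	ok := func() error { return nil }
	fail := func() error { return errDemo }
	reconcileOnce(ok, fail, func(c []condition) { fmt.Printf("%+v\n", c) })
}
```

The trade-off is that a crash between the two steps leaves no record of the completed collection, which is the case the two separate updates cover.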

weng271190436 · 2025-12-10T20:12:07Z

approval-controller-metric-collector/metric-collector/pkg/controller/controller.go

+	}
+
+	// Create or update MetricCollectorReport on hub
+	report := &localv1alpha1.MetricCollectorReport{


Should this have an owner reference?

michaelawyu

I've added some comments, PTAL

michaelawyu · 2025-12-11T14:17:54Z

...oller-metric-collector/approval-request-controller/apis/metric/v1alpha1/groupversion_info.go

+
+var (
+	// GroupVersion is group version used to register these objects
+	GroupVersion = schema.GroupVersion{Group: "metric.kubernetes-fleet.io", Version: "v1alpha1"}


Hi Arvind! Just a nit: I fear that the name (metric.kubernetes-fleet.io) might be a bit confusing.

michaelawyu · 2025-12-11T14:23:04Z

...r-metric-collector/approval-request-controller/apis/metric/v1alpha1/metriccollector_types.go

+
+	// ClusterName from the workload_health metric label.
+	// +required
+	ClusterName string `json:"clusterName"`


Hi Arvind! Will all the workloads in one object (or the same cluster) share the same value for this field?

michaelawyu · 2025-12-11T14:25:27Z

...r-metric-collector/approval-request-controller/apis/metric/v1alpha1/metriccollector_types.go

+
+// WorkloadMetrics represents metrics collected from a single workload pod.
+type WorkloadMetrics struct {
+	// Namespace is the namespace of the pod.


Hi Arvind! This comment mentions that the field is about the NS of a pod, but the Workload name field below has a comment that says the field is (typically) about a Deployment. Is this by design?

I am just a bit confused: so we are reporting per pod health but we list the parent of the pods (e.g., a Deployment) as well?

michaelawyu · 2025-12-11T14:28:57Z

...r-metric-collector/approval-request-controller/apis/metric/v1alpha1/workloadtracker_types.go

+)
+
+// WorkloadReference represents a workload to be tracked
+type WorkloadReference struct {


Hi Arvind! This is a NS/name only identifier, are we supporting only a specific API (e.g., Deployment)?

michaelawyu · 2025-12-11T14:29:18Z

...uest-controller/templates/crds/metric.kubernetes-fleet.io_clusterstagedworkloadtrackers.yaml

@@ -0,0 +1 @@
+../../../../config/crd/bases/metric.kubernetes-fleet.io_clusterstagedworkloadtrackers.yaml


Hi Arvind! Some nits: trailing empty lines.

Same for some other files below.

michaelawyu · 2025-12-11T15:06:22Z

approval-controller-metric-collector/metric-collector/cmd/metriccollector/main.go

+
+// buildHubConfig creates hub cluster config from environment variables
+// following the same pattern as member-agent
+func buildHubConfig() (*rest.Config, error) {


Hi Arvind! We had a sync about this; so this part could use some simplification I think.

michaelawyu · 2025-12-11T15:10:20Z

approval-controller-metric-collector/metric-collector/pkg/controller/collector.go

+	// Create Prometheus client without auth (simplified)
+	promClient := NewPrometheusClient(mc.Spec.PrometheusURL, "", nil)
+
+	query := buildPromQLQuery(mc)


Hi Arvind! I am a bit confused about this (and apologies if I missed anything), but this is just querying for a metric with a static name? But this does not seem to align with the API?

michaelawyu · 2025-12-11T15:13:55Z

...r-metric-collector/approval-request-controller/apis/metric/v1alpha1/metriccollector_types.go

+	// This should be the fleet-member-{clusterName} namespace.
+	// Example: fleet-member-cluster-1
+	// +required
+	ReportNamespace string `json:"reportNamespace"`


Hi Arvind! The whole purpose of this field is to let the member-side controller know where to write the metric collection result?

Recall that for security reasons member clients are restricted in what namespaces they could access in KubeFleet; the namespace is set when the environment is spun up, it's not really a variable per se.
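
In other words, the namespace could be derived once from the environment instead of being carried in the API. A minimal sketch under that assumption follows; the `MEMBER_CLUSTER_NAME` variable and the helper name are hypothetical, and only the fleet-member-{clusterName} format comes from the field comment.

```go
package main

import (
	"fmt"
	"os"
)

// clusterNameEnv is a hypothetical environment variable set at deploy time.
const clusterNameEnv = "MEMBER_CLUSTER_NAME"

// reportNamespaceFor derives the hub namespace for a member cluster using
// the fleet-member-{clusterName} convention.
func reportNamespaceFor(clusterName string) (string, error) {
	if clusterName == "" {
		return "", fmt.Errorf("cluster name is empty; set %s", clusterNameEnv)
	}
	return "fleet-member-" + clusterName, nil
}

func main() {
	name := os.Getenv(clusterNameEnv)
	if name == "" {
		name = "cluster-1" // demo fallback so the example prints something
	}
	ns, err := reportNamespaceFor(name)
	fmt.Println(ns, err)
}
```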

michaelawyu · 2025-12-11T15:17:16Z

approval-controller-metric-collector/metric-collector/pkg/controller/controller.go

+			Message:            "Collector is configured",
+		})
+		meta.SetStatusCondition(&mc.Status.Conditions, metav1.Condition{
+			Type:               localv1alpha1.MetricCollectorConditionTypeCollecting,


Hi Arvind! A nit: this seems to be more aligned with Collected.

michaelawyu · 2025-12-11T15:18:57Z

approval-controller-metric-collector/metric-collector/pkg/controller/controller.go

+	if err := r.syncReportToHub(ctx, mc); err != nil {
+		klog.ErrorS(err, "Failed to sync MetricCollectorReport to hub", "metricCollector", req.Name)
+		meta.SetStatusCondition(&mc.Status.Conditions, metav1.Condition{
+			Type:               localv1alpha1.MetricCollectorConditionTypeReported,


Hi Arvind! I have to say I concur with Wei; the distinction between Collected and Reported seems to be a bit unwarranted. It's not wrong obviously, but...

michaelawyu · 2025-12-11T15:23:02Z

Hi Arvind! Just some of my two cents on the high level:

a) Arch-wise the design seems to be a bit too complex: for example, the whole metric data passing process can be done easily with one API but now it uses two separate APIs + the CRP/override API to complete the job.

michaelawyu · 2025-12-11T15:26:22Z

b) I understand that it's demo code, so we want to focus more on the showcasing side, and that's probably the reason why the controller basically expects one static metric (gauge type) from the host cluster. If that's the case, we should be quite straightforward about this in the code and in the doc, and the API should get greatly simplified. Alternatively, we could allow users to specify custom queries, which would make the code more useful (and more complex, of course).
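
The two options in (b) can be sketched side by side: a fixed selector on the `workload_health` gauge, or a user-supplied query that takes precedence. The function names and parameters below are hypothetical, not the `buildPromQLQuery` from this PR. The sketch assumes simple label values, does not validate the label key, and uses the Prometheus instant-query endpoint `/api/v1/query`.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQuery returns the custom query when one is given; otherwise it builds
// a selector on the static workload_health gauge for one workload.
func buildQuery(custom, namespace, labelKey, workload string) string {
	if custom != "" {
		return custom
	}
	// %q quotes the values; adequate for simple label values.
	return fmt.Sprintf(`workload_health{namespace=%q,%s=%q}`, namespace, labelKey, workload)
}

// queryURL builds the instant-query URL with the query properly encoded.
func queryURL(base, query string) string {
	params := url.Values{}
	params.Set("query", query)
	return fmt.Sprintf("%s/api/v1/query?%s", base, params.Encode())
}

func main() {
	static := buildQuery("", "test-ns", "app", "web")
	fmt.Println(static)
	fmt.Println(queryURL("http://prometheus.example:9090", static))

	custom := buildQuery(`min(workload_health{namespace="test-ns"})`, "", "", "")
	fmt.Println(custom)
}
```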

michaelawyu · 2025-12-11T15:29:27Z

c) The folder structure could use some work. I feel that an organization like our main repo would be more comprehensible; currently everything is a bit scattered (with soft links connecting the duplicates), e.g., the APIs are all kept on the approval controller side. Doc-wise, I fear that users without enough context might find it difficult to grasp what the demo is really for.

approval controller, metric collector controllers

7764719

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

Arvindthiru force-pushed the approvalController branch from e49b943 to 7764719 Compare December 10, 2025 09:24

Arvind Thirumurugan added 2 commits December 10, 2025 02:03

minor fixes

fc87b21

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

minor fixes

f3ea6d5

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

Arvindthiru marked this pull request as ready for review December 10, 2025 10:38

Copilot AI review requested due to automatic review settings December 10, 2025 10:38

Copilot started reviewing on behalf of Arvindthiru December 10, 2025 10:39 View session

Copilot AI reviewed Dec 10, 2025

britaniar reviewed Dec 10, 2025

weng271190436 reviewed Dec 10, 2025

address minor comments

3c8db43

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

Arvindthiru force-pushed the approvalController branch from c9bb19b to 3c8db43 Compare December 10, 2025 22:29

Arvind Thirumurugan added 5 commits December 10, 2025 16:47

address minor comment

e85a7a0

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

address minor comment

23ac827

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

add birdeye view section

018cacb

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

add context about kind-clusters

b8448b3

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

fix image

097e14a

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

michaelawyu reviewed Dec 11, 2025

Arvind Thirumurugan added 4 commits December 11, 2025 15:29

ensure script works for all clusters

164a25d

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

simplify metric-collector

cac604c

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

simplify approval controller

69a8ed4

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>

update README.md

76de5f3

Signed-off-by: Arvind Thirumurugan <arvindth@microsoft.com>
