approval controller, metric collector controllers #6
base: main
Conversation
Pull request overview
This PR introduces a comprehensive solution for automating approval decisions in KubeFleet staged rollouts based on workload health metrics from Prometheus. The implementation adds two standalone controllers (approval-request-controller on hub, metric-collector on members) and four custom resources to enable automated staged rollout approvals.
Key Changes:
- Two standalone Kubernetes controllers for metric-based approval automation
- Four new CRDs for metric collection and workload tracking
- Complete documentation and installation scripts for both controllers
- Integration with KubeFleet v0.1.2 for staged update orchestration
Reviewed changes
Copilot reviewed 64 out of 67 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| approval-request-controller/go.mod | Module definition with invalid Go version 1.24.9 |
| approval-request-controller/pkg/controller/controller.go | Main approval logic that watches ApprovalRequests and auto-approves based on metrics |
| approval-request-controller/apis/metric/v1alpha1/*.go | Custom resource type definitions for MetricCollector, Reports, and WorkloadTrackers |
| metric-collector/go.mod | Module definition with invalid Go version 1.24.9 |
| metric-collector/pkg/controller/*.go | Member cluster controller for collecting Prometheus metrics |
| /docker/.Dockerfile | Container build files using invalid Go 1.24 base images |
| /install-on-.sh | Installation scripts for hub and member cluster deployments |
| /charts/ | Helm charts for deploying both controllers |
| /examples/ | Example configurations for Prometheus, CRPs, and workload trackers |
| README.md | Comprehensive tutorial covering setup, architecture, and usage |
# Set up clusters (creates 1 hub + 3 member clusters)
export MEMBER_CLUSTER_COUNT=3
make setup-clusters
Why choose this way to set up clusters instead of our guide (https://kubefleet.dev/docs/getting-started/kind/)? Technically, when users approach this they might already have their own fleet set up, right? If they don't, they can use our guide to set one up?
// ReportNamespace is the namespace in the hub cluster where the MetricCollectorReport will be created.
// This should be the fleet-member-{clusterName} namespace.
// Example: fleet-member-cluster-1
// +required
Should we add validation to make sure it is fleet-member-something, similar to what you have for the URL?
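For illustration, a rough sketch of what that could look like with a kubebuilder pattern marker (the regex and the `MetricCollectorSpec` wrapper name are assumptions, not taken from this PR):

```go
type MetricCollectorSpec struct { // wrapper name assumed for this sketch
	// ReportNamespace is the namespace in the hub cluster where the MetricCollectorReport will be created.
	// This should be the fleet-member-{clusterName} namespace.
	// Example: fleet-member-cluster-1
	// Possible validation marker (the regex itself is an assumption):
	// +kubebuilder:validation:Pattern=`^fleet-member-[a-z0-9]([-a-z0-9]*[a-z0-9])?$`
	// +required
	ReportNamespace string `json:"reportNamespace"`
}
```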
MetricCollector and MetricCollectorReport are internal APIs
    CollectedMetrics []WorkloadMetrics `json:"collectedMetrics,omitempty"`
}

// WorkloadMetrics represents metrics collected from a single workload pod.
Could you give an example of the metric with its labels so we can see how it is directly applied to each of the fields below?
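For illustration only, a hypothetical sample and how its labels could map onto the fields: the `workload_health` name comes from the API comments, the `namespace`/`app` label keys from the collector code quoted below, and the `cluster` label key and value semantics are assumptions.

```go
package main

import "fmt"

func main() {
	// Hypothetical Prometheus sample:
	//   workload_health{namespace="shop", app="cart", cluster="cluster-1"} 1
	// expressed here as its label set.
	labels := map[string]string{
		"namespace": "shop",      // -> WorkloadMetrics namespace field
		"app":       "cart",      // -> WorkloadMetrics workload name field
		"cluster":   "cluster-1", // -> WorkloadMetrics ClusterName field (label key assumed)
	}
	fmt.Printf("namespace=%s workload=%s cluster=%s value=1 (gauge, assumed healthy)\n",
		labels["namespace"], labels["app"], labels["cluster"])
}
```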
    workloadMetrics := make([]localv1alpha1.WorkloadMetrics, 0, len(data.Result))
    for _, res := range data.Result {
        namespace := res.Metric["namespace"]
        workloadName := res.Metric["app"]
If the user wants to use a different label key other than "app", it won't work? Wondering if it is a valid assumption that everyone will use "app".
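A minimal sketch of one way to relax that assumption; making the key configurable (e.g., via the MetricCollector spec) is only a suggestion, not something this PR implements:

```go
// workloadNameFromLabels reads the workload name from a Prometheus label set.
// The label key could come from the spec; "app" stays the default so the
// current behavior is preserved.
func workloadNameFromLabels(labels map[string]string, labelKey string) string {
	if labelKey == "" {
		labelKey = "app" // current hard-coded key in the PR
	}
	return labels[labelKey]
}
```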
// Namespace: fleet-member-{clusterName} (extracted from CollectedMetrics[0].ClusterName)
// Name: Same as MetricCollector name
// All metrics in CollectedMetrics are guaranteed to have the same ClusterName.
type MetricCollectorReport struct {
Why doesn't this CR have a spec and status? I feel like Conditions should be part of the status and WorkloadsMonitored should be part of the spec.
MetricCollectorReport is just an information source in the current implementation, hence no desired state (spec) and no corresponding status.
// The name of this resource should match the name of the ClusterStagedUpdateRun it is used for.
// For example, if the ClusterStagedUpdateRun is named "example-cluster-staged-run", the
// ClusterStagedWorkloadTracker should also be named "example-cluster-staged-run".
type ClusterStagedWorkloadTracker struct {
why doesn't this have any status?
Both WorkloadTracker objects are information sources for the approval controller that let users specify which workloads to track, hence no spec/status.
// The name and namespace of this resource should match the name and namespace of the StagedUpdateRun it is used for.
// For example, if the StagedUpdateRun is named "example-staged-run" in namespace "test-ns", the
// StagedWorkloadTracker should also be named "example-staged-run" in namespace "test-ns".
type StagedWorkloadTracker struct {
why doesn't this have any status?
items:
  description: WorkloadReference represents a workload to be tracked
  properties:
    name:
Should the workload names be part of the additional printer columns to make it easier for kubectl users to check this tracker?
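A sketch of what that suggestion could look like with kubebuilder markers placed above the existing tracker type; the first JSONPath is a placeholder and would need to match the tracker's actual schema:

```go
// These markers would sit directly above the existing ClusterStagedWorkloadTracker type.
// The ".workloads[0].name" path is a placeholder, not the real field layout.
// +kubebuilder:printcolumn:name="FirstWorkload",type="string",JSONPath=".workloads[0].name"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
```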
    fullURL := fmt.Sprintf("%s?%s", queryURL, params.Encode())

    // Create request
    req, err := http.NewRequestWithContext(ctx, "GET", fullURL, nil)
This new request with context might not be respecting the httpClient's 30-second timeout?
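For what it's worth, `http.Client.Timeout` bounds the whole request regardless of the request's context, but if an explicit per-query deadline at the call site is wanted, a derived context is one option. A minimal sketch (the 30s value simply mirrors the client timeout discussed here):

```go
package prometheus // illustrative package name

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// newQueryRequest builds a GET request whose context carries its own deadline,
// so the call stays bounded even if the http.Client timeout is ever changed.
func newQueryRequest(ctx context.Context, fullURL string) (*http.Request, context.CancelFunc, error) {
	queryCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	req, err := http.NewRequestWithContext(queryCtx, http.MethodGet, fullURL, nil)
	if err != nil {
		cancel()
		return nil, nil, fmt.Errorf("failed to build Prometheus request: %w", err)
	}
	return req, cancel, nil
}
```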
        return nil, fmt.Errorf("Prometheus query failed: %s", result.Error)
    }

    return result.Data, nil
Is this already PrometheusData? And you don't need to return interface{} and type cast?
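A minimal sketch of decoding straight into a concrete type so no `interface{}` or type assertion is needed; the type and field names here are illustrative, not the PR's exact definitions:

```go
package prometheus // illustrative; names are not the PR's exact definitions

import (
	"encoding/json"
	"fmt"
	"io"
)

type queryResponse struct {
	Status string    `json:"status"`
	Error  string    `json:"error"`
	Data   queryData `json:"data"`
}

type queryData struct {
	ResultType string            `json:"resultType"`
	Result     []json.RawMessage `json:"result"`
}

// decodeQueryResponse decodes a Prometheus /api/v1/query response into a typed
// value, so callers receive *queryData directly.
func decodeQueryResponse(body io.Reader) (*queryData, error) {
	var resp queryResponse
	if err := json.NewDecoder(body).Decode(&resp); err != nil {
		return nil, fmt.Errorf("failed to decode Prometheus response: %w", err)
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("Prometheus query failed: %s", resp.Error)
	}
	return &resp.Data, nil
}
```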
    }

    // Update status with reporting condition
    if err := r.MemberClient.Status().Update(ctx, mc); err != nil {
Why don't we update the mc status one time near the end instead of updating twice?
The two status updates help us identify two different states:
- Successful metric collection -> update status
- Successful creation of MetricCollectorReport on hub -> update status
    }

    // Create or update MetricCollectorReport on hub
    report := &localv1alpha1.MetricCollectorReport{
Should this have an owner reference?
michaelawyu left a comment
I've added some comments, PTAL
var (
    // GroupVersion is group version used to register these objects
    GroupVersion = schema.GroupVersion{Group: "metric.kubernetes-fleet.io", Version: "v1alpha1"}
Hi Arvind! Just a nit: I fear that the name (metric.kubernetes-fleet.io) might be a bit confusing.
    // ClusterName from the workload_health metric label.
    // +required
    ClusterName string `json:"clusterName"`
Hi Arvind! Will all the workloads in one object (or the same cluster) share the same value for this field?
// WorkloadMetrics represents metrics collected from a single workload pod.
type WorkloadMetrics struct {
    // Namespace is the namespace of the pod.
Hi Arvind! This comment mentions that the field is about the NS of a pod, but the Workload name field below has a comment that says the field is (typically) about a Deployment. Is this by design?
I am just a bit confused: so we are reporting per-pod health, but we list the parent of the pods (e.g., a Deployment) as well?
)

// WorkloadReference represents a workload to be tracked
type WorkloadReference struct {
Hi Arvind! This is a NS/name-only identifier; are we supporting only a specific API (e.g., Deployment)?
@@ -0,0 +1 @@
+../../../../config/crd/bases/metric.kubernetes-fleet.io_clusterstagedworkloadtrackers.yaml (no newline at end of file)
Hi Arvind! Some nits: trailing empty lines.
Same for some other files below.
// buildHubConfig creates hub cluster config from environment variables
// following the same pattern as member-agent
func buildHubConfig() (*rest.Config, error) {
Hi Arvind! We had a sync about this; so this part could use some simplification I think.
    // Create Prometheus client without auth (simplified)
    promClient := NewPrometheusClient(mc.Spec.PrometheusURL, "", nil)

    query := buildPromQLQuery(mc)
Hi Arvind! I am a bit confused about this (and apologies if I missed anything), but this is just querying for a metric with a static name? But this does not seem to align with the API?
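A small sketch of the two directions this thread points at: either be explicit that the query is a fixed `workload_health` lookup, or let users supply their own PromQL through the spec. The custom-query parameter is hypothetical, not part of this PR:

```go
// buildQuery returns the PromQL to run. An empty customQuery falls back to the
// static workload_health metric the current implementation assumes; a non-empty
// value would let users supply their own query (hypothetical option).
func buildQuery(customQuery string) string {
	if customQuery != "" {
		return customQuery
	}
	return "workload_health"
}
```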
    // This should be the fleet-member-{clusterName} namespace.
    // Example: fleet-member-cluster-1
    // +required
    ReportNamespace string `json:"reportNamespace"`
Hi Arvind! The whole purpose of this field is to let the member-side controller know where to write the metric collection result?
Recall that for security reasons member clients are restricted in what namespaces they could access in KubeFleet; the namespace is set when the environment is spun up, it's not really a variable per se.
| Message: "Collector is configured", | ||
| }) | ||
| meta.SetStatusCondition(&mc.Status.Conditions, metav1.Condition{ | ||
| Type: localv1alpha1.MetricCollectorConditionTypeCollecting, |
Hi Arvind! A nit: this seems to be more aligned with Collected.
    if err := r.syncReportToHub(ctx, mc); err != nil {
        klog.ErrorS(err, "Failed to sync MetricCollectorReport to hub", "metricCollector", req.Name)
        meta.SetStatusCondition(&mc.Status.Conditions, metav1.Condition{
            Type: localv1alpha1.MetricCollectorConditionTypeReported,
Hi Arvind! I have to say I concur with Wei; the distinction between Collected and Reported seems to be a bit unwarranted. It's not wrong obviously, but...
Hi Arvind! Just my two cents on the high level:

a) Arch-wise, the design seems a bit too complex: for example, the whole metric data passing process could be done easily with one API, but it now uses two separate APIs plus the CRP/override API to complete the job.

b) I understand that it's demo code, so we want to focus more on the showcasing side, and that's probably why the controller is basically expecting one static metric (gauge type) from the host cluster -> but if that's the case we should be quite straightforward about this in the code and in the doc, and the API should get greatly simplified. Alternatively, we could allow users to specify custom queries, which would make the code more useful (and more complex, of course).

c) The folder structure could use some work. I feel that an organization like our main repo would be more comprehensible; currently everything is a bit scattered (with soft links connecting the duplicates), e.g., the APIs are all kept on the approval controller side. Doc-wise, I fear that users without enough context might find it difficult to grasp what the demo is really for.
This PR introduces a complete solution for automating approval decisions in KubeFleet staged rollouts based on workload health metrics from Prometheus.
What's Added:
Two Standalone Controllers:
- approval-request-controller (hub cluster)
- metric-collector (member clusters)
Custom Resources:
- MetricCollector, MetricCollectorReport, ClusterStagedWorkloadTracker, StagedWorkloadTracker
Documentation:
- Main tutorial with complete end-to-end setup guide
- Controller-specific READMEs
- Example configurations for Prometheus, staged updates, and workload tracking
- Automated installation scripts for both hub and member clusters
Testing
Tested with KubeFleet v0.1.2 on kind clusters (1 hub + 3 members)