[dev-v2.9] Add ServiceMonitor for scraping fleet-controller in rancher-monitoring chart #3949

Merged: 3 commits, Jun 6, 2024

@p-se p-se commented May 21, 2024

Issue:

The direct issue is this one: rancher/fleet#2295
The whole story is here: rancher/fleet#1408

The PR that introduced metrics into fleet: rancher/fleet#2172

The changes have been merged into Fleet v0.10.0-rc.13. Fleet 0.10 is planned to be released with Rancher 2.9.

Problem

Further monitoring additions that build on the newly introduced fleet metrics require Prometheus to scrape the fleet-controllers. This takes an additional ServiceMonitor that points to the Kubernetes services created by the fleet chart, which in turn expose the fleet-controller metrics.

Solution

An additional ServiceMonitor is created when the rancher-monitoring chart is installed, so that the Prometheus instance it installs is automatically configured to scrape the fleet-controllers.

This makes it possible to add further monitoring capabilities to Rancher through the rancher-monitoring chart, for instance Prometheus alerts or Grafana dashboards. The latter may be embedded into Rancher, in the same way that the existing Grafana dashboards are embedded and displayed in the Rancher UI when the rancher-monitoring chart is installed.
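
As a rough illustration of the kind of resource described above, a ServiceMonitor for the fleet-controller could look like the sketch below. This is not the manifest added by this PR: the resource name, both namespaces, the `app: fleet-controller` label and the `metrics` port name are assumptions and have to match the Service that the fleet chart actually creates.

```yaml
# Hypothetical sketch, not the manifest from this PR.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fleet-controller              # assumed name
  namespace: cattle-monitoring-system # assumed: namespace of rancher-monitoring
spec:
  namespaceSelector:
    matchNames:
      - cattle-fleet-system           # assumed: namespace where fleet runs
  selector:
    matchLabels:
      app: fleet-controller           # assumed label on the fleet metrics Service
  endpoints:
    - port: metrics                   # assumed name of the Service port
      path: /metrics
      interval: 30s                   # illustrative scrape interval
```

The prometheus-operator only turns such a resource into a scrape job if the ServiceMonitor selectors of the Prometheus instance include it, so the labels and namespace may need adjusting for a given installation.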

Testing

  • On a cluster with Rancher and a fleet version >= v0.10.0-rc.13, install the rancher-monitoring chart that includes the changes of this PR.

  • Open the Prometheus UI, navigate to Targets and check for fleet-controller.

  • To test metrics with sharding enabled in Fleet (a feature that was also first introduced in v0.10.0-rc.13), use a fleet version that includes "metrics: make sure metrics work well with sharding" (fleet#2420), which, at the time of writing, is not yet in an RC of fleet. Fleet also needs to be deployed with sharding enabled, as described in the fleet-docs.

Engineering Testing

Manual Testing

Performed as described in Testing, including testing with sharding enabled in fleet.

Automated Testing

The initial PR adds E2E tests that check the fleet-controller exposed metrics through the helm chart generated services (when fleet is installed). Those tests do not cover the usage of a ServiceMonitor as introduced in this PR. Further PRs have followed to extend and improve testing of metrics in fleet:

QA Testing Considerations

Regressions Considerations

The probability of this change introducing regressions is low, as it simply extends already implemented functionality by a rather simple resource, which is part of the rancher-monitoring-crd chart.

For some more context, the ServiceMonitor is a custom Kubernetes resource that belongs to the prometheus-operator. The controller watches the resource and configures Prometheus to scrape an additional target, which in this case is fleet. If anything inside this resource is wrong, it is not expected to affect any other resources of the same kind. It would be surprising if scraping this amount of additional metrics had a significant performance impact, but in the long term it could increase the storage space required for metrics. That said, Prometheus is by default configured to retain data for only 15 days (and the rancher-monitoring chart sets a default retention size of 50G), so this aspect should also be negligible. The scraped metrics could potentially conflict with other metrics, which is why they are prefixed with fleet_, making conflicts virtually impossible.
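
The retention settings mentioned above bound how much the additional fleet_ series can add to storage. Assuming rancher-monitoring keeps the values layout of the upstream kube-prometheus-stack chart, they would be set roughly as in the fragment below. The numbers mirror the ones in this comment and are not read from the chart.

```yaml
# Assumed values layout (kube-prometheus-stack style); numbers are illustrative.
prometheus:
  prometheusSpec:
    retention: 15d       # time-based retention (15 days, as stated above)
    retentionSize: 50GiB # size-based retention (the "50G" above; unit spelling is an assumption)
```

Whichever limit is reached first applies, so the extra fleet metrics cannot grow storage beyond the configured size.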

Backporting considerations

This change does not need to be backported to other versions. This is a new feature in fleet and no plans exist to backport it.

@p-se p-se self-assigned this May 21, 2024
@p-se p-se requested a review from a team as a code owner May 21, 2024 08:13

Validation steps

  • Ensure all container images have repository and tag on the same level, so that all container images are included in rancher-images.txt, which is used by airgap customers.
  Example:
    longhorn-controller:
      repository: rancher/hardened-sriov-cni
      tag: v2.6.3-build20230913
  
  • Add a 👍 (thumbs up) reaction to this comment once done. CI won't pass without this reaction to the github-action bot's latest validation comment.
  • Approve the PR to run the CI check.

github-actions bot commented Jun 6, 2024

Validation steps

  • Ensure all container images have repository and tag on the same level, so that all container images are included in rancher-images.txt, which is used by airgap customers.
  Example:
    longhorn-controller:
      repository: rancher/hardened-sriov-cni
      tag: v2.6.3-build20230913
  
  • Add a 👍 (thumbs up) reaction to this comment once done. CI won't pass without this reaction to the github-action bot's latest validation comment.
  • Approve the PR to run the CI check.

@p-se p-se changed the title Add ServiceMonitor for scraping fleet-controller in rancher-monitoring chart [dev-2.9] Add ServiceMonitor for scraping fleet-controller in rancher-monitoring chart Jun 6, 2024
@p-se p-se changed the title [dev-2.9] Add ServiceMonitor for scraping fleet-controller in rancher-monitoring chart [dev-v2.9] Add ServiceMonitor for scraping fleet-controller in rancher-monitoring chart Jun 6, 2024

thehejik commented Jun 6, 2024

Test report

Apart from rancher-monitoring 104.0.0-rc1+up45.31.1 being broken, so that it cannot be installed directly into Rancher v2.9-b456233ab32b27b221d14244df7b0223eacfe078-head, the PR works as expected. I could see the fleet-controller target in Prometheus.

As a workaround, I upgraded from the previous version 103.1.0+up45.31.1, which worked.

Environment

  • single node k3s v1.28.6+k3s2 local cluster with rancher v2.9-b456233ab32b27b221d14244df7b0223eacfe078-head
  • Fleet version: fleet:104.0.0+up0.10.0-rc.14

Test

  • Enable Include Prerelease versions in Rancher Preferences.
  • Add the App repository from this PR (git repo: https://github.com/p-se/rancher-charts.git, branch: add-fleet-smon).
  • First install monitoring 103.1.0+up45.31.1, then upgrade to 104.0.0-rc1+up45.31.1 from the repository defined above.
  • On the local cluster, go to Monitoring -> Prometheus Targets; after a while, a new metrics endpoint for fleet appears there.

thehejik commented Jun 6, 2024

The installation problem of 104 has been fixed by #4026

Now I could install the version directly and the fleet target is there.

@thehejik thehejik merged commit c74ab29 into rancher:dev-v2.9 Jun 6, 2024
6 checks passed
skanakal pushed a commit to skanakal/charts that referenced this pull request Jun 7, 2024

p-se commented Jun 21, 2024

Relates to rancher/fleet#2460

krunalhinguu pushed a commit to krunalhinguu/charts that referenced this pull request Jul 15, 2024