CP-23051: Change default kube-state-metrics behavior to use Cloudzero subchart #91

Merged
bdrennz merged 15 commits into cloudzero-ksm from cp-23051
Nov 13, 2024

Contributor

@bdrennz bdrennz commented Nov 6, 2024

Description

This PR does the following:

  • Enables the CloudZero KSM by default.
  • Modifies the KSM annotations so that the CloudZero KSM is not discoverable by other scrape jobs.
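
To illustrate why changing annotations hides a service from other scrape jobs, here is a minimal sketch of annotation-driven discovery. It assumes the other jobs use the common community convention of keeping only targets annotated `prometheus.io/scrape: "true"`; the annotation sets shown are hypothetical and are not taken from this PR's diff.

```python
from typing import Mapping

# Assumption: generic scrape jobs key off this conventional annotation.
SCRAPE_ANNOTATION = "prometheus.io/scrape"


def is_discoverable(annotations: Mapping[str, str]) -> bool:
    """Return True if an annotation-based scrape job would keep this target."""
    # Relabel "keep" rules match the literal value, so compare exactly.
    return annotations.get(SCRAPE_ANNOTATION) == "true"


# Hypothetical annotation sets for the KSM service before and after the change.
before = {"prometheus.io/scrape": "true"}
after = {}  # the conventional annotation is no longer present

print(is_discoverable(before))  # True
print(is_discoverable(after))   # False
```

With the conventional annotation absent, a generic job never selects the service, while a job that targets it by a fixed address can still scrape it.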

Testing

  • Test Case 1: Validate that the CloudZero KSM is deployed with the correct name:
    [Screenshot 2024-11-06 at 10 22 00 AM]
  • Test Case 2: Validate that other observability services cannot scrape the CloudZero KSM, given the annotation change:
    To test this, I installed vanilla Prometheus alongside the Agent in a cluster and checked the Prometheus UI to confirm that the CloudZero KSM target did not appear. It did not.
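
The manual check in Test Case 2 could also be scripted against the Prometheus HTTP API, whose `/api/v1/targets` endpoint lists active targets. The following is a rough sketch only: the sample payload is hand-written in the shape of that response, and the service name `cloudzero-state-metrics` is a hypothetical placeholder rather than the name used by the chart.

```python
from typing import List


def find_targets(targets_payload: dict, needle: str) -> List[str]:
    """Return scrape URLs of active targets whose URL or labels mention `needle`."""
    hits = []
    for target in targets_payload.get("data", {}).get("activeTargets", []):
        url = target.get("scrapeUrl", "")
        labels = target.get("labels", {})
        if needle in url or any(needle in value for value in labels.values()):
            hits.append(url)
    return hits


# Hand-written sample in the shape of GET /api/v1/targets (not real output).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "node-exporter", "instance": "10.0.0.5:9100"},
                "scrapeUrl": "http://10.0.0.5:9100/metrics",
                "health": "up",
            }
        ]
    },
}

# An empty result means the vanilla Prometheus did not discover the KSM service.
print(find_targets(sample, "cloudzero-state-metrics"))
```

In a real cluster the payload would come from the vanilla Prometheus instance, for example via a port-forward to its service.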

Checklist

  • I have added documentation for new/changed functionality in this PR
  • All active GitHub checks for tests, formatting, and security are passing
  • The correct base branch is being used, if not main

@bdrennz bdrennz requested a review from a team as a code owner November 6, 2024 18:23
@bdrennz bdrennz changed the title from "CP-23051:" to "CP-23051: Change default kube-state-metrics behavior to use Cloudzero subchart" Nov 6, 2024
Collaborator

@dmepham dmepham left a comment

lgtm! this will be released as a beta version, right? will it be a minor or patch?

@conradcz

conradcz commented Nov 6, 2024

> lgtm! this will be released as a beta version, right? will it be a minor or patch?

I think a beta version is a good idea. Maybe OpenNet would be willing to give it a go for us.

Contributor

@josephbarnett josephbarnett left a comment

We also don't want to merge this to develop, but rather to a feature branch off of develop. This way the change doesn't reach the main branch until the beta is complete.

@bdrennz bdrennz changed the base branch from develop to cloudzero-ksm November 12, 2024 20:41
Contributor

@wreckedred wreckedred left a comment

Thank you - lgtm

@bdrennz bdrennz merged commit 9771c62 into cloudzero-ksm Nov 13, 2024
3 of 4 checks passed
@bdrennz bdrennz deleted the cp-23051 branch November 13, 2024 14:57
dmepham pushed a commit that referenced this pull request Nov 20, 2024
… subchart (#91)

* override KSM name

* enable ksm by default

* make cloudzero ksm undiscoverable

* improve documentation

* option 2 is not the default behavior

* fix indentation

* add line

* add documentation for changing the service port for cloudzero ksm

* disable cloudzero KSM as scrape target

* set default port

* fix endpoint

* use default port

* add release notes

* remove metric exporter documentation

* change beta version
dmepham added a commit that referenced this pull request Nov 20, 2024
* CP-23051: Change default kube-state-metrics behavior to use Cloudzero subchart (#91)

* override KSM name

* enable ksm by default

* CP-23388: Define Static KubeStateMetrics Target Endpoint (#99)

* add 1.0.2 release doc file

---------

Co-authored-by: bdrennz <146774453+bdrennz@users.noreply.github.com>
dmepham added a commit that referenced this pull request Nov 21, 2024
* CP-23051: Change default kube-state-metrics behavior to use Cloudzero subchart (#91)

* override KSM name

* enable ksm by default

* CP-23388: Define Static KubeStateMetrics Target Endpoint (#99)

* add 1.0.2 release doc file

---------

Co-authored-by: bdrennz <146774453+bdrennz@users.noreply.github.com>
dmepham added a commit that referenced this pull request Jan 28, 2025
* CP-22731: add insights-controller chart (#97)

* CP-22731: include cz-insights-controller as subchart

* increase replicacount for tag server

* CP-22731: add beta testing

* update release process for insights controller

* update release workflow

* make most resources off by default

* update readme

* use global for secret names

* incorporate changes from 0.0.30-beta

* add beta release doc

* use local chart for testing

---------

Co-authored-by: josephbarnett <joe.barnett@cloudzero.com>

* CP-22730: use correct pattern list in config

* CP-22730: update doc check location to match normal release path (#100)

* Update Chart.yaml to version 1.0.0-beta

* use latest insights-controller

* CP-23435: remove duplicate service account name in insights-controller chart (#103)

* CP-23426: use insights-controller service account for init job (#104)

* CP-23465: increase default replica count for insights controller (#106)

* CP-23423: add release doc for 1.0.1-beta release (#107)

* [CP-23425] add default remote write retries (#108)

* CP-23425: set default max retries

* update init job to work with long running scrapes

* increase wait time for scrape endpoint

* default batch size added

* increase wait time for init job

* adjust remote write threshold, add default resource values

* Release 1.0.2-beta (#109)

* CP-23051: Change default kube-state-metrics behavior to use Cloudzero subchart (#91)

* override KSM name

* enable ksm by default

* CP-23388: Define Static KubeStateMetrics Target Endpoint (#99)

* add 1.0.2 release doc file

---------

Co-authored-by: bdrennz <146774453+bdrennz@users.noreply.github.com>

* move release doc to correct location

* Update Chart.yaml to version 1.0.2-beta

* CP-22730: package both charts in beta release (#110)

* CP-22730: fix artifact name (#111)

* CP-22730: fix doc path for github release publish (#112)

* CP-23740 (Feature/1.0.3 beta release): Validate KSM Metrics at Install (#116)

* remove unused metric

* add kubemetrics

* bump chart version for beta

* use dev tag for validator

* fix endpoint var name

* allow github to bump version

* simplify metric logic

* update tag

* use dev tag for chart

* [CP-23429] merge insights-controller into main chart (#117)

* insights-controller added to agent chart

* [CP-23428] add helm chart for creating cert (#118)

* CP-23428: add certificate helm chart

* update with documentation comments

* Update charts/cloudzero-agent/README.md

Co-authored-by: Becki Lee <becki.lee@cloudzero.com>

* Update charts/cloudzero-agent/README.md

remove duplicate entry

Co-authored-by: Becki Lee <becki.lee@cloudzero.com>

* Update charts/cloudzero-agent/README.md

add period to end of sentence in readme

Co-authored-by: Becki Lee <becki.lee@cloudzero.com>

* PR suggestion for readme

* update config example

---------

Co-authored-by: Becki Lee <becki.lee@cloudzero.com>

* CP 24028 add insights controller scrape config (#120)

CP-24028: add scrape target for insights container
CP-22734: Bump insights image release version
Enhance README for helm repo management
Add release note for next beta version
Update release process for customer version numbers in betas

* Update Chart.yaml to version 1.0.0-beta-4

* CP 23892 add healthcheck (#121)

* CP-23892, CP-24009, CP-23959: release note
* add healthcheck support
* bump value of insights controller

* Update Chart.yaml to version 1.0.0-beta-5

* fix beta deploy script

* CP-24118: affinity settings, release notes (#122)

* CP-24118: add pod best effort affinity rule for distributing pod instances across nodes
* allow override of KSM in configuration
* add next release notes
* bump version of controller and validator
* fix table in release note

* CP-23452 Add recommended installation skills to README (#124)

* CP-24008: forward insights controller app metrics (#125)

* CP-24389: deprecate unused chart (#126)

* CP-20221: Labels and Annotations (#127)

* bump final version of insights controller
* Adding release notes for 1.0.0 release
* Adding cert troubleshooting guide

---------

Co-authored-by: Becki Lee <becki.lee@cloudzero.com>

* publish material for beta-6 (#128)

* update readme, add extra svc names to cloudzero-cert, add cloudzero-cert chart publish (#129)

* [CP-24464] default to create self-signed cert upon chart install (#130)

* default to create self-signed cert upon chart install

* Update charts/cloudzero-agent/docs/releases/1.0.0-beta-7.md

Co-authored-by: Becki Lee <becki.lee@cloudzero.com>

* Update charts/cloudzero-agent/docs/releases/1.0.0-beta-7.md

Co-authored-by: Becki Lee <becki.lee@cloudzero.com>

* Update charts/cloudzero-agent/docs/releases/1.0.0-beta-7.md

Co-authored-by: Becki Lee <becki.lee@cloudzero.com>

* Update charts/cloudzero-agent/README.md

Co-authored-by: Becki Lee <becki.lee@cloudzero.com>

* Update charts/cloudzero-agent/README.md

Co-authored-by: Becki Lee <becki.lee@cloudzero.com>

* Update charts/cloudzero-agent/README.md

Co-authored-by: JB <josephbarnett@users.noreply.github.com>

* Update charts/cloudzero-agent/README.md

Co-authored-by: JB <josephbarnett@users.noreply.github.com>

---------

Co-authored-by: Becki Lee <becki.lee@cloudzero.com>
Co-authored-by: JB <josephbarnett@users.noreply.github.com>

* enable new metric for insights controller failures (#132)

* CP-24424: change init scrape job to use new -backfill option (#131)

Previously, the scrape job would use curl to hit a /scrape HTTP
endpoint on the webhook server. This was problematic on larger clusters
where the operation takes a long time since the HTTP context was
getting cancelled before the operation completed.

This patch switches to using a new -backfill option on the controller
binary, which causes the binary to run the backfiller (née scraper) and
exit instead of acting as an HTTPd.
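
As a rough illustration of the pattern this commit message describes (one binary that either runs a one-shot job to completion or acts as a server), here is a minimal sketch. The function names and return values are hypothetical stand-ins; only the `-backfill` flag name comes from the commit message, and the controller's actual implementation is not shown in this PR.

```python
import argparse
from typing import List


def run_backfill() -> str:
    # Placeholder for the one-shot work; a real backfiller would walk cluster state.
    return "backfill"


def serve() -> str:
    # Placeholder for starting the webhook HTTP server (which would normally block).
    return "serve"


def main(argv: List[str]) -> str:
    parser = argparse.ArgumentParser(prog="controller")
    parser.add_argument("-backfill", dest="backfill", action="store_true",
                        help="run the backfiller once and exit")
    args = parser.parse_args(argv)
    # No HTTP request is involved in backfill mode, so no request context can
    # be cancelled while the long-running operation is still in progress.
    return run_backfill() if args.backfill else serve()


print(main(["-backfill"]))  # backfill
print(main([]))             # serve
```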

* remove certificate chart from beta workflow (#133)

* Update Chart.yaml to version 1.0.0-beta-7

* add back missing packaging (#134)

* add upgrade command to beta-7 release notes (#135)

* CP-24743: allow all resources to use imagePullSecrets (#136)

* CP-24743: add imagePullSecrets to cert job

* Update Chart.yaml to version 1.0.0-beta-8

* CP-24792: allow more configurable settings, increase default remote write timeout (#137)

* CP-24792: allow more configurable settings, increase default remote write timeout

* CP-24792: add KSM image info for easy identification of images to mirror for private image registries (#139)

* CP-24792: add KSM image info for easy identification of images to mirror to private repos

* add template command for finding images

---------

Co-authored-by: Becki Lee <becki.lee@cloudzero.com>

* CP-24833: template KSM service address using the release name (#140)

* Update Chart.yaml to version 1.0.0-beta-9

* CP-24886: ensure KSM service and KSM target always match (#143)

* CP-24886: ensure ksm svc and target match

* Update NOTES.txt

---------

Co-authored-by: Thomas Evans <teevans@users.noreply.github.com>

* Update Chart.yaml to version 1.0.0-beta-10

* Add server.agentMode boolean configuration option

This just provides a convenient way to toggle agent mode on/off for
debugging, which is valuable since agent mode disables a *lot* of
Prometheus functionality which can be very useful for debugging, such
as the /graph endpoint.

* Add metric_relabel_configs to insights controller scrape job.

This should just restrict the metrics to those we're interested in,
as defined in values.yaml.
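
A `metric_relabel_configs` entry that restricts a job to a known list of metrics is typically a single `keep` rule on the `__name__` label. The sketch below builds such a rule from a list of names; the metric names are hypothetical examples, and the real list lives in values.yaml.

```python
import re
from typing import List


def keep_rule(metric_names: List[str]) -> dict:
    """Build a relabel entry that keeps only the named metrics."""
    # Metric names contain only letters, digits, "_" and ":", so they can be
    # joined into an alternation without escaping.
    return {
        "source_labels": ["__name__"],
        "regex": "|".join(metric_names),
        "action": "keep",
    }


def would_keep(rule: dict, metric_name: str) -> bool:
    # Prometheus anchors relabel regexes at both ends, hence fullmatch.
    return re.fullmatch(rule["regex"], metric_name) is not None


# Hypothetical metric names, for illustration only.
rule = keep_rule(["example_requests_total", "example_errors_total"])
print(rule["regex"])
print(would_keep(rule, "example_requests_total"))
print(would_keep(rule, "example_requests_total_extra"))
```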

* CP-23129: add Prometheus scrape job to scrape metrics from itself

I also switched from a hardcoded value to using
`prometheusConfig.scrapeJobs.kubeStateMetrics.scrapeInterval` for the
KSM job scrape_interval. This seems to pretty clearly be the intent
of the configuration option, but it was not being used. Notably, this
increases the interval from 1m to 2m.

* [CP-24912] use image tag and chart name in init job name (#144)

* always use insightsController image reference in init scrape job name

* CP-24655: use backfill instead of scrape for init job that gathers existing state (#145)

* CP-25115: add release notes for 1.0.0-rc1 release (#147)

* CP-24655: add release notes for RC1

* fix main chart release in rel branch (#151)

* CP-25165: allow user to choose release branch in main chart release (#152)

* CP-25165: checkout given branch (#153)

* CP-25165: checkout given branch in correct order (#154)

* CP-25165: checkout the input branch, not main (#155)

* Basic install success message. (#149)

* Update charts/cloudzero-agent/Chart.yaml

Co-authored-by: JB <josephbarnett@users.noreply.github.com>

* CP-25270: prepare release/1.0.0 for merging (#158)

* update docs, remove cert-manager references from test, add missing quote

---------

Co-authored-by: josephbarnett <joe.barnett@cloudzero.com>
Co-authored-by: Automated CZ Release <ops@cloudzero.com>
Co-authored-by: bdrennz <146774453+bdrennz@users.noreply.github.com>
Co-authored-by: Becki Lee <becki.lee@cloudzero.com>
Co-authored-by: JB <josephbarnett@users.noreply.github.com>
Co-authored-by: evan-cz <evan.nemerson@cloudzero.com>
Co-authored-by: Thomas Evans <teevans@users.noreply.github.com>

5 participants