CP-26005: add annotations to init-cert, add better logic for updating certificates #167

Merged
merged 12 commits into from
Feb 17, 2025
20 changes: 20 additions & 0 deletions charts/cloudzero-agent/docs/releases/1.0.0-rc4.md
@@ -0,0 +1,20 @@
## [Release 1.0.0-rc4](https://github.com/Cloudzero/cloudzero-agent/compare/v1.0.0-rc3...v1.0.0-rc4) (2025-02-16)

This release improves the certificate initialization Job so that it can detect and correct more invalid states. Additionally, annotations can now be added to both initialization Jobs, and the expiration (TTL) of both initialization Jobs is now configurable.

### Upgrade Steps

Upgrade using the following command:
```console
helm upgrade --install <RELEASE_NAME> cloudzero/cloudzero-agent -n <NAMESPACE> --create-namespace -f configuration.example.yaml --version 1.0.0-rc4
```

See [upgrades.md](../upgrades.md) for full documentation of upgrade behavior as it relates to initialization Jobs.

### Improvements
* **Certificate Initialization Job Checks For More Invalid Conditions:** The certificate initialization Job now checks for certificates with invalid SAN settings, mismatched `caBundle` values across webhook configurations, and mismatches between the webhook `caBundle` value and the `ca.crt` value in the TLS secret.

* **Automatic Job Cleanup Configuration:** TTL for both initialization Jobs is now configurable, and defaults to 180 seconds.

* **Initialization Job Annotation Support:** Both initialization Jobs allow the user to set annotations. This was specifically added to make management via ArgoCD easier, as ArgoCD will consider expired Jobs to be OutOfSync with the release source. See [upgrades.md](../upgrades.md) for details on recommended annotations.
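As a rough illustration of the TTL setting described above, the sketch below builds the `--set` flags that apply one TTL to both initialization Jobs. The helper name `job_ttl_flags` is hypothetical and not part of the chart. The value keys are taken from the `values.yaml` change in this PR, on the assumption that the backfill template resolves its TTL from `initBackfillJob`.

```shell
#!/bin/sh
# Hypothetical helper (not part of the chart): print the --set flags that
# apply one TTL, in seconds, to both initialization Jobs.
job_ttl_flags() {
  # Reject anything that is not a non-negative integer.
  case "$1" in
    ''|*[!0-9]*)
      echo "TTL must be a non-negative integer, got: '$1'" >&2
      return 1
      ;;
  esac
  printf -- '--set initBackfillJob.ttlSecondsAfterFinished=%s ' "$1"
  printf -- '--set initCertJob.ttlSecondsAfterFinished=%s\n' "$1"
}

# Example (substitution left unquoted on purpose so the flags split into words):
#   helm upgrade --install <RELEASE_NAME> cloudzero/cloudzero-agent \
#     -n <NAMESPACE> -f configuration.example.yaml --version 1.0.0-rc4 \
#     $(job_ttl_flags 600)
```

Setting the same keys in a values file should have the same effect. The flag form is shown only because it is easy to try on a single upgrade.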

34 changes: 26 additions & 8 deletions charts/cloudzero-agent/docs/upgrades.md
Expand Up @@ -4,17 +4,35 @@ This document outlines the expected behavior of the **CloudZero Agent Jobs** und

## **Jobs Overview**
The Helm chart deploys two Jobs:
- **`backfill`**: Runs during every version upgrade or container image change. It ensures that the current state of the cluster is captured.
- **`init-cert`**: Generates or renews the internal certificate needed for communication between Kubernetes and the webhook server.
- **`backfill`**: Ensures that the current state of the cluster is captured and uploaded to the CloudZero platform.
- **`init-cert`**: Generates or renews the internal certificate needed for communication between Kubernetes and the webhook server.

Both the **`backfill`** and **`init-cert`** Jobs expire after a configurable period of time, ensuring that re-initialization can occur on changes to the chart.

---

## **Upgrade Scenarios and Job Behavior**
| Upgrade Scenario | `backfill` Job Behavior | `init-cert` Job Behavior |
|-----------------------------|------------------------|--------------------------|
| **Standard version upgrade** | Runs on every version/image upgrade | Runs every upgrade but only generates a certificate if needed |
| **Standard version upgrade** | Runs on every version/image upgrade | Runs on every version/image upgrade |
| **Forced upgrade (`--force`)** | Runs again after the Job is automatically deleted | Always runs and ensures a new certificate is created |
| **Upgrade Without Chart Version Change** | Does not run | Runs again, but does not regenerate certificate |
| **Upgrade Without Chart Version Change** | Runs again if Job TTL has expired | Runs again if Job TTL has expired |


---
## **ArgoCD Integration**
If installing this Helm chart using ArgoCD, set the following annotations in the `initBackfillJob` and `initCertJob` fields to ensure that ArgoCD does not constantly consider the Application out of sync:
```yaml
initBackfillJob:
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
initCertJob:
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
```
See the ArgoCD [Hook Deletion](https://argo-cd.readthedocs.io/en/stable/user-guide/resource_hooks/#hook-deletion-policies) documentation for further details.

---
## **Common Issues & Troubleshooting**
Expand All @@ -28,7 +46,7 @@ Error: UPGRADE FAILED: failed to replace object: Job.batch "cloudzero-agent-back
```
**Solution:**
1. **Wait for the `backfill` Job to complete**
- The Job will be **automatically deleted after 60 seconds** (`ttlSecondsAfterFinished`).
- The Job will be **automatically deleted after 180 seconds** (`ttlSecondsAfterFinished`).
- Once the Job is removed, retry the Helm upgrade.

2. **Manually delete the running Jobs and retry the upgrade**
Expand All @@ -40,8 +58,8 @@ Error: UPGRADE FAILED: failed to replace object: Job.batch "cloudzero-agent-back

---
## **Implementation Notes**
1. **`backfill` Job Cleanup:** The Job includes a `ttlSecondsAfterFinished: 60` to automatically remove itself.
2. **`init-cert` Job Cleanup:** Uses `ttlSecondsAfterFinished: 3600` for cleanup.
1. **`backfill` Job Cleanup:** The Job includes a `ttlSecondsAfterFinished: 180` to automatically remove itself.
2. **`init-cert` Job Cleanup:** Uses `ttlSecondsAfterFinished: 180` for cleanup.
3. **Deployment Rollout:** The `init-cert` Job includes a mechanism to force a Deployment restart when it completes.

---

26 changes: 24 additions & 2 deletions charts/cloudzero-agent/templates/_helpers.tpl
Expand Up @@ -130,6 +130,17 @@ Create the name of the ClusterRoleBinding to use for the init-cert Job
{{ .Values.initCertJob.rbac.clusterRoleBinding | default $defaultName }}
{{- end -}}

{{/*
init-cert Job annotations
*/}}
{{- define "cloudzero-agent.initCertJob.annotations" -}}
{{- if .Values.initCertJob.annotations -}}
annotations:
{{- toYaml .Values.initCertJob.annotations | nindent 2 -}}
{{- end -}}
{{- end -}}


{{/*
Create a fully qualified Prometheus server name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
Expand Down Expand Up @@ -396,12 +407,23 @@ Name for the backfill job resource
{{- printf "%s-%s" $name ($imageRef | trunc 6) | trunc 61 | replace "." "-" -}}
{{- end }}

{{/*
initBackfillJob Job annotations
*/}}
{{- define "cloudzero-agent.initBackfillJob.annotations" -}}
{{- if .Values.initBackfillJob.annotations -}}
annotations:
{{- toYaml .Values.initBackfillJob.annotations | nindent 2 -}}
{{- end -}}
{{- end -}}

{{/*
Name for the certificate init job resource. Should be a new name each installation/upgrade.
*/}}
{{- define "cloudzero-agent.initCertJobName" -}}
{{- $name := (printf "%s-init-cert" (include "cloudzero-agent.insightsController.server.webhookFullname" .) | trunc 60) -}}
{{- $name -}}-{{ .Release.Revision | default (randAlpha 5) }}
{{ $version := .Chart.Version | replace "." "-" }}
{{- $name := (printf "%s-init-cert-%s" (include "cloudzero-agent.insightsController.server.webhookFullname" .) $version | trunc 60) -}}
{{- $name -}}-{{ .Release.Revision }}
{{- end }}

{{/*
Expand Down
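To make the new `cloudzero-agent.initCertJobName` logic easier to follow, here is a small shell sketch of the same string handling: dots in the chart version become dashes, the prefix is truncated to 60 characters, and the release revision is appended. The function name `init_cert_job_name` and the sample inputs are hypothetical. This only mirrors the template shown in the diff and is not how Helm itself evaluates it.

```shell
#!/bin/sh
# Hypothetical mirror of the Helm helper, for illustration only.
# $1 = webhook full name, $2 = chart version, $3 = release revision
init_cert_job_name() (
  # Equivalent of: .Chart.Version | replace "." "-"
  version=$(printf '%s' "$2" | tr '.' '-')
  # Equivalent of: printf "%s-init-cert-%s" ... | trunc 60
  prefix=$(printf '%s-init-cert-%s' "$1" "$version" | cut -c1-60)
  # The revision is appended after truncation, so it is never cut off.
  printf '%s-%s\n' "$prefix" "$3"
)

# Example:
#   init_cert_job_name cz-webhook 1.0.0-rc4 7
```

Because the revision is appended after truncation, the final name can be longer than 60 characters. The name also no longer falls back to a random suffix, so it is stable for a given chart version and revision.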
43 changes: 33 additions & 10 deletions charts/cloudzero-agent/templates/init-job.yaml
Expand Up @@ -6,10 +6,11 @@ kind: Job
metadata:
  name: {{ include "cloudzero-agent.initBackfillJobName" . }}
  namespace: {{ .Release.Namespace }}
  {{- include "cloudzero-agent.initBackfillJob.annotations" . | nindent 2 }}
  labels:
    {{- include "cloudzero-agent.insightsController.labels" . | nindent 4 }}
spec:
  ttlSecondsAfterFinished: 60
  ttlSecondsAfterFinished: {{ $backFillValues.ttlSecondsAfterFinished }}
  template:
    metadata:
      name: {{ include "cloudzero-agent.initBackfillJobName" . }}
Expand Down Expand Up @@ -85,10 +86,11 @@ kind: Job
metadata:
  name: {{ include "cloudzero-agent.initCertJobName" . }}
  namespace: {{ .Release.Namespace }}
  {{- include "cloudzero-agent.initCertJob.annotations" . | nindent 2 }}
  labels:
    {{- include "cloudzero-agent.insightsController.labels" . | nindent 4 }}
spec:
  ttlSecondsAfterFinished: 3600
  ttlSecondsAfterFinished: {{ .Values.initCertJob.ttlSecondsAfterFinished }}
  template:
    metadata:
      name: {{ include "cloudzero-agent.initCertJobName" . }}
Expand All @@ -109,27 +111,48 @@ spec:
set -e

{{- if not .Values.insightsController.tls.useCertManager }}
# Determine if the ValidatingWebhookConfiguration resources already have caBundle information
MISSING_CA_BUNDLE=false
GENERATE_CERTIFICATE=false

# Check if the caBundle in the ValidatingWebhookConfiguration is the same for all webhooks
caBundles=()
{{- range $configType, $configs := .Values.insightsController.webhooks.configurations }}
{{- $webhookName := printf "%s-%s" (include "cloudzero-agent.validatingWebhookConfigName" $) $configType }}
{{- if or (index $.Values.insightsController.labels.resources $configType) (index $.Values.insightsController.annotations.resources $configType) }}
CA_BUNDLE=$(kubectl get validatingwebhookconfiguration {{ $webhookName }} -o jsonpath='{.webhooks[0].clientConfig.caBundle}')
if [[ -z "$CA_BUNDLE" ]]; then
MISSING_CA_BUNDLE=true
fi
caBundles+=($(kubectl get validatingwebhookconfiguration {{ $webhookName }} -o jsonpath='{.webhooks[0].clientConfig.caBundle}'))
{{- end }}
{{- end }}

CA_BUNDLE=${caBundles[0]}
for caBundle in "${caBundles[@]}"; do
    if [[ "$caBundle" != "$CA_BUNDLE" ]]; then
        echo "Mismatch found between ValidatingWebhookConfiguration caBundle values."
        GENERATE_CERTIFICATE=true
    fi
done

SECRET_NAME={{ include "cloudzero-agent.tlsSecretName" . }}
NAMESPACE={{ .Release.Namespace }}

EXISTING_TLS_CRT=$(kubectl get secret $SECRET_NAME -n $NAMESPACE -o jsonpath='{.data.tls\.crt}')
EXISTING_TLS_KEY=$(kubectl get secret $SECRET_NAME -n $NAMESPACE -o jsonpath='{.data.tls\.key}')

if [[ -n "$EXISTING_TLS_CRT" ]]; then
    # Check if the SANs in the certificate match the service name
    SAN=$(echo "$EXISTING_TLS_CRT" | base64 -d | openssl x509 -text -noout | grep DNS | sed 's/.*DNS://')
    if [[ "$SAN" != "{{ include "cloudzero-agent.serviceName" . }}.{{ .Release.Namespace }}.svc" ]]; then
        echo "The SANs in the certificate do not match the service name."
        GENERATE_CERTIFICATE=true
    fi
    # Check that caBundle and tls.crt are the same.
    # The right-hand side is quoted so it is compared literally, not as a pattern.
    if [[ "$CA_BUNDLE" != "$EXISTING_TLS_CRT" ]]; then
        echo "The caBundle in the ValidatingWebhookConfiguration does not match the tls.crt in the TLS Secret."
        GENERATE_CERTIFICATE=true
    fi
fi

# Check if the TLS Secret already has certificate information
if [[ -z "$EXISTING_TLS_CRT" ]] || [[ -z "$EXISTING_TLS_KEY" ]] || [[ $MISSING_CA_BUNDLE == "true" ]] ; then
    echo "The TLS Secret and/or at least one webhook configuration contains empty certificate information, or forceInit is enabled. Creating a new certificate..."
if [[ -z "$EXISTING_TLS_CRT" ]] || [[ -z "$EXISTING_TLS_KEY" ]] || [[ $GENERATE_CERTIFICATE == "true" ]] ; then
    echo "The TLS Secret and/or at least one webhook configuration contains empty certificate information, or the certificate is invalid/expired. Creating a new certificate..."
else
    echo "The TLS Secret and all webhook configurations contain non-empty certificate information. Will not create a new certificate and will not patch resources."
    exit 0
Expand Down
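The two new checks in the script above (all `caBundle` values agree, and the certificate SAN matches the service name) can be sketched as small standalone helpers that are testable without a cluster. All three function names are hypothetical and are not part of the chart. Note one difference from the script: its `sed 's/.*DNS://'` is greedy, so on a line with several `DNS:` entries it keeps only the last one, whereas this sketch lists every entry. Whether the chart's certificates ever carry more than one SAN is not stated in this PR.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Succeed when every argument is identical (also when fewer than two are given).
all_equal() {
  first="${1-}"
  for value in "$@"; do
    [ "$value" = "$first" ] || return 1
  done
  return 0
}

# Read `openssl x509 -text -noout` output on stdin and print each DNS SAN
# on its own line.
extract_dns_sans() {
  grep -o 'DNS:[^,[:space:]]*' | sed 's/^DNS://'
}

# Succeed when the expected name ($1) is exactly one of the DNS SANs on stdin.
cert_has_san() {
  extract_dns_sans | grep -qxF -- "$1"
}

# Example, assuming the decoded certificate is in tls.crt:
#   openssl x509 -text -noout -in tls.crt | cert_has_san "my-svc.my-ns.svc"
```

Matching whole lines with `grep -x` avoids treating a name as valid merely because it is a prefix of an actual SAN.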
2 changes: 2 additions & 0 deletions charts/cloudzero-agent/values.yaml
Expand Up @@ -157,6 +157,7 @@ initBackfillJob:
# tag: 0.1.1
# pullPolicy: Always
enabled: true
ttlSecondsAfterFinished: 180

# -- This is a deprecated field that is replaced by initBackfillJob. However, the fields are identical, and initScrapeJob can still be used to configure the backFill/scrape Job.
# initScrapeJob:
Expand All @@ -181,6 +182,7 @@ initCertJob:
serviceAccountName: ""
clusterRoleName: ""
clusterRoleBindingName: ""
ttlSecondsAfterFinished: 180

kubeStateMetrics:
enabled: true
Expand Down