CP-19493: update chart deployment for new tool #50

@josephbarnett josephbarnett commented Jul 7, 2024

Description

  • Remove old validator code, since the new Go-based image will be used
  • Remove old, unused GitHub Actions workflows and workflow steps
  • Create new ConfigMap for validator steps (initContainer, lifecycle.postStart, lifecycle.preStop)
  • Update deployment steps to use new validator utility (initContainer, lifecycle.postStart, lifecycle.preStop)
  • Update guide for CICD
  • Update guide for troubleshooting

Pre-Dependencies

Testing

  • This change updates test coverage for new/changed/fixed functionality

Checklist

  • I have added documentation for new/changed functionality in this PR
  • All active GitHub checks for tests, formatting, and security are passing
  • The correct base branch is being used, if not main

Manual deployment testing

  1. Deploy the chart with the kube-state-metrics child chart enabled and the service endpoint of an existing prometheus-node-exporter deployment
$ cd charts/cloudzero-agent
$ helm install cloudzero-agent . \
  --namespace $NS \
  --set=existingSecretName=api-token \
  --set=clusterName=jb-test-cluster \
  --set=cloudAccountId=00000000 \
  --set=region=us-east-1 \
  --set=kube-state-metrics.enabled=true \
  --set=validator.serviceEndpoints.prometheusNodeExporter=node-exporter.monitoring.svc.cluster.local:9100
NAME: cloudzero-agent
LAST DEPLOYED: Wed Jul 10 00:59:23 2024
NAMESPACE: cloudzero-agent
STATUS: deployed
REVISION: 1
TEST SUITE: None
  2. List the pods to check the status of the cloudzero-agent-server pod
$ kubectl -n $NS get pods
NAME                                                  READY   STATUS     RESTARTS   AGE
cloudzero-agent-kube-state-metrics-6579b9786b-5hm4h   1/1     Running    0          16s
cloudzero-agent-server-6fff4f4d5f-9rzvg               0/2     Init:0/1   0          16s
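
For anyone repeating this test in a script, the pod listing above can be checked programmatically. The following is only a minimal sketch, assuming the default `kubectl get pods` table layout shown above; `parse_pods` and `still_initializing` are hypothetical helper names, not part of the chart or the validator.

```python
# Sketch: parse the default `kubectl get pods` table and report pods that are
# still running init containers. Helper names are hypothetical.

SAMPLE = """\
NAME                                                  READY   STATUS     RESTARTS   AGE
cloudzero-agent-kube-state-metrics-6579b9786b-5hm4h   1/1     Running    0          16s
cloudzero-agent-server-6fff4f4d5f-9rzvg               0/2     Init:0/1   0          16s
"""


def parse_pods(table):
    """Return one dict per pod, keyed by the column headers of the table."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    pods = []
    for ln in lines[1:]:
        # Whitespace splitting is enough for NAME, READY and STATUS. A RESTARTS
        # value such as "1 (5m ago)" would shift the later columns, which this
        # sketch does not use.
        row = dict(zip(header, ln.split()))
        ready, total = row["READY"].split("/")
        row["ready"] = int(ready)
        row["total"] = int(total)
        pods.append(row)
    return pods


def still_initializing(pods):
    """Names of pods whose STATUS shows init containers still in progress."""
    return [p["NAME"] for p in pods if p["STATUS"].startswith("Init:")]


print(still_initializing(parse_pods(SAMPLE)))
```

With the listing above, only the cloudzero-agent-server pod is reported, which matches the `Init:0/1` status while the env-validator init container is running.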
  3. Dump the env-validator logs
$ kubectl -n $NS logs -f -c env-validator cloudzero-agent-server-6fff4f4d5f-9rzvg | jq
{
  "account": "00000000",
  "region": "us-east-1",
  "name": "jb-test-cluster",
  "state": "STATUS_TYPE_INIT_OK",
  "validatorVersion": "cloudzero-agent-validator.5e5f55b.refs/tags/v0.1.0-2024-07-10T03:43:35Z",
  "checks": [
    {
      "name": "egress_reachable",
      "passing": true
    },
    {
      "name": "api_key_valid",
      "passing": true
    }
  ]
}
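
The report above can also be checked without reading it by eye. This is a rough sketch, assuming only the JSON shape printed above (a `state` string and a `checks` list of `name`/`passing` pairs); `failing_checks` is a hypothetical helper, not something the validator ships.

```python
import json

# The init report as printed by the env-validator container above.
INIT_REPORT = json.loads("""
{
  "account": "00000000",
  "region": "us-east-1",
  "name": "jb-test-cluster",
  "state": "STATUS_TYPE_INIT_OK",
  "checks": [
    {"name": "egress_reachable", "passing": true},
    {"name": "api_key_valid", "passing": true}
  ]
}
""")


def failing_checks(report):
    """Names of checks that are not passing; a missing 'passing' key counts as failing."""
    return [c["name"] for c in report.get("checks", []) if not c.get("passing", False)]


failed = failing_checks(INIT_REPORT)
print("state:", INIT_REPORT["state"])
print("failing checks:", failed if failed else "none")
```

Treating a missing `passing` key as a failure is a deliberate choice in this sketch, so that a malformed report is not silently accepted.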
  4. Check the cloudzero-agent-validator log inside the cloudzero-agent-server container of the cloudzero-agent-server pod
$ LOG=$(kubectl -n $NS exec -ti -c cloudzero-agent-server cloudzero-agent-server-6fff4f4d5f-9rzvg -- sh -c 'cat cloudzero-agent-validator.log')
$ echo "$LOG" | jq
{
  "level": "info",
  "log_sequence": 1,
  "msg": "reporting status",
  "report": {
    "account": "00000000",
    "region": "us-east-1",
    "name": "jb-test-cluster",
    "state": "STATUS_TYPE_POD_STARTED",
    "scrapeConfig": "# my global config\nglobal:\n  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.\n  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.\n  # scrape_timeout is set to the global default (10s).\n\n# Alertmanager configuration\nalerting:\n  alertmanagers:\n    - static_configs:\n        - targets:\n          # - alertmanager:9093\n\n# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.\nrule_files:\n  # - \\"first_rules.yml\\"\n  # - \\"second_rules.yml\\"\n\n# A scrape configuration containing exactly one endpoint to scrape:\n# Here it's Prometheus itself.\nscrape_configs:\n  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.\n  - job_name: \\"prometheus\\"\n\n    # metrics_path defaults to '/metrics'\n    # scheme defaults to 'http'.\n\n    static_configs:\n      - targets: [\\"localhost:9090\\"]\n\nglobal:\n  scrape_interval: 60s\nscrape_configs:\n  - job_name: cloudzero-service-endpoints # kube_*, node_* metrics\n    honor_labels: true\n    honor_timestamps: true\n    track_timestamps_staleness: false\n    scrape_interval: 1m\n    scrape_timeout: 10s\n    scrape_protocols:\n    - OpenMetricsText1.0.0\n    - OpenMetricsText0.0.1\n    - PrometheusText0.0.4\n    metrics_path: /metrics\n    scheme: http\n    enable_compression: true\n    follow_redirects: true\n    enable_http2: true\n    relabel_configs:\n    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]\n      separator: ;\n      regex: \\"true\\"\n      replacement: $1\n      action: keep\n    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape_slow]\n      separator: ;\n      regex: \\"true\\"\n      replacement: $1\n      action: drop\n    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]\n      separator: ;\n      regex: (https?)\n      
target_label: __scheme__\n      replacement: $1\n      action: replace\n    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]\n      separator: ;\n      regex: (.+)\n      target_label: __metrics_path__\n      replacement: $1\n      action: replace\n    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]\n      separator: ;\n      regex: (.+?)(?::\\d+)?;(\\d+)\n      target_label: __address__\n      replacement: $1:$2\n      action: replace\n    - separator: ;\n      regex: __meta_kubernetes_service_annotation_prometheus_io_param_(.+)\n      replacement: __param_$1\n      action: labelmap\n    - separator: ;\n      regex: __meta_kubernetes_service_label_(.+)\n      replacement: $1\n      action: labelmap\n    - source_labels: [__meta_kubernetes_namespace]\n      separator: ;\n      regex: (.*)\n      target_label: namespace\n      replacement: $1\n      action: replace\n    - source_labels: [__meta_kubernetes_service_name]\n      separator: ;\n      regex: (.*)\n      target_label: service\n      replacement: $1\n      action: replace\n    - source_labels: [__meta_kubernetes_pod_node_name]\n      separator: ;\n      regex: (.*)\n      target_label: node\n      replacement: $1\n      action: replace\n    metric_relabel_configs:\n    - source_labels: [__name__]\n      regex: \\"^(kube_node_info|kube_node_status_capacity|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_labels|kube_pod_info|node_dmi_info)$\\"\n      action: keep\n    - action: labelkeep\n      regex: \\"^(board_asset_tag|container|created_by_kind|created_by_name|image|instance|name|namespace|node|node_kubernetes_io_instance_type|pod|product_name|provider_id|resource|unit|uid|_.*|label_.*)$\\"\n  kubernetes_sd_configs:\n  - role: endpoints\n    kubeconfig_file: \\"\\"\n    follow_redirects: true\n    enable_http2: true\n  - job_name: cloudzero-nodes-cadvisor # container_* metrics\n    honor_timestamps: true\n    
track_timestamps_staleness: false\n    scrape_interval: 1m\n    scrape_timeout: 10s\n    scrape_protocols:\n    - OpenMetricsText1.0.0\n    - OpenMetricsText0.0.1\n    - PrometheusText0.0.4\n    metrics_path: /metrics\n    scheme: https\n    enable_compression: true\n    authorization:\n      type: Bearer\n      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token\n    tls_config:\n      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt\n      insecure_skip_verify: true\n    follow_redirects: true\n    enable_http2: true\n    relabel_configs:\n    - separator: ;\n      regex: __meta_kubernetes_node_label_(.+)\n      replacement: $1\n      action: labelmap\n    - separator: ;\n      regex: (.*)\n      target_label: __address__\n      replacement: kubernetes.default.svc:443\n      action: replace\n    - source_labels: [__meta_kubernetes_node_name]\n      separator: ;\n      regex: (.+)\n      target_label: __metrics_path__\n      replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor\n      action: replace\n    - source_labels: [__meta_kubernetes_node_name]\n      target_label: node\n      action: replace\n    metric_relabel_configs:\n    - action: labelkeep\n      regex: \\"^(board_asset_tag|container|created_by_kind|created_by_name|image|instance|name|namespace|node|node_kubernetes_io_instance_type|pod|product_name|provider_id|resource|unit|uid|_.*|label_.*)$\\"\n    - source_labels: [__name__]\n      regex: \\"^(container_cpu_usage_seconds_total|container_memory_working_set_bytes|container_network_receive_bytes_total|container_network_transmit_bytes_total)$\\"\n      action: keep\n    kubernetes_sd_configs:\n    - role: node\n      kubeconfig_file: \\"\\"\n      follow_redirects: true\n      enable_http2: true\n",
    "validatorVersion": "cloudzero-agent-validator.5e5f55b.refs/tags/v0.1.0-2024-07-10T03:43:35Z",
    "k8sVersion": "1.29",
    "checks": [
      {
        "name": "scrape_cfg",
        "passing": true
      },
      {
        "name": "scrape_cfg",
        "passing": true
      },
      {
        "name": "k8s_version",
        "passing": true
      },
      {
        "name": "kube_state_metrics_reachable",
        "passing": true
      },
      {
        "name": "node_exporter_reachable",
        "passing": true
      }
    ]
  }
}
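
Since the log record wraps the report under a `report` key and embeds the whole scrape configuration as a string, a quick way to confirm which jobs were picked up is to pull the `job_name` entries out of `scrapeConfig`. The sketch below assumes only the record shape shown above; it uses a regular expression rather than a YAML parser to stay dependency-free, the sample record is abbreviated to its job names, and `unwrap_report` and `scrape_job_names` are hypothetical helpers.

```python
import re

# Matches `- job_name: value`, with or without (escaped) quotes around the value.
JOB_RE = re.compile(r'^\s*-\s*job_name:\s*\\?"?([A-Za-z0-9_-]+)', re.MULTILINE)

# Abbreviated stand-in for the log record above: only the job names are kept.
SAMPLE_RECORD = {
    "level": "info",
    "msg": "reporting status",
    "report": {
        "state": "STATUS_TYPE_POD_STARTED",
        "scrapeConfig": (
            "scrape_configs:\n"
            '  - job_name: "prometheus"\n'
            "  - job_name: cloudzero-service-endpoints # kube_*, node_* metrics\n"
            "  - job_name: cloudzero-nodes-cadvisor # container_* metrics\n"
        ),
    },
}


def unwrap_report(record):
    """Return the nested report of a log record, or the record itself if it is not wrapped."""
    return record.get("report", record)


def scrape_job_names(report):
    """Job names found in the embedded scrape configuration, in order of appearance."""
    return JOB_RE.findall(report.get("scrapeConfig", ""))


print(scrape_job_names(unwrap_report(SAMPLE_RECORD)))
```

A regular expression is a shortcut here; a real YAML parser would be the safer choice if the configuration needs more than a name listing.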
  5. Check the live data!

Screenshot 2024-07-10 at 1 14 11 AM

Note: inspecting these records shows the state going from STATUS_TYPE_INIT_OK to STATUS_TYPE_POD_STARTED!

  6. Inspect the cluster record to ensure that we have one, that it has the correct state (STATUS_TYPE_POD_STARTED), and that it has the correct dates (first_created_at, ... vs. state_updated_at)

Screenshot 2024-07-10 at 1 22 48 AM
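
The two manual checks above (state ordering and record dates) could be expressed roughly as follows. This is a sketch under assumptions: the record is taken to be a flat dict with a `state` field and ISO 8601 timestamps, which the screenshots do not confirm, and the sample timestamps are invented for illustration. Only the state names and the `first_created_at` / `state_updated_at` field names come from this PR.

```python
from datetime import datetime

# Order in which the states were observed in this test run.
EXPECTED_ORDER = ["STATUS_TYPE_INIT_OK", "STATUS_TYPE_POD_STARTED"]


def follows_expected_order(states):
    """True if the known states never move backwards; unknown states are ignored."""
    ranks = [EXPECTED_ORDER.index(s) for s in states if s in EXPECTED_ORDER]
    return ranks == sorted(ranks)


def record_is_consistent(record, expected_state="STATUS_TYPE_POD_STARTED"):
    """True if the record has the expected state and was not updated before it was created."""
    created = datetime.fromisoformat(record["first_created_at"])
    updated = datetime.fromisoformat(record["state_updated_at"])
    return record["state"] == expected_state and created <= updated


# Hypothetical record; the timestamps are illustrative only.
sample = {
    "state": "STATUS_TYPE_POD_STARTED",
    "first_created_at": "2024-07-10T05:00:00+00:00",
    "state_updated_at": "2024-07-10T05:01:00+00:00",
}
print(follows_expected_order(["STATUS_TYPE_INIT_OK", "STATUS_TYPE_POD_STARTED"]))
print(record_is_consistent(sample))
```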

@josephbarnett josephbarnett marked this pull request as ready for review July 8, 2024 15:44
@josephbarnett josephbarnett requested a review from a team as a code owner July 8, 2024 15:44
@josephbarnett josephbarnett changed the title CP-19492: update chart deployment for new tool CP-19493: update chart deployment for new tool Jul 8, 2024

@dmepham dmepham left a comment


LGTM! I just have minor comments. The only thing I would suggest is adding an entry to the changelog, though I know you had mentioned a GitHub Actions step for automating that?

charts/cloudzero-agent/CHANGELOG.md (resolved)
charts/cloudzero-agent/values.yaml (outdated, resolved)
@josephbarnett josephbarnett merged commit e850e3e into Cloudzero:develop Jul 10, 2024
2 checks passed
@josephbarnett josephbarnett deleted the cp-19492-update-chart-deployment-for-new-tool branch July 10, 2024 14:27