Atlas testing of AWS v20 together with Vintage-CAPA migration #3209


Closed · 3 tasks done
T-Kukawka opened this issue Feb 1, 2024 · 17 comments



T-Kukawka commented Feb 1, 2024

The time has come to start testing the final releases as well as the migration from Vintage v20 to CAPA. We have created a dedicated Vintage MC, garfish, for all vintage and migration testing, to guarantee stability. The dedicated CAPA cluster for migration will be the CAPA stable-testing MC, grizzly.

We kindly ask all teams to perform comprehensive tests for three use-cases, ordered by priority in case they cannot all be performed at once.

1. Vintage AWS v20

Cluster creation on garfish - giantswarm Organization

This is the last release of Vintage, containing Kubernetes 1.25. Kubernetes 1.25 introduces a breaking change by removing PSPs from its API, meaning that all workloads will have to comply with the global toggle disabling PSPs, as in the 19.3.x release. Before making the v20 release available to customers, we need to validate that all applications run smoothly. The Vintage tests are standard as always: create a v20 cluster and validate your applications (a pre-upgrade audit sketch follows below). A separate, stable MC in this case guarantees no manual changes in the release and overall stability.

  • Atlas finished with Vintage v20 testing (please mark it in main issue as well)
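Before upgrading, it can help to audit what still relies on PSPs. A minimal sketch, assuming kubectl access to the v19.x workload cluster; it relies on the standard kubernetes.io/psp admission annotation:

# List PSPs still present on the 1.24 (v19.x) cluster:
kubectl get podsecuritypolicies

# Pods record the PSP that admitted them in the kubernetes.io/psp annotation;
# anything printed here still depends on a PSP and must comply with the
# global PSP-disable toggle before the v20 upgrade:
kubectl get pods -A \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubernetes\.io/psp}{"\n"}{end}' \
  | awk -F'\t' '$3 != ""'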

2. CAPA 0.60.0

Cluster creation on grizzly - giantswarm Organization - be aware that this is a production MC, so it will page everyone. In practice, any CAPA MC should work for this test.

Starting with cluster-aws-v0.60.0 and default-apps-aws-v0.45.1 onwards, CAPA supports Kubernetes 1.25 with all the features needed to run our workloads in the same manner as on Vintage clusters. For testing, please always use the latest cluster-aws and default-apps-aws releases (see the templating sketch below).

  • Atlas finished with CAPA 1.25 testing (please mark it in main issue as well)
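As referenced above, a minimal sketch of templating such a test cluster with kubectl-gs; the exact flags vary between kubectl-gs versions, so treat them as assumptions rather than the exact invocation:

# Template a CAPA cluster (recent kubectl-gs picks app versions for you):
kubectl gs template cluster \
  --provider capa \
  --organization giantswarm \
  --name atlastest \
  --description "Atlas CAPA 1.25 testing" > cluster.yaml
kubectl apply -f cluster.yaml

# Confirm which cluster-aws / default-apps-aws versions are actually in use:
kubectl get apps -n org-giantswarm | grep -E 'cluster-aws|default-apps-aws'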

3. Vintage to CAPA migration

Cluster creation for migration on garfish - capa-migration-testing Organization. Clusters will be migrated to grizzly - capa-migration-testing Organization.

Phoenix and Honeybadger have worked extensively on making the migration as smooth as possible. The migration-cli has been introduced to orchestrate the migration of apps as well as infrastructure. The main point here is to discover whether your application, and any custom configuration a customer could apply, is migrated properly.

The migration-cli has been extended to facilitate easy testing for all teams at Giant Swarm. Please follow the requirements as well as the procedure described in the tests section of the tool. In case of any infrastructure issue, ping Phoenix; if the app/configmap migration shows any issues or inconsistencies, ping Honeybadger. A spot-check sketch follows below.

  • Atlas finished with Vintage to CAPA migration testing (please mark it in main issue as well)
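A quick way to spot-check the result is to diff the App CRs between the two MCs. A sketch, assuming kubectl contexts named garfish and grizzly (hypothetical context names):

kubectl --context garfish -n org-capa-migration-testing get apps \
  -o custom-columns=NAME:.metadata.name,VERSION:.spec.version > before.txt
kubectl --context grizzly -n org-capa-migration-testing get apps \
  -o custom-columns=NAME:.metadata.name,VERSION:.spec.version > after.txt
diff before.txt after.txt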
### Tasks
- [x] prometheus operator and kube-state-metrics
- [x] metrics-server
- [x] loki (with s3 backed storage)
- [x] promtail
- [x] grafana-agent
- [x] fluent-logshipping-app
- [x] grafana
- [x] Test prometheus operator with a different namespace as it appears this is getting set to kube-system automatically
- [ ] https://github.com/giantswarm/roadmap/issues/3249
- [ ] https://github.com/giantswarm/giantswarm/issues/29861

QuentinBisson commented Feb 1, 2024

Regarding vintage 20.0.0, I've tested both creation of a v20 cluster and an upgrade from v19.3.1 to v20 on garfish, and both are working.

I have a bit of a concern regarding the grafana-agent and the logging operator creating its config: I noticed that the secret was not created during my initial tests, but it resolves itself after 12 hours or after a restart of the logging operator, so it is not blocking. @marieroque I recall that you faced something similar in an incident (grafana-agent secret missing or something). Could you take a quick look?
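For anyone hitting the same thing, a sketch of the check and workaround described above; the secret, namespace and deployment names are assumptions and may differ per installation:

# Check whether the grafana-agent config secret exists yet:
kubectl get secret -n kube-system grafana-agent-config \
  || echo "config secret missing"

# Restarting the logging operator regenerates the config immediately
# instead of waiting for the ~12h reconciliation:
kubectl rollout restart deployment -n monitoring logging-operator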

Tested both creation and upgrade

@T-Kukawka as a reminder, we still need to discuss the release notes, as the observability-bundle in release 20.0.0 has a few breaking changes :)

@T-Kukawka

awesome progress ❤️ Release notes are waiting for everyone to finish; you will be pinged when I am back :)

@QuentinBisson

So for confirmation, part 3 is about all our managed apps and possible customer configs :(

@alex-dabija

> So for confirmation, part 3 is about all our managed apps and possible customer configs :(

Yes, because we need to know whether customers can be migrated safely, in order to reduce the risk.

@QuentinBisson

For sure, we will try to cover as many different configs as possible for that.

@QuentinBisson

@giantswarm/team-atlas to make sure we test the migration properly, I would like us to deploy all the apps we have (I know this will be painful) on a 19.3.0 WC on garfish, then upgrade to v20 and then run the migration-cli tool.

In an effort not to have to redo all of this again, maybe we can set up a template to automate as much as possible (sketch below)?

I'm pretty confident the migration will break apps that use IRSA, like Loki, so that's all the more interesting to test.
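Something like the following could serve as that template; a sketch only, as the kubectl-gs flags and the pinned versions (x.y.z placeholders) differ per app and per kubectl-gs version:

# Template App CRs for every Atlas-managed app against the test WC:
for app in prometheus-operator-app metrics-server loki promtail grafana-agent fluent-logshipping-app grafana; do
  # --version must be pinned per app; look it up in the giantswarm catalog first.
  kubectl gs template app \
    --catalog giantswarm \
    --name "$app" \
    --cluster-name atlastest \
    --target-namespace "$app" \
    --version "x.y.z" >> atlas-apps.yaml
done
kubectl apply -f atlas-apps.yaml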


QuentinBisson commented Feb 12, 2024

@giantswarm/team-honeybadger I'm not sure why this happened during the migration phase, but the loki app that I had deployed before the migration was renamed to oki on the workload cluster on gazelle:

Charts on the WC: (screenshot)

Apps on the MC: (screenshot)

Could you investigate why?

Generated app:

apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  labels:
    app-operator.giantswarm.io/version: 6.10.2
    app.kubernetes.io/name: loki
    giantswarm.io/cluster: atlastest
    policy.giantswarm.io/psp-status: disabled
  name: atlastest-oki
  namespace: org-capa-migration-testing
spec:
  catalog: giantswarm
  config:
    configMap:
      name: atlastest-cluster-values
      namespace: org-capa-migration-testing
    secret:
      name: ""
      namespace: ""
  extraConfigs:
  - kind: configMap
    name: atlastest-psp-removal-patch-loki
    namespace: org-capa-migration-testing
    priority: 150
  kubeConfig:
    context:
      name: atlastest-kubeconfig
    inCluster: false
    secret:
      name: atlastest-kubeconfig
      namespace: org-capa-migration-testing
  name: loki
  namespace: loki
  userConfig:
    configMap:
      name: atlastest-oki-user-values
      namespace: org-capa-migration-testing
  version: 0.15.1

@QuentinBisson

This is the issue that is happening for the chart-operator-extensions:
reason: 'object already exists: (rendered manifests contain a resource that already
exists. Unable to continue with install: ServiceMonitor "chart-operator" in
namespace "giantswarm" exists and cannot be imported into the current release:
invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace"
must equal "giantswarm": current value is "kube-system")'
status: already-exists
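The error is Helm 3's adoption check. Since Helm 3.2, an existing resource can be adopted by a release if its ownership metadata is re-pointed, so a possible manual fix is to patch the stray ServiceMonitor before retrying (the release name used here is an assumption):

kubectl annotate servicemonitor chart-operator -n giantswarm \
  meta.helm.sh/release-namespace=giantswarm --overwrite
kubectl annotate servicemonitor chart-operator -n giantswarm \
  meta.helm.sh/release-name=chart-operator-extensions --overwrite
kubectl label servicemonitor chart-operator -n giantswarm \
  app.kubernetes.io/managed-by=Helm --overwrite

Alternatively, simply deleting the old ServiceMonitor and letting the new release recreate it works too.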

@QuentinBisson

@T-Kukawka once the loki -> oki issue is fixed on the Honeybadger side, I think Atlas would only have to redo the tests with Loki and these two items:

(screenshot of the two remaining checklist items)

The fluent-logshipping-app change will be the main issue, I think.

QuentinBisson self-assigned this Feb 12, 2024

QuentinBisson commented Feb 13, 2024

Second issue for @giantswarm/team-honeybadger: user-values configmaps are not transferred when they are set on a default app.

On my garfish WC, I have this set by the cluster-operator on the app with the app.kubernetes.io/name=observability-bundle label:

  userConfig:
    configMap:
      name: atlastest-observability-bundle-user-values
      namespace: atlastest

But on the gazelle MC, this is rendered without the user-values configmap:

spec:
  catalog: default
  config:
    configMap:
      name: atlastest-cluster-values
      namespace: org-capa-migration-testing
    secret:
      name: ""
      namespace: ""
  extraConfigs:
  - kind: configMap
    name: psp-removal-patch
    namespace: org-capa-migration-testing
    priority: 150
  - kind: configMap
    name: atlastest-observability-bundle-logging-extraconfig
    namespace: org-capa-migration-testing
    priority: 25
  - kind: configMap
    name: psp-removal-patch
    namespace: org-capa-migration-testing
    priority: 150
  install: {}
  kubeConfig:
    context:
      name: ""
    inCluster: true
    secret:
      name: ""
      namespace: ""
  name: observability-bundle
  namespace: org-capa-migration-testing
  namespaceConfig: {}
  rollback: {}
  uninstall: {}
  upgrade: {}
  userConfig:
    configMap:
      name: ""
      namespace: ""
    secret:
      name: ""
      namespace: ""
  version: 1.2.1

I would expect them to be added to the app or to the default-apps-aws user values, but it is empty:

k get cm -n org-capa-migration-testing atlastest-default-apps-userconfig -oyaml
apiVersion: v1
data:
  values: |
    clusterName: atlastest
    organization: capa-migration-testing
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"values":"clusterName: atlastest\norganization: capa-migration-testing\n"},"kind":"ConfigMap","metadata":{"annotations":{},"creationTimestamp":null,"labels":{"giantswarm.io/cluster":"atlastest"},"name":"atlastest-default-apps-userconfig","namespace":"org-capa-migration-testing"}}
  creationTimestamp: "2024-02-13T21:09:53Z"
  labels:
    app-operator.giantswarm.io/watching: "true"
    giantswarm.io/cluster: atlastest
  name: atlastest-default-apps-userconfig
  namespace: org-capa-migration-testing
  resourceVersion: "165069994"
  uid: 867a30ee-713b-4338-8c08-168389f9c5e6
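A quick way to demonstrate the gap is to compare the rendered userConfig on both sides; the kubectl context and App CR names here are guesses based on the snippets above:

# Source (vintage) MC: the bundle app carries a user-values configmap
kubectl --context garfish -n atlastest get app observability-bundle \
  -o jsonpath='{.spec.userConfig.configMap.name}{"\n"}'
# -> atlastest-observability-bundle-user-values

# Target MC: the migrated app comes back with an empty userConfig
kubectl --context gazelle -n org-capa-migration-testing get app observability-bundle \
  -o jsonpath='{.spec.userConfig.configMap.name}{"\n"}'
# -> (empty)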

@ljakimczuk

Hey @QuentinBisson. @nce, who dealt with the migration, is on vacation, but to my knowledge the migration of default apps is not on us. If I get it right, the Observability Bundle, as part of the Default Apps app, should get configured by the CAPI migration CLI.

nce added a commit to giantswarm/app-migration-cli that referenced this issue Feb 17, 2024
giantswarm/roadmap#3209 (comment)
Refactored code for better testing; added regression test
nce added a commit to giantswarm/app-migration-cli that referenced this issue Feb 19, 2024
* Fix app naming bug bc/ of wrong trimming

giantswarm/roadmap#3209 (comment)
Refactored code for better testing; added regression test

* refactor
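The "wrong trimming" in the fix above would explain the loki -> oki rename exactly if the code used a cutset-style trim (TrimLeft/lstrip semantics) instead of a literal prefix trim; a hypothetical illustration:

# Cutset trim: strips every leading character in {a,t,l,e,s,-},
# which also eats the "l" of "loki":
echo "atlastest-loki" | sed 's/^[atles-]*//'    # -> oki

# Prefix trim: strips only the literal cluster prefix:
echo "atlastest-loki" | sed 's/^atlastest-//'   # -> loki

Re-prefixing the cutset-trimmed remainder with the cluster name would then produce atlastest-oki, the App name observed above.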
@QuentinBisson

Loki migration works fine, but I could not test user values. I will run that test tomorrow.

User values for default apps have been successfully migrated. Once the Loki tests have been run, all that's left is to release a new Keda app version to support Kubernetes 1.25 and to add IRSA support to fluent-logshipping-app.

@T-Kukawka

I have made adjustments in the tracking ticket as well as in the teams' tickets regarding the CAPA and migration testing instructions.

TL;DR: Testing of CAPA/Migration is moved from gazelle to grizzly

Initially, gazelle was chosen to test the CAPA migration, as it is a production MC and therefore the most stable one. However, this resulted in unforeseen pages to the kaas-cloud on-call that we would like to limit.

We do acknowledge the pages and actively work on the testing, but such pages are a distraction from operating the clusters to which most teams have migrated their GS production workloads.

Taking all of this into consideration, we have decided that it would be best to move the testing to grizzly, which is a stable-testing installation. The installation primarily runs e2e tests and is treated as stable (no changes on the MCs).
Thanks for understanding, and let us know if something is not working.

@QuentinBisson

All our apps have been tested. Now we need to close #3249 and https://github.com/giantswarm/giantswarm/issues/29861, and we are done.

@QuentinBisson

Prometheus-operator with a PV and a changed namespace has been tested.
Loki with S3-backed storage has been tested.

@QuentinBisson

All we have left is Keda.

@QuentinBisson

So Keda also supports Kubernetes up to 1.25; let's discuss when we need 1.26 support.

github-project-automation bot moved this from Inbox 📥 to Done ✅ in Roadmap Feb 22, 2024