Atlas testing of AWS v20 together with Vintage-CAPA migration #3209


Closed · 3 tasks done
T-Kukawka opened this issue Feb 1, 2024 · 17 comments



T-Kukawka commented Feb 1, 2024

The time has come to start testing the final releases as well as the migration from Vintage v20 to CAPA. We have created a dedicated Vintage MC, garfish, for all vintage and migration testing, to guarantee stability. The dedicated CAPA cluster for migration will be the CAPA stable-testing MC, grizzly.

We kindly ask all teams to perform comprehensive tests for three use-cases, ordered by priority in case they cannot all be performed at once.

1. Vintage AWS v20

Cluster creation on garfish - giantswarm Organization

This is the last release of Vintage, containing Kubernetes 1.25. Kubernetes 1.25 introduces a breaking change by removing PSPs from its API, meaning that all workloads will have to comply with the global toggle disabling PSPs, as in the 19.3.x release. Before making the v20 release available to customers, we need to validate that all applications run smoothly. The Vintage tests are standard as always: create a v20 cluster and validate your applications (a pre-upgrade audit sketch follows below). A separate, stable MC in this case guarantees no manual changes in the release and overall stability.

  • Atlas finished with Vintage v20 testing (please mark it in main issue as well)
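Before upgrading, it can help to audit what still relies on PSPs. A minimal sketch, assuming kubectl access to the v19.x workload cluster; it relies on the standard kubernetes.io/psp admission annotation:

# List PSPs still present on the 1.24 (v19.x) cluster:
kubectl get podsecuritypolicies

# Pods record the PSP that admitted them in the kubernetes.io/psp annotation;
# anything printed here still depends on a PSP and must comply with the
# global PSP-disable toggle before the v20 upgrade:
kubectl get pods -A \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubernetes\.io/psp}{"\n"}{end}' \
  | awk -F'\t' '$3 != ""'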

2. CAPA 0.60.0

Cluster creation on grizzly - giantswarm Organization - be aware that this is a production MC, so it will page everyone. In practice, any CAPA MC should work for this test.

Starting with cluster-aws-v0.60.0 and default-apps-aws-v0.45.1 onwards, CAPA supports Kubernetes 1.25 with all the features needed to run our workloads in the same manner as on Vintage clusters. For testing, please always use the latest cluster-aws and default-apps-aws releases (see the templating sketch below).

  • Atlas finished with CAPA 1.25 testing (please mark it in main issue as well)
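As referenced above, a minimal sketch of templating such a test cluster with kubectl-gs; the exact flags vary between kubectl-gs versions, so treat them as assumptions rather than the exact invocation:

# Template a CAPA cluster (recent kubectl-gs picks app versions for you):
kubectl gs template cluster \
  --provider capa \
  --organization giantswarm \
  --name atlastest \
  --description "Atlas CAPA 1.25 testing" > cluster.yaml
kubectl apply -f cluster.yaml

# Confirm which cluster-aws / default-apps-aws versions are actually in use:
kubectl get apps -n org-giantswarm | grep -E 'cluster-aws|default-apps-aws'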

3. Vintage to CAPA migration

Cluster creation for migration on garfish - capa-migration-testing Organization. Clusters will be migrated to grizzly - capa-migration-testing Organization.

Phoenix and Honeybadger have worked extensively on making the migration as smooth as possible. The migration-cli has been introduced to orchestrate the migration of apps as well as infrastructure. The main point here is to discover whether your application, and any custom configuration a customer could apply, is migrated properly.

The migration-cli has been extended to facilitate easy testing for all teams at Giant Swarm. Please follow the requirements as well as the procedure described in the tests section of the tool. In case of any infrastructure issue, ping Phoenix; if the app/configmap migration shows any issues or inconsistencies, ping Honeybadger. A spot-check sketch follows below.

  • Atlas finished with Vintage to CAPA migration testing (please mark it in main issue as well)
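A quick way to spot-check the result is to diff the App CRs between the two MCs. A sketch, assuming kubectl contexts named garfish and grizzly (hypothetical context names):

kubectl --context garfish -n org-capa-migration-testing get apps \
  -o custom-columns=NAME:.metadata.name,VERSION:.spec.version > before.txt
kubectl --context grizzly -n org-capa-migration-testing get apps \
  -o custom-columns=NAME:.metadata.name,VERSION:.spec.version > after.txt
diff before.txt after.txt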
### Tasks
- [x] prometheus operator and kube-state-metrics
- [x] metrics-server
- [x] loki (with s3 backed storage)
- [x] promtail
- [x] grafana-agent
- [x] fluent-logshipping-app
- [x] grafana
- [x] Test prometheus operator with a different namespace as it appears this is getting set to kube-system automatically
- [ ] https://github.com/giantswarm/roadmap/issues/3249
- [ ] https://github.com/giantswarm/giantswarm/issues/29861

QuentinBisson commented Feb 1, 2024

Regarding vintage 20.0.0, I've tested both creation of a v20 cluster and an upgrade from v19.3.1 to v20 on garfish, and both are working.

I have a bit of a concern regarding the grafana-agent and the logging operator creating its config: I noticed that the secret was not created during my initial tests, but it resolves itself after 12 hours or after a restart of the logging operator, so it is not blocking. @marieroque I recall that you faced something similar in an incident (grafana-agent secret missing or something). Could you take a quick look?
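For anyone hitting the same thing, a sketch of the check and workaround described above; the secret, namespace and deployment names are assumptions and may differ per installation:

# Check whether the grafana-agent config secret exists yet:
kubectl get secret -n kube-system grafana-agent-config \
  || echo "config secret missing"

# Restarting the logging operator regenerates the config immediately
# instead of waiting for the ~12h reconciliation:
kubectl rollout restart deployment -n monitoring logging-operator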

Tested both creation and upgrade

@T-Kukawka as a reminder, we still need to discuss the release notes, as the observability-bundle in release 20.0.0 has a few breaking changes :)

@T-Kukawka

awesome progress ❤️ Release notes are waiting for everyone to finish; you will be pinged when I am back :)

@QuentinBisson

So for confirmation, part 3 is about all our managed apps and possible customer configs :(

@alex-dabija

> So for confirmation, part 3 is about all our managed apps and possible customer configs :(

Yes, because we need to know whether customers can be migrated safely, in order to reduce the risk.

@QuentinBisson

For sure, we will try to cover as many different configs as possible for that.

@QuentinBisson

@giantswarm/team-atlas to make sure we test the migration properly, I would like us to deploy all the apps we have (I know this will be painful) on a 19.3.0 WC on garfish, then upgrade to v20 and then run the migration-cli tool.

In an effort not to have to redo all of this again, maybe we can set up a template to automate as much as possible (sketch below)?

I'm pretty confident the migration will break apps that use IRSA, like Loki, so that's all the more interesting to test.
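Something like the following could serve as that template; a sketch only, as the kubectl-gs flags and the pinned versions (x.y.z placeholders) differ per app and per kubectl-gs version:

# Template App CRs for every Atlas-managed app against the test WC:
for app in prometheus-operator-app metrics-server loki promtail grafana-agent fluent-logshipping-app grafana; do
  # --version must be pinned per app; look it up in the giantswarm catalog first.
  kubectl gs template app \
    --catalog giantswarm \
    --name "$app" \
    --cluster-name atlastest \
    --target-namespace "$app" \
    --version "x.y.z" >> atlas-apps.yaml
done
kubectl apply -f atlas-apps.yaml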


QuentinBisson commented Feb 12, 2024

@giantswarm/team-honeybadger I'm not sure why this happened during the migration phase, but the loki app that I had deployed before the migration was renamed to oki on the workload cluster on gazelle:

Charts on the WC: (screenshot)

Apps on the MC: (screenshot)

Could you investigate why?

Generated app:

apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  labels:
    app-operator.giantswarm.io/version: 6.10.2
    app.kubernetes.io/name: loki
    giantswarm.io/cluster: atlastest
    policy.giantswarm.io/psp-status: disabled
  name: atlastest-oki
  namespace: org-capa-migration-testing
spec:
  catalog: giantswarm
  config:
    configMap:
      name: atlastest-cluster-values
      namespace: org-capa-migration-testing
    secret:
      name: ""
      namespace: ""
  extraConfigs:
  - kind: configMap
    name: atlastest-psp-removal-patch-loki
    namespace: org-capa-migration-testing
    priority: 150
  kubeConfig:
    context:
      name: atlastest-kubeconfig
    inCluster: false
    secret:
      name: atlastest-kubeconfig
      namespace: org-capa-migration-testing
  name: loki
  namespace: loki
  userConfig:
    configMap:
      name: atlastest-oki-user-values
      namespace: org-capa-migration-testing
  version: 0.15.1

@QuentinBisson

This is the issue that is happening for the chart-operator-extensions:
reason: 'object already exists: (rendered manifests contain a resource that already
exists. Unable to continue with install: ServiceMonitor "chart-operator" in
namespace "giantswarm" exists and cannot be imported into the current release:
invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace"
must equal "giantswarm": current value is "kube-system")'
status: already-exists
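The error is Helm 3's adoption check. Since Helm 3.2, an existing resource can be adopted by a release if its ownership metadata is re-pointed, so a possible manual fix is to patch the stray ServiceMonitor before retrying (the release name used here is an assumption):

kubectl annotate servicemonitor chart-operator -n giantswarm \
  meta.helm.sh/release-namespace=giantswarm --overwrite
kubectl annotate servicemonitor chart-operator -n giantswarm \
  meta.helm.sh/release-name=chart-operator-extensions --overwrite
kubectl label servicemonitor chart-operator -n giantswarm \
  app.kubernetes.io/managed-by=Helm --overwrite

Alternatively, simply deleting the old ServiceMonitor and letting the new release recreate it works too.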

@QuentinBisson

@T-Kukawka once the loki -> oki issue is fixed on the Honeybadger side, I think Atlas would only have to redo the tests with Loki and these two items:

(screenshot of the two remaining checklist items)

The fluent-logshipping-app change will be the main issue, I think.

QuentinBisson self-assigned this Feb 12, 2024

QuentinBisson commented Feb 13, 2024

Second issue for @giantswarm/team-honeybadger: user-values configmaps are not transferred when they are set on a default app.

On my garfish WC, I have this set by the cluster-operator on the app with the app.kubernetes.io/name=observability-bundle label:

  userConfig:
    configMap:
      name: atlastest-observability-bundle-user-values
      namespace: atlastest

But on the gazelle MC, this is rendered without the user-values configmap:

spec:
  catalog: default
  config:
    configMap:
      name: atlastest-cluster-values
      namespace: org-capa-migration-testing
    secret:
      name: ""
      namespace: ""
  extraConfigs:
  - kind: configMap
    name: psp-removal-patch
    namespace: org-capa-migration-testing
    priority: 150
  - kind: configMap
    name: atlastest-observability-bundle-logging-extraconfig
    namespace: org-capa-migration-testing
    priority: 25
  - kind: configMap
    name: psp-removal-patch
    namespace: org-capa-migration-testing
    priority: 150
  install: {}
  kubeConfig:
    context:
      name: ""
    inCluster: true
    secret:
      name: ""
      namespace: ""
  name: observability-bundle
  namespace: org-capa-migration-testing
  namespaceConfig: {}
  rollback: {}
  uninstall: {}
  upgrade: {}
  userConfig:
    configMap:
      name: ""
      namespace: ""
    secret:
      name: ""
      namespace: ""
  version: 1.2.1

I would expect them to be added to the app or to the default-apps-aws user values, but it is empty:

k get cm -n org-capa-migration-testing atlastest-default-apps-userconfig -oyaml
apiVersion: v1
data:
  values: |
    clusterName: atlastest
    organization: capa-migration-testing
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"values":"clusterName: atlastest\norganization: capa-migration-testing\n"},"kind":"ConfigMap","metadata":{"annotations":{},"creationTimestamp":null,"labels":{"giantswarm.io/cluster":"atlastest"},"name":"atlastest-default-apps-userconfig","namespace":"org-capa-migration-testing"}}
  creationTimestamp: "2024-02-13T21:09:53Z"
  labels:
    app-operator.giantswarm.io/watching: "true"
    giantswarm.io/cluster: atlastest
  name: atlastest-default-apps-userconfig
  namespace: org-capa-migration-testing
  resourceVersion: "165069994"
  uid: 867a30ee-713b-4338-8c08-168389f9c5e6
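A quick way to demonstrate the gap is to compare the rendered userConfig on both sides; the kubectl context and App CR names here are guesses based on the snippets above:

# Source (vintage) MC: the bundle app carries a user-values configmap
kubectl --context garfish -n atlastest get app observability-bundle \
  -o jsonpath='{.spec.userConfig.configMap.name}{"\n"}'
# -> atlastest-observability-bundle-user-values

# Target MC: the migrated app comes back with an empty userConfig
kubectl --context gazelle -n org-capa-migration-testing get app observability-bundle \
  -o jsonpath='{.spec.userConfig.configMap.name}{"\n"}'
# -> (empty)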

@ljakimczuk

Hey @QuentinBisson. @nce, who dealt with the migration, is on vacation, but to my knowledge the migration of default apps is not on us. If I get it right, the Observability Bundle, as part of the Default Apps app, should get configured by the CAPI migration CLI.

nce added a commit to giantswarm/app-migration-cli that referenced this issue Feb 17, 2024
giantswarm/roadmap#3209 (comment)
Refactored code for better testing; added regression test
nce added a commit to giantswarm/app-migration-cli that referenced this issue Feb 19, 2024
* Fix app naming bug bc/ of wrong trimming

giantswarm/roadmap#3209 (comment)
Refactored code for better testing; added regression test

* refactor
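The "wrong trimming" in the fix above would explain the loki -> oki rename exactly if the code used a cutset-style trim (TrimLeft/lstrip semantics) instead of a literal prefix trim; a hypothetical illustration:

# Cutset trim: strips every leading character in {a,t,l,e,s,-},
# which also eats the "l" of "loki":
echo "atlastest-loki" | sed 's/^[atles-]*//'    # -> oki

# Prefix trim: strips only the literal cluster prefix:
echo "atlastest-loki" | sed 's/^atlastest-//'   # -> loki

Re-prefixing the cutset-trimmed remainder with the cluster name would then produce atlastest-oki, the App name observed above.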
@QuentinBisson

Loki migration works fine, but I could not test user values. I will run that test tomorrow.

User values for default apps have been successfully migrated. Once the Loki tests have been run, all that's left is to release a new Keda app version to support Kubernetes 1.25 and to add IRSA support to fluent-logshipping-app.

@T-Kukawka

I have made adjustments in the tracking ticket as well as in the teams' tickets regarding the CAPA and migration testing instructions.

TL;DR: Testing of CAPA/Migration is moved from gazelle to grizzly

Initially, gazelle was chosen to test the CAPA migration, as it is a production MC and therefore the most stable one. However, this resulted in unforeseen pages to the kaas-cloud on-call that we would like to limit.

We do acknowledge the pages and actively work on the testing, but such pages are a distraction from operating the clusters to which most teams have migrated their GS production workloads.

Taking all of this into consideration, we have decided that it would be best to move the testing to grizzly, which is a stable-testing installation. The installation primarily runs e2e tests and is treated as stable (no changes on the MCs).
Thanks for understanding, and let us know if something is not working.

@QuentinBisson

All our apps have been tested. Now we need to close #3249 and https://github.com/giantswarm/giantswarm/issues/29861, and we are done.

@QuentinBisson

Prometheus-operator with a PV and a changed namespace has been tested.
Loki with S3-backed storage has been tested.

@QuentinBisson

All we have left is Keda.

@QuentinBisson

So Keda also supports Kubernetes up to 1.25; let's discuss when we need 1.26 support.

github-project-automation bot moved this from Inbox 📥 to Done ✅ in Roadmap Feb 22, 2024