---
title: heterogeneous-architecture-clusters
authors:
- "@Prashanth684"
reviewers:
- "@bparees"
- "@derekwaynecarr"
- "@jupierce"
- "@sdodson"
- "@wking"
- "@LalatenduMohanty"
approvers:
- "@bparees"
- "@derekwaynecarr"
creation-date: 2022-01-17
last-updated: 2022-01-17
status: implementable
---

# heterogeneous-architecture-cluster-support

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for tech preview and GA
- [ ] User-facing documentation is created in [Openshift-docs](https://github.com/Openshift/Openshift-docs/)

## Summary

This enhancement describes the [plan][OCPPLAN-4577] to support provisioning and upgrading heterogeneous architecture OpenShift clusters. A heterogeneous cluster is a cluster whose nodes run on different architectures: the control plane is homogeneous and each worker node pool is homogeneous, but the pools may use any combination of architectures (e.g. an x86 control plane with ARM worker nodes).
This document describes the support needed for heterogeneous architecture clusters, delivered in phases: a prototype (Phase 0) followed by three major phases, with basic support in Phase 1 targeted for 4.11 and the remaining phases addressing gaps beyond the 4.11 time frame. The initial phases target the Azure
and AWS platforms with support for the x86 and ARM architectures.

![alt heterogeneous cluster](heterogeneous-cluster.png)

## Motivation

The motivation for supporting heterogeneous architecture clusters is to let users take advantage of the cost savings offered by architectures like ARM.

### Goals

- Provide a way to install and upgrade clusters with mixed-architecture node pools, including the ability to autoscale heterogeneous machinesets.
- Support provisioning heterogeneous architecture clusters in Hypershift.
- Generate manifest-listed release payloads for the four main architectures (x86, ARM, Power and Z).

### Non-Goals

- Manifestlist support for the internal image registry and OpenShift builds (hence CI will not be supported for heterogeneous architecture payloads).
- A heterogeneous control plane.

## Proposal

Support for installing and upgrading a heterogeneous architecture cluster and the associated components will be delivered in phases, ranging from a bare-bones installation with no upgrade support to full-fledged installation and upgrade capability across different topologies. The phases are listed below, followed by details on some of their components.

Phase 0: Prototype
- Aimed at delivering a prototype that users can experiment with.
- Aimed at Azure/AWS with the primary supported architectures being x86 + ARM.
- For supporting ARM nodes on Azure, [Hyper-V support][BZ-1949614] is required; it is landing in RHEL 8.6 and will be included in RHCOS for 4.11.
- oc adm command changes to support manifestlist payloads.
- Ability to assemble and publish manifest listed release payloads in a reproducible way.
- Does not include upgrade support.
- Is not intended for production.

Phase 1: GA of basic support
- A release pipeline needs to be set up to construct manifestlist-based release payloads.
- Upgrade support - CVO/Cincinnati support (heterogeneous -> heterogeneous payload upgrades only).
- Does not include support for:
- image registry - internal image registry will not support manifest listed images.
- imagestreams importing full manifestlist content (will continue to import a single image when referencing manifestlists).

Phase 2: Full basic functionality
- Imagestream manifestlist support.
- OLM/Operator ecosystem improvements.
- oc debug (and other commands) support for compute architecture.
- support toolbox.
- support for Power and Z architectures.
- Hypershift awareness.

Phase 3: Quality of life improvements
- First class install support - ability to create non-control-plane-architecture workers at install time and generate machineconfig and machineset manifests for the same.
- Improve console's content filtering by making it architecture aware.
- Possibly reduce mirroring costs by looking into things like sparse manifestlist.
- Upgrade support for homogeneous payload -> heterogeneous payload and vice versa.

### User Stories

- As an enterprise customer looking to save compute resource cost, I want to be able to run part of my application workload on ARM instead of x86.
- As the owner of an application that runs partially on x86 and partially on ARM, I want to be able to run the entire application within a single cluster so I do not need to pay the cost of higher latency between clusters, or worry about the HA-resilience of cross-cluster bridges.

### API Extensions

Upgrading to a heterogeneous payload will require an additional field on either the Infrastructure or the ClusterVersion resource to indicate that the user wants to move to a heterogeneous payload.
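
A purely hypothetical sketch of what such an opt-in might look like on the ClusterVersion resource; the resource, field name, and allowed values are exactly what is still being decided (see the Upgrades section below):

```sh
# Hypothetical illustration only: neither the resource (ClusterVersion vs.
# Infrastructure) nor the field name/value has been decided yet. The intent is
# a single administrator-controlled switch that tells the CVO to follow the
# heterogeneous ("multi") graph for subsequent upgrade recommendations.
oc patch clusterversion version --type merge \
  -p '{"spec":{"desiredUpdate":{"architecture":"Multi"}}}'
```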

### Implementation Details/Notes/Constraints

Operators supporting multiple architectures cannot stop supporting a given architecture, since doing so would break heterogeneous clusters that are trying to upgrade.

### Risks and Mitigations

Releases and upgrade graphs for all architectures have to move simultaneously and cannot be independent. If there is a bug on one architecture, it will cause the edge or the release image to be removed for all architectures.

## Design Details

### RHCOS Support for ARM Azure

The RHCOS pipeline will need to generate Azure disk images for ARM. However, RHEL does not currently support Hyper-V on ARM; that support is landing in RHEL 8.6, so 4.11 builds of RHCOS for ARM on Azure will be based on RHEL 8.6. It is TBD whether a separate pipeline with early-access RHCOS builds will be needed, since RHEL 8.6 will not have GA'ed before 4.11.

### OCP Build & Release

Phase 0 will involve building and publishing [nightly builds][OCPPLAN-7640]. The release payload will be a manifest-listed image, with the component images being manifest-listed as well. All four architectures (x86, ARM, Power and Z) will be included in the manifestlist. Existing architecture-specific release payloads continue to be delivered through their individual
release controllers. A new heterogeneous release controller is bootstrapped and displays "4-stable" entries for ART-constructed release payloads.

`oc adm release new` would be [enhanced][WRKLDS-370] to allow the creation of manifestlist release payloads. The manifestlist flow would be triggered whenever the CVO image in an imagestream is a manifestlist. If the CVO image is a standard manifest, the generated release payload will also be a manifest; if the CVO image is a manifestlist, the generated release
payload would be a manifestlist (containing a manifest for each architecture present in the CVO manifestlist). In either case, `oc adm release new` would permit non-CVO component images to be manifests or manifestlists and pass them through directly to the resultant release manifest(s). If a manifestlist release payload is generated, each architecture-specific
release payload manifest will reference the same pullspecs provided in the input imagestream.
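
For illustration, an assembled manifestlist payload can be examined with existing tooling; the pullspec below is illustrative only:

```sh
# Illustrative pullspec; heterogeneous payloads would be published to quay.io
# by the new release controller described below.
RELEASE=quay.io/openshift-release-dev/ocp-release:4.11.0-multi

# A manifestlist payload reports a manifest-list media type and one manifest
# entry per architecture.
skopeo inspect --raw "docker://$RELEASE" | jq '.mediaType, [.manifests[].platform.architecture]'

# Inspect the per-architecture manifest that matches a given node architecture.
oc image info --filter-by-os=linux/arm64 "$RELEASE"
```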

Since the internal OpenShift registry does not support manifestlists, the release controller would use quay.io. A new architecture option called "multi" will be added to Cincinnati.

### Upgrades - Cincinnati and CVO

In addition to the homogeneous payload graph that exists today, [support][OCPPLAN-7642] needs to be added for a heterogeneous payload graph, which the CVO will consult to make upgrade recommendations. There are a few upgrade scenarios to consider with the addition of heterogeneous payloads:

- heterogeneous -> heterogeneous payload upgrade
- homogeneous -> heterogeneous payload upgrade
- heterogeneous -> homogeneous payload upgrade

Support for homogeneous payloads will continue to exist so that customers running disconnected homogeneous clusters do not incur the cost of mirroring the additional architectures that heterogeneous payloads would carry. This means that, in order for the CVO to find the current release in the graph, it will have to know about the existing homogeneous graph in addition
to the new heterogeneous graph. There are a few options for making update recommendations, especially in the case of upgrading from a homogeneous to a heterogeneous payload:

- Have the heterogeneous graph include the homogeneous releases so the CVO can make recommendations based on the current homogeneous release image. This is complicated by the fact that there will be multiple releases with the same version but different payloads. The trigger to update to a heterogeneous payload would be an option, provided by the administrator through
either the ClusterVersion or Infrastructure API, to "move" to the heterogeneous graph.
- Have the two graphs be completely independent of each other, with an option in one of the aforementioned APIs to make the cluster heterogeneous; the CVO would then request the heterogeneous graph, find the current release version and update to it.

These options are under discussion; a finalized proposal will be put forth in the upgrades ticket referenced above, along with the API change that will include the option to indicate a move to the heterogeneous payload.

Cincinnati needs to [support][OCPPLAN-7643] parsing manifestlist images to be able to upgrade to a heterogeneous architecture payload.
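
The Cincinnati graph is already served with the architecture as a query parameter; the proposed "multi" value would select the heterogeneous graph. A sketch of the client-side queries (the "multi" value does not exist yet):

```sh
# Today: the homogeneous graph for a given channel and architecture.
curl -s -H 'Accept: application/json' \
  'https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.11&arch=amd64' \
  | jq '.nodes | length'

# Proposed: the same endpoint with the new "multi" architecture value, listing
# only heterogeneous (manifest-listed) release payloads.
curl -s -H 'Accept: application/json' \
  'https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.11&arch=multi' \
  | jq '.nodes | length'
```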

### Imagestreams

Imagestreams need to better support manifestlist images. Today they recognize manifestlists, but they import only the image matching the control plane architecture. The idea here would be to import the manifest-listed image in its entirety. This has further implications: because the internal registry does not support manifest-listed images, pullthrough will not be possible for manifestlists even with full manifestlist support in imagestreams.
We would ask users who need full manifestlist support to use quay.io rather than the internal registry.
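
A short sketch of today's import behavior, assuming a manifest-listed source image on quay.io (image name illustrative):

```sh
# Import a manifest-listed image into an imagestream. Today only the
# sub-manifest matching the control plane architecture is imported, so a pod
# on a node of another architecture that pulls via the imagestream tag gets
# the wrong image.
oc import-image tools:latest \
  --from=quay.io/example/multiarch-tools:latest --confirm

# The stored reference points at a single-architecture manifest digest rather
# than at the manifestlist digest.
oc get istag tools:latest -o jsonpath='{.image.dockerImageReference}{"\n"}'
```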

oc commands such as debug and must-gather use these imagestreams and would therefore deploy an x86 image onto the worker nodes, which fails on an ARM worker. As a workaround, the commands can be given a manifest-listed image via the `--image` parameter, which allows them to work on those worker nodes as well.
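
For example, a debug session on an ARM worker can be made to work today by passing a manifest-listed tools image explicitly (node name and pullspec illustrative):

```sh
# Without --image, oc debug resolves its image from the cluster's imagestreams,
# which today yields an x86 image that cannot run on an ARM node.
oc debug node/ip-10-0-1-23.ec2.internal \
  --image=quay.io/example/multiarch-tools:latest -- chroot /host uname -m
```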

### OLM

Operators need to specify which architectures they support via labels/annotations on their metadata, and there needs to be a mechanism to verify that an architecture is actually supported when such a label is present. All architectures should also share the same content; operator images may therefore need to be available even on architectures where the operator is not supported,
while the filtering ensures that only the supported architectures show up.
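
Operators already have a convention for advertising architecture support through labels on their ClusterServiceVersion metadata; a minimal sketch of how that existing convention could drive the filtering described here:

```sh
# Existing OLM convention: a ClusterServiceVersion declares supported
# architectures via metadata labels, for example:
#   metadata:
#     labels:
#       operatorframework.io/arch.amd64: supported
#       operatorframework.io/arch.arm64: supported
#       operatorframework.io/os.linux: supported
#
# List installed operators that declare ARM64 support:
oc get csv -A -l operatorframework.io/arch.arm64=supported
```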

### Machine Config Operator

No changes are needed in the MCO for Phase 1: the machine-os-content image would be a manifest-listed image, and the machine config daemon would extract the relevant architecture's machine-os-content based on the node it runs on. In the future, architecture-specific configurations should not be required; instead, certain items need to be templatized and generalized.
For example, kubelet system-reserved memory could scale based on page size.
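
A sketch of the per-architecture extraction the machine config daemon effectively performs, done by hand with existing oc flags (pullspec illustrative):

```sh
RELEASE=quay.io/openshift-release-dev/ocp-release:4.11.0-multi   # illustrative
# Resolve the machine-os-content image for one architecture of the payload...
OSCONTENT=$(oc adm release info --filter-by-os=linux/arm64 \
  --image-for=machine-os-content "$RELEASE")
# ...and extract that architecture's OS content, as the machine config daemon
# would on an ARM node.
mkdir -p /tmp/os-content
oc image extract --filter-by-os=linux/arm64 "$OSCONTENT" --path /:/tmp/os-content
```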

### Hypershift

The addition of heterogeneous support in OpenShift also opens up a new dimension for [support][OCPPLAN-5684] in Hypershift, where the control plane and the workers could run on different architectures. Several changes are needed to realize this, as there are limitations that must be addressed:

- When creating the hosted cluster, the `--release-image` option must be overridden to point at the manifest-listed release image so that worker nodes of any architecture can be deployed (the default is an x86 release image; see the sketch after this list). Even then, some components such as machine-approver and konnectivity-agent have hardcoded x86 images.
- The Hypershift config would have to provide a flag indicating that it is being installed on a heterogeneous architecture cluster, in which case it would pull the hosted cluster's images from the heterogeneous release payload rather than the default amd64 one.
- The hosted control plane's component images, which default to x86, will need overrides at config time. Today they can be overridden through annotations, but it would be better to support this as part of the initial creation.
- Another option would be to mandate that an x86 worker node be present for Hypershift and use a node selector to place the hosted cluster deployment on the x86 node.
- When creating the nodepool, the defaults for instance types and AMIs are x86-specific. The instance types can be overridden but the AMIs cannot; an option needs to be added to the create cluster command to override AMIs.
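
A sketch of what hosted cluster creation with a manifest-listed release image might look like; the release image tag is illustrative, flag availability varies by Hypershift version, and the AMI and component-image overrides are exactly the gaps listed above:

```sh
# Sketch only: deploys ARM (Graviton) workers by overriding the instance type
# and pointing --release-image at an illustrative manifest-listed payload.
hypershift create cluster aws \
  --name multiarch-demo \
  --node-pool-replicas 3 \
  --instance-type m6g.large \
  --release-image quay.io/openshift-release-dev/ocp-release:4.11.0-multi \
  --pull-secret ~/pull-secret.json \
  --aws-creds ~/.aws/credentials \
  --base-domain example.com
```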

### Console

Today, the console filters content based on the control plane architecture. For heterogeneous clusters this filtering needs to be removed entirely or made configurable.

### Install

For Phase 1, the way to add workers of another architecture is to handcraft a machineset manifest for the worker node, specifying architecture-specific details such as the instance type and image references, and then deploy the machineset. This can be semi-automated by generating manifests with `openshift-install create manifests` and modifying
or adding worker manifest files. Eventually the installer should support declaring differing compute and control plane sections in the install-config, mainly to allow their architectures to differ; it would then also generate machineset and machineconfig files tailored to the specific architecture, with the machineset files containing
the instance types and region corresponding to the architecture chosen for the workers.
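
A minimal sketch of such a handcrafted machineset for an ARM worker pool on AWS; the infrastructure ID is read from the cluster, while the AMI ID, instance type, region, and zone are illustrative and must match an arm64 RHCOS image and Graviton instance type available in the target account:

```sh
# Values such as the arm64 RHCOS AMI, instance type, region, and zone are
# cluster- and account-specific; shown only to illustrate the
# architecture-specific fields that must be handcrafted for Phase 1.
INFRA_ID=$(oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}')
cat <<EOF | oc create -f -
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: ${INFRA_ID}-worker-arm64-us-east-1a
  namespace: openshift-machine-api
  labels:
    machine.openshift.io/cluster-api-cluster: ${INFRA_ID}
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: ${INFRA_ID}
      machine.openshift.io/cluster-api-machineset: ${INFRA_ID}-worker-arm64-us-east-1a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: ${INFRA_ID}
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: ${INFRA_ID}-worker-arm64-us-east-1a
    spec:
      providerSpec:
        value:
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          kind: AWSMachineProviderConfig
          ami:
            id: ami-0123456789abcdef0        # arm64 RHCOS AMI for the region (illustrative)
          instanceType: m6g.xlarge           # Graviton (arm64) instance type
          placement:
            availabilityZone: us-east-1a
            region: us-east-1
          iamInstanceProfile:
            id: ${INFRA_ID}-worker-profile
          securityGroups:
            - filters:
                - name: tag:Name
                  values:
                    - ${INFRA_ID}-worker-sg
          subnet:
            filters:
              - name: tag:Name
                values:
                  - ${INFRA_ID}-private-us-east-1a
          tags:
            - name: kubernetes.io/cluster/${INFRA_ID}
              value: owned
          userDataSecret:
            name: worker-user-data
          credentialsSecret:
            name: aws-cloud-credentials
EOF
```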

### Open Questions

- What would be the method to provide update recommendations when moving from a homogeneous graph to a heterogeneous graph?
- Do OLM operator images have to include all architectures, even ones that are not supported by the operator?

### Test Plan

Testing for Phase 1:
- QE would need to test heterogeneous nightly payloads before the entry is added to Cincinnati. This implies that the heterogeneous payloads will be released a bit later than the homogeneous payloads.
- QE would test heterogeneous payloads on Azure and AWS platforms with the cluster configuration mainly being x86 control plane and ARM worker nodes.
- Testing scenarios would include machine set scaling, upgrades involving ARM nodes and OVN testing on heterogeneous clusters.
- New release payload acceptance jobs on heterogeneous clusters.
- QE testing.

Testing for Phase 2:
- With the addition of Power and Z architectures, there will be more permutations to test. We would need to figure out the absolute essential set of permutations and the tests to run on those.
- Imagestream conformance tests.

### Graduation Criteria

Phase 1 will have a pre-4.11 dev preview and 4.11 GA with documented limitations.

#### Dev Preview -> Tech Preview

TBD

#### Tech Preview -> GA

TBD

#### Removing a deprecated feature

This enhancement does not describe removing a feature of OpenShift, so this section is not applicable.

### Upgrade / Downgrade Strategy

The upgrade strategy is described in the Design Details section above. The important changes are a separate channel for the heterogeneous releases and the work needed to support, in the future, upgrades from a homogeneous release -> heterogeneous release and vice versa.

### Version Skew Strategy

TBD

### Operational Aspects of API Extensions

There will be minor changes to the ClusterVersion or Infrastructure APIs to support upgrades and possibly Hypershift. The imagestreams API would also undergo minor changes to introduce support for manifestlist import behavior.

#### Failure Modes

No new failure modes are introduced as part of this feature.

#### Support Procedures

TBD

## Implementation History

A heterogeneous payload was built as a proof of concept and used to spin up ARM worker nodes with an x86 control plane on AWS. It worked, but the following OCP features, explained in detail above, are known not to work properly in such an environment today:
- Imagestreams - does not fully support manifestlists, they will import a single architecture (by default based on the control plane’s architecture, though it can be chosen explicitly) from the manifestlist, so anything referencing an imagestream tag will get the image associated with that architecture, regardless of what architecture the node is running.
- Internal registry - does not support manifestlists (storing or pulling). This also means you cannot use the internal registry for image pull-through (proxying) for manifestlist-based content.
- Builds - do not produce manifestlists. Builds do support use of nodeselectors to steer the build to a node of the appropriate architecture.
- Console - does some content filtering (helm charts, operators) based on the perceived architecture of the cluster, which will be the control plane architecture, so not all content will be seen as available.
- Individual OLM operators - it will be hit or miss which operators support which architectures, or whether they support multiple architectures at all. We don't have specific data on this yet.
- Upgrades - we don’t have an upgrade graph for these payloads, and we have not yet tested even a manual upgrade from one heterogeneous payload to another, though we expect it to work without modification.
- `oc adm release` commands are not capable of building manifestlist-based payloads today, so you can't construct your own payload with the tooling.

## Drawbacks

The manifest-listed release payload would contain images for the four architectures, making the payload considerably larger and significantly increasing the time needed to mirror payloads. This is already true of the OLM catalogs today, which support multi-arch images; investigation is under way on sparse manifestlists, which would mirror only a subset of the manifests.
If this is realized, mirroring could intelligently filter out the architectures the user doesn't care about.

## Alternatives

The alternative to a heterogeneous architecture cluster would be multiple clusters with workloads communicating across them, but this has latency and cost implications, not to mention the additional failure mode of cross-cluster connectivity.

## Infrastructure Needed

The release team will add a new release controller to generate heterogeneous architecture payloads starting from 4.11.

[BZ-1949614]: https://bugzilla.redhat.com/show_bug.cgi?id=1949614
[OCPPLAN-4577]: https://issues.redhat.com/browse/OCPPLAN-4577
[OCPPLAN-7640]: https://issues.redhat.com/browse/OCPPLAN-7640
[WRKLDS-370]: https://issues.redhat.com/browse/WRKLDS-370
[OCPPLAN-7642]: https://issues.redhat.com/browse/OCPPLAN-7642
[OCPPLAN-7643]: https://issues.redhat.com/browse/OCPPLAN-7643
[OCPPLAN-5684]: https://issues.redhat.com/browse/OCPPLAN-5684