Telco core RDS 4.18 docs

aireilly committed Feb 14, 2025
1 parent e2001e0 commit 85d2aca
Showing 51 changed files with 1,059 additions and 513 deletions.
18 changes: 5 additions & 13 deletions _topic_maps/_topic_map.yml
@@ -3256,6 +3256,11 @@ Topics:
File: recommended-infrastructure-practices
- Name: Recommended etcd practices
File: recommended-etcd-practices
- Name: Telco core reference design specifications
Dir: telco_core_ref_design_specs
Topics:
- Name: Telco core reference design specifications
File: telco-core-rds
- Name: Reference design specifications
Dir: telco_ref_design_specs
Distros: openshift-origin,openshift-enterprise
@@ -3275,19 +3280,6 @@ Topics:
File: telco-ran-ref-du-crs
- Name: Telco RAN DU software specifications
File: telco-ran-ref-software-artifacts
- Name: Telco core reference design specification
Dir: core
Topics:
- Name: Telco core reference design overview
File: telco-core-rds-overview
- Name: Telco core use model overview
File: telco-core-rds-use-cases
- Name: Core reference design components
File: telco-core-ref-design-components
- Name: Core reference design configuration CRs
File: telco-core-ref-crs
- Name: Telco core software specifications
File: telco-core-ref-software-artifacts
- Name: Comparing cluster configurations
Dir: cluster-compare
Distros: openshift-origin,openshift-enterprise
Binary file added images/openshift-telco-core-rds-networking.png
21 changes: 21 additions & 0 deletions modules/telco-core-about-the-telco-core-cluster-use-model.adoc
@@ -0,0 +1,21 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

[id="telco-core-about-the-telco-core-cluster-use-model_{context}"]
= About the telco core cluster use model

The telco core cluster use model is designed for clusters that run on commodity hardware.
Telco core clusters support large-scale telco applications, including control plane functions such as signaling, aggregation, and session border controller (SBC), and centralized data plane functions such as 5G user plane functions (UPF).
Telco core cluster functions require scalability, complex networking support, and resilient software-defined storage, and have performance requirements that are less stringent and constrained than those of far-edge RAN deployments.

.Telco core RDS cluster service-based architecture and networking topology
image::openshift-5g-core-cluster-architecture-networking.png[5G core cluster showing a service-based architecture with overlaid networking topology]

Networking requirements for telco core functions vary widely across a range of networking features and performance points.
IPv6 is a requirement and dual-stack is common.
Some functions need maximum throughput and transaction rate and require support for user-plane DPDK networking, while others make use of more typical cloud-native patterns and can rely on OVN-Kubernetes, kernel networking, and load balancing.

Telco core clusters are configured as standard with three control plane nodes and one or more worker nodes that run the stock (non-RT) kernel.
In support of workloads with varying networking and performance requirements, you can segment worker nodes by using `MachineConfigPool` custom resources (CRs), for example, for non-user data plane or high-throughput use cases.
In support of required telco operational features, core clusters have a standard set of Day 2 OLM-managed Operators installed.
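
The following example is a minimal sketch of a `MachineConfigPool` CR that segments worker nodes for a high-throughput use case.
The pool name, role label, and selector values are illustrative assumptions only and are not part of the validated reference configuration.

[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-high-throughput # Illustrative pool name
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, worker-high-throughput]
  nodeSelector:
    matchLabels:
      # Nodes join this pool when they carry the matching role label
      node-role.kubernetes.io/worker-high-throughput: ""
----

Nodes join the pool when they are labeled with the matching role, and tuning CRs such as `PerformanceProfile` can then target the pool selectively.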
10 changes: 10 additions & 0 deletions modules/telco-core-additional-storage-solutions.adoc
@@ -0,0 +1,10 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

[id="telco-core-additional-storage-solutions_{context}"]
= Additional storage solutions

You can use other storage solutions to provide persistent storage for telco core clusters.
The configuration and integration of these solutions are outside the scope of the RDS.

Integration of the storage solution into the telco core cluster must include proper sizing and performance analysis to ensure the storage meets overall performance and resource utilization requirements.
28 changes: 28 additions & 0 deletions modules/telco-core-agent-based-installer.adoc
@@ -0,0 +1,28 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

[id="telco-core-agent-based-installer_{context}"]
= Agent-based Installer

New in this release::
No reference design updates in this release.

Description::
Telco core clusters can be installed by using the Agent-based Installer (ABI).
This method allows you to install {product-title} on bare-metal servers without requiring additional servers or VMs to manage the installation.
The Agent-based Installer can be run on any system (for example, from a laptop) to generate an ISO installation image.
The ISO is used as the installation media for the cluster supervisor nodes.
You can monitor installation progress by using the ABI tool from any system that has network connectivity to the supervisor nodes' API interfaces.

ABI supports the following:

* Installation from declarative CRs
* Installation in disconnected environments
* No additional servers required to support installation, for example, the bastion node is no longer needed

Limits and requirements::
* Disconnected installation requires a registry with all required content mirrored that is reachable from the installed host.

Engineering considerations::
* Networking configuration should be applied as NMState configuration during installation rather than as Day 2 configuration by using the NMState Operator.
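
The following is a minimal sketch of an `agent-config.yaml` file that embeds NMState networking configuration for one host.
The cluster name, host name, MAC address, interface name, and IP addresses are illustrative assumptions, not validated reference values.

[source,yaml]
----
apiVersion: v1beta1
kind: AgentConfig
metadata:
  name: telco-core-cluster      # Illustrative cluster name
rendezvousIP: 192.168.10.10     # Example rendezvous node IP
hosts:
  - hostname: master-0
    interfaces:
      - name: eno1
        macAddress: 00:ef:44:21:e6:a5   # Example MAC address
    networkConfig:              # NMState configuration applied at install time
      interfaces:
        - name: eno1
          type: ethernet
          state: up
          ipv4:
            enabled: true
            dhcp: false
            address:
              - ip: 192.168.10.10
                prefix-length: 24
----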
35 changes: 35 additions & 0 deletions modules/telco-core-application-workloads.adoc
@@ -0,0 +1,35 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

[id="telco-core-application-workloads_{context}"]
= Application workloads

Application workloads running on telco core clusters can include a mix of high performance cloud-native network functions (CNFs) and traditional best-effort or burstable pod workloads.

Guaranteed QoS scheduling is available to pods that require exclusive or dedicated use of CPUs due to performance or security requirements.
Typically, pods that run high-performance or latency-sensitive CNFs by using user plane networking (for example, DPDK) require exclusive use of dedicated whole CPUs, which is achieved through node tuning and guaranteed QoS scheduling.
When creating pod configurations that require exclusive CPUs, be aware of the potential implications of hyper-threaded systems.
Pods should request multiples of 2 CPUs when the entire core (2 hyper-threads) must be allocated to the pod.

Pods running network functions that do not require high throughput or low latency networking should be scheduled with best-effort or burstable QoS and do not require dedicated or isolated CPU cores.

Engineering considerations::
+
--
Use the following information to plan telco core workloads and cluster resources.

* CNF applications should conform to the latest version of https://redhat-best-practices-for-k8s.github.io/guide/[Red Hat Best Practices for Kubernetes].
* Use a mix of best-effort and burstable QoS pods as required by your applications.
** Use guaranteed QoS pods with proper configuration of reserved or isolated CPUs in the `PerformanceProfile` CR that configures the node.
** Guaranteed QoS pods must include annotations for fully isolating CPUs, as shown in the example after this list.
** Best-effort and burstable pods are not guaranteed exclusive use of a CPU.
Workloads might be preempted by other workloads, OS daemons, or kernel tasks.
* Use exec probes sparingly and only when no other suitable options are available.
** Do not use exec probes if a CNF uses CPU pinning.
Use other probe implementations, for example, `httpGet` or `tcpSocket`.
** When you need to use exec probes, limit the exec probe frequency and quantity.
The number of exec probes must be kept below 10, and the probe interval must not be set to less than 10 seconds.
** You can use startup probes, since they do not use significant resources at steady-state operation.
The limitation on exec probes applies primarily to liveness and readiness probes.
--
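
The following example is a minimal sketch of a guaranteed QoS pod that requests two whole CPUs and uses annotations to fully isolate them by disabling CPU load balancing, CPU CFS quota, and IRQ load balancing for the pinned CPUs.
The pod name, image, runtime class name, and probe settings are illustrative assumptions; the annotations take effect only on nodes tuned with an appropriate `PerformanceProfile` CR.

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: example-cnf-pod                       # Illustrative name
  annotations:
    cpu-load-balancing.crio.io: "disable"     # Disable CPU load balancing for pinned CPUs
    cpu-quota.crio.io: "disable"              # Disable CPU CFS quota for pinned CPUs
    irq-load-balancing.crio.io: "disable"     # Keep device IRQs off the pinned CPUs
spec:
  runtimeClassName: performance-telco-core-profile  # RuntimeClass created by an assumed PerformanceProfile CR
  containers:
    - name: cnf-app
      image: registry.example.com/cnf-app:latest    # Illustrative image
      resources:
        requests:
          cpu: "2"           # Multiple of 2 so both hyper-threads of a core are allocated
          memory: 1Gi
        limits:
          cpu: "2"           # Requests equal limits for guaranteed QoS
          memory: 1Gi
      readinessProbe:
        tcpSocket:           # tcpSocket probe instead of an exec probe for a CPU-pinned CNF
          port: 8080
        periodSeconds: 10
----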
@@ -0,0 +1,47 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

[id="telco-core-cluster-common-use-model-engineering-considerations_{context}"]
= Telco core cluster common use model engineering considerations

. Cluster workloads are detailed in "Application workloads".
. Worker nodes should run on either of the following CPUs:
.. Intel 3rd Generation Xeon (IceLake) CPUs or better when supported by {product-title}, or CPUs that have the silicon security bug mitigations (Spectre and similar) turned off.
Skylake and older CPUs can experience 40% transaction performance drops when Spectre and similar mitigations are enabled.
.. AMD EPYC Zen 4 CPUs (Genoa, Bergamo, or newer) or better when supported by {product-title}.
+
[NOTE]
====
Power consumption evaluations for these CPUs are ongoing.
You should evaluate CPU features such as per-pod power management to determine any potential impact on your requirements.
====
.. IRQ balancing is enabled on worker nodes.
The `PerformanceProfile` CR sets `globallyDisableIrqLoadBalancing` to `false`.
Guaranteed QoS pods are annotated to ensure isolation as described in "CPU partitioning and performance tuning".

. All cluster nodes should have the following features:
.. Hyper-Threading enabled
.. x86_64 CPU architecture
.. Stock (non-realtime) kernel enabled
.. Nodes not configured for workload partitioning

. The balance between power management and maximum performance varies between `MachineConfigPool` groups in the cluster.
Configuration should be consistent for all nodes within a `MachineConfigPool` group.

. Cluster scaling: see "Scalability" for more information.
.. Clusters should be able to scale to at least 120 nodes.

. CPU partitioning is configured by using a `PerformanceProfile` CR and is applied to nodes on a per-`MachineConfigPool` basis.
See "CPU partitioning and performance tuning" for additional considerations and the example `PerformanceProfile` CR after this list.
. CPU requirements for the OpenShift platform depend on the configured feature set and application workload characteristics.
For a cluster configured according to the reference configuration and running a simulated workload of 3000 pods, as created by the kube-burner node-density test, the following CPU requirements are validated:
.. The minimum number of reserved CPUs for supervisor and worker nodes is 2 CPUs (4 hyper-threads) per NUMA node.
Nodes with large numbers of pods or other resources might require additional reserved CPUs.
The remaining CPUs are available for user workloads.

+
[NOTE]
====
Variations in OpenShift configuration, workload size, and workload characteristics require additional analysis to determine the effect on the number of required CPUs for the OpenShift platform.
====
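
The following is a minimal sketch of a `PerformanceProfile` CR that reserves 2 CPUs (4 hyper-threads) for platform services and keeps IRQ load balancing enabled.
The profile name and CPU ranges are illustrative assumptions for a single-NUMA, 32-core worker node with hyper-threading where sibling threads are offset by 32; they are not the validated reference configuration.

[source,yaml]
----
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: telco-core-profile        # Illustrative name
spec:
  cpu:
    reserved: "0-1,32-33"         # 2 cores (4 hyper-threads) reserved for platform services
    isolated: "2-31,34-63"        # Remaining CPUs available for guaranteed QoS workloads
  globallyDisableIrqLoadBalancing: false   # Keep IRQ balancing enabled; isolate per pod with annotations
  nodeSelector:
    node-role.kubernetes.io/worker: ""     # Targets nodes in the worker MachineConfigPool
----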
40 changes: 22 additions & 18 deletions modules/telco-core-cluster-network-operator.adoc
@@ -1,36 +1,40 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_ref_design_specs/core/telco-core-ref-design-components.adoc
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

:_mod-docs-content-type: REFERENCE
[id="telco-core-cluster-network-operator_{context}"]
= Cluster Network Operator

New in this release::
No reference design updates in this release.

Description::
The Cluster Network Operator (CNO) deploys and manages the cluster network components, including the default OVN-Kubernetes network plugin, during cluster installation.
It allows for configuring primary interface MTU settings, OVN gateway modes to use node routing tables for pod egress, and additional secondary networks such as MACVLAN.

In support of network traffic separation, multiple network interfaces are configured through the CNO.
Traffic steering to these interfaces is configured through static routes applied by using the NMState Operator.
To ensure that pod traffic is properly routed, OVN-Kubernetes is configured with the `routingViaHost` option enabled.
This setting uses the kernel routing table and the applied static routes rather than OVN for pod egress traffic.

The Whereabouts CNI plugin is used to provide dynamic IPv4 and IPv6 addressing for additional pod network interfaces without the use of a DHCP server.

Limits and requirements::
* OVN-Kubernetes is required for IPv6 support.

* Large MTU cluster support requires connected network equipment to be set to the same or a larger MTU value.
An MTU size of up to 8900 is supported.
//https://issues.redhat.com/browse/CNF-10593
* MACVLAN and IPVLAN cannot co-locate on the same main interface due to their reliance on the same underlying kernel mechanism, specifically the `rx_handler`.
This handler allows a third-party module to process incoming packets before the host processes them, and only one such handler can be registered per network interface.
Since both MACVLAN and IPVLAN need to register their own `rx_handler` to function, they conflict and cannot coexist on the same interface.
Review the source code for more details:
** https://elixir.bootlin.com/linux/v6.10.2/source/drivers/net/ipvlan/ipvlan_main.c#L82[linux/v6.10.2/source/drivers/net/ipvlan/ipvlan_main.c#L82]
** https://elixir.bootlin.com/linux/v6.10.2/source/drivers/net/macvlan.c#L1260[linux/v6.10.2/source/drivers/net/macvlan.c#L1260]
* Alternative NIC configurations include splitting the shared NIC into multiple NICs or using a single dual-port NIC, though they have not been tested and validated.
* Clusters with single-stack IP configuration are not validated.
* The `reachabilityTotalTimeoutSeconds` parameter in the `Network` CR configures the `EgressIP` node reachability check total timeout in seconds.
The recommended value is `1` second.

Engineering considerations::
* Pod egress traffic is handled by the kernel routing table when the `routingViaHost` option is enabled.
Appropriate static routes must be configured in the host.
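
As a minimal sketch, the following `Network` operator CR snippet shows the two settings described above: `routingViaHost` enabled so that pod egress traffic uses the kernel routing table, and `reachabilityTotalTimeoutSeconds` set to the recommended value of `1`.
Other fields required by the full reference configuration are omitted here.

[source,yaml]
----
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      gatewayConfig:
        routingViaHost: true               # Use the kernel routing table for pod egress
      egressIPConfig:
        reachabilityTotalTimeoutSeconds: 1 # Recommended EgressIP reachability check timeout
----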
46 changes: 46 additions & 0 deletions modules/telco-core-common-baseline-model.adoc
@@ -0,0 +1,46 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

[id="telco-core-common-baseline-model_{context}"]
= Telco core common baseline model

The following configurations and use models are applicable to all telco core use cases.
Telco core use cases build on this common baseline of features.

Cluster topology::
Telco core clusters conform to the following requirements:

* High availability control plane (three or more supervisor nodes)
* Non-schedulable supervisor nodes
* Multiple machine config pools

Storage::
Telco core use cases require persistent storage as provided by Red Hat OpenShift Data Foundation.
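
For example, a workload can request persistent block storage through a `PersistentVolumeClaim` similar to the following minimal sketch.
The claim name, size, and storage class are illustrative assumptions; `ocs-storagecluster-ceph-rbd` is the block storage class typically created by OpenShift Data Foundation.

[source,yaml]
----
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-cnf-data                        # Illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ocs-storagecluster-ceph-rbd # Assumed ODF RBD storage class
  resources:
    requests:
      storage: 100Gi                            # Illustrative size
----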

Networking::
Telco core cluster networking conforms to the following requirements:

* Dual stack IPv4/IPv6 (IPv4 primary)
* Fully disconnected: clusters do not have access to public networking at any point in their lifecycle.
* Supports multiple networks.
Segmented networking provides isolation between operations, administration and maintenance (OAM), signaling, and storage traffic.
* The cluster network type is OVN-Kubernetes, as required for IPv6 support.
* Telco core clusters have multiple layers of networking supported by the underlying RHCOS, the SR-IOV Network Operator, the load balancer, and other components.
These layers include the following:
** Cluster networking layer.
The cluster network configuration is defined and applied through the installation configuration.
Update the configuration during Day 2 operations with the NMState Operator.
Use the initial configuration to establish the following:
*** Host interface configuration
*** Active/active bonding (LACP)
** Secondary or additional network layer.
Configure the OpenShift CNI through network `additionalNetwork` or `NetworkAttachmentDefinition` CRs.
Use the initial configuration to configure MACVLAN virtual network interfaces, as shown in the example after this list.
** Application workload layer.
User plane networking runs in cloud-native network functions (CNFs).
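
The following example is a minimal sketch of a `NetworkAttachmentDefinition` CR that attaches a MACVLAN virtual interface with IP addressing managed by the Whereabouts CNI plugin.
The name, namespace, master interface, and address range are illustrative assumptions.

[source,yaml]
----
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: oam-macvlan            # Illustrative name
  namespace: example-cnf       # Illustrative namespace
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "bond0.100",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.100.0/24"
      }
    }
----

Pods attach to this network by referencing it in the `k8s.v1.cni.cncf.io/networks` annotation.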

Service Mesh::
Telco CNFs can use Service Mesh.
All telco core clusters should include a Service Mesh implementation.
The choice of implementation and configuration is outside the scope of this specification.