TELCODOCS-2045 - Telco core RDS 4.18 docs #88622

Merged
1 commit merged on Feb 24, 2025
26 changes: 6 additions & 20 deletions _topic_maps/_topic_map.yml
@@ -3279,30 +3279,16 @@ Topics:
File: recommended-infrastructure-practices
- Name: Recommended etcd practices
File: recommended-etcd-practices
- Name: Telco core reference design
Dir: telco_core_ref_design_specs
Topics:
- Name: Telco core reference design specification
File: telco-core-rds
- Name: Telco RAN DU reference design
Dir: telco_ran_du_ref_design_specs
Topics:
- Name: Telco RAN DU RDS
- Name: Telco RAN DU reference design specification
File: telco-ran-du-rds
- Name: Reference design specifications
Dir: telco_ref_design_specs
Distros: openshift-origin,openshift-enterprise
Topics:
- Name: Telco reference design specifications
File: telco-ref-design-specs-overview
- Name: Telco core reference design specification
Dir: core
Topics:
- Name: Telco core reference design overview
File: telco-core-rds-overview
- Name: Telco core use model overview
File: telco-core-rds-use-cases
- Name: Core reference design components
File: telco-core-ref-design-components
- Name: Core reference design configuration CRs
File: telco-core-ref-crs
- Name: Telco core software specifications
File: telco-core-ref-software-artifacts
- Name: Comparing cluster configurations
Dir: cluster-compare
Distros: openshift-origin,openshift-enterprise
Binary file added images/openshift-telco-core-rds-networking.png
23 changes: 23 additions & 0 deletions modules/telco-core-about-the-telco-core-cluster-use-model.adoc
@@ -0,0 +1,23 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

:_mod-docs-content-type: REFERENCE
[id="telco-core-about-the-telco-core-cluster-use-model_{context}"]
= About the telco core cluster use model

The telco core cluster use model is designed for clusters that run on commodity hardware.
Telco core clusters support large-scale telco applications, including control plane functions such as signaling, aggregation, and session border controller (SBC), and centralized data plane functions such as 5G user plane functions (UPF).
Telco core cluster functions require scalability, complex networking support, and resilient software-defined storage, and have performance requirements that are less stringent and constrained than those of far-edge RAN deployments.

.Telco core RDS cluster service-based architecture and networking topology
image::openshift-5g-core-cluster-architecture-networking.png[5G core cluster showing a service-based architecture with overlaid networking topology]

Networking requirements for telco core functions vary widely across a range of networking features and performance points.
IPv6 is a requirement and dual-stack is common.
Some functions need maximum throughput and transaction rate and require support for user-plane DPDK networking.
Other functions use more typical cloud-native patterns and can rely on OVN-Kubernetes, kernel networking, and load balancing.

Telco core clusters are configured as standard with three control plane nodes and two or more worker nodes that run the stock (non-RT) kernel.
In support of workloads with varying networking and performance requirements, you can segment worker nodes by using `MachineConfigPool` custom resources (CRs), for example, for non-user data plane or high-throughput use cases.
In support of required telco operational features, core clusters have a standard set of Day 2 OLM-managed Operators installed.
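The following `MachineConfigPool` CR is a minimal sketch of how worker nodes can be segmented for a high-throughput or user data plane use case.
The pool name, labels, and node role used here are illustrative assumptions and are not part of the reference configuration.

[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-dpdk # hypothetical pool for user data plane nodes
  labels:
    machineconfiguration.openshift.io/role: worker-dpdk
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, worker-dpdk]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-dpdk: ""
----

Worker nodes labeled with `node-role.kubernetes.io/worker-dpdk` join this pool and can then receive their own `PerformanceProfile` or other machine configuration.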
11 changes: 11 additions & 0 deletions modules/telco-core-additional-storage-solutions.adoc
@@ -0,0 +1,11 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

:_mod-docs-content-type: REFERENCE
[id="telco-core-additional-storage-solutions_{context}"]
= Additional storage solutions

You can use other storage solutions to provide persistent storage for telco core clusters.
The configuration and integration of these solutions is outside the scope of the reference design specification (RDS).

Integration of the storage solution into the telco core cluster must include proper sizing and performance analysis to ensure the storage meets overall performance and resource usage requirements.
33 changes: 33 additions & 0 deletions modules/telco-core-agent-based-installer.adoc
@@ -0,0 +1,33 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

:_mod-docs-content-type: REFERENCE
[id="telco-core-agent-based-installer_{context}"]
= Agent-based Installer

New in this release::
* No reference design updates in this release

Description::
+
--
Telco core clusters can be installed by using the Agent-based Installer.
This method allows you to install OpenShift on bare-metal servers without requiring additional servers or VMs for managing the installation.
The Agent-based Installer can be run on any system (for example, from a laptop) to generate an ISO installation image.
The ISO is used as the installation media for the cluster supervisor nodes.
You can monitor installation progress by using the Agent-based Installer (ABI) tool from any system with network connectivity to the supervisor node's API interfaces.

ABI supports the following:

* Installation from declarative CRs
* Installation in disconnected environments
* Installation without additional supporting install or bastion servers
--
Limits and requirements::
* Disconnected installation requires a registry that is reachable from the installed host, with all required content mirrored in that registry.

Engineering considerations::
* Networking configuration should be applied as NMState configuration during installation, for example as in the sketch that follows this list.
Day 2 networking configuration by using the NMState Operator is not supported.
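The following sketch shows the kind of NMState configuration that can be embedded in the Agent-based Installer host configuration (for example, in the `networkConfig` field of a host entry).
The interface names, addresses, and bond settings are placeholder assumptions for illustration only.

[source,yaml]
----
networkConfig:
  interfaces:
    - name: bond0
      type: bond
      state: up
      link-aggregation:
        mode: 802.3ad # LACP, matching the active/active bonding baseline
        port:
          - eno1
          - eno2
      ipv4:
        enabled: true
        dhcp: false
        address:
          - ip: 10.0.10.20
            prefix-length: 24
      ipv6:
        enabled: true
        dhcp: false
        address:
          - ip: 2001:db8::20
            prefix-length: 64
  routes:
    config:
      - destination: 0.0.0.0/0
        next-hop-address: 10.0.10.1
        next-hop-interface: bond0
----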
37 changes: 37 additions & 0 deletions modules/telco-core-application-workloads.adoc
@@ -0,0 +1,37 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

:_mod-docs-content-type: REFERENCE
[id="telco-core-application-workloads_{context}"]
= Application workloads

Application workloads running on telco core clusters can include a mix of high performance cloud-native network functions (CNFs) and traditional best-effort or burstable pod workloads.

Guaranteed QoS scheduling is available to pods that require exclusive or dedicated use of CPUs due to performance or security requirements.
Typically, pods that run high-performance or latency-sensitive CNFs with user plane networking (for example, DPDK) require exclusive use of dedicated whole CPUs, achieved through node tuning and guaranteed QoS scheduling.
When creating pod configurations that require exclusive CPUs, be aware of the potential implications of hyper-threaded systems.
Pods should request multiples of 2 CPUs when the entire core (2 hyper-threads) must be allocated to the pod.

Pods running network functions that do not require high throughput or low latency networking should be scheduled as best-effort or burstable QoS pods and do not require dedicated or isolated CPU cores.

Engineering considerations::
+
--
Use the following information to plan telco core workloads and cluster resources:

* CNF applications should conform to the latest version of https://redhat-best-practices-for-k8s.github.io/guide/[Red Hat Best Practices for Kubernetes].
* Use a mix of best-effort and burstable QoS pods as required by your applications.
** Use guaranteed QoS pods with proper configuration of reserved or isolated CPUs in the `PerformanceProfile` CR that configures the node.
** Guaranteed QoS pods must include annotations for fully isolating CPUs, as shown in the example after this list.
** Best-effort and burstable pods are not guaranteed exclusive CPU use.
Workloads can be preempted by other workloads, operating system daemons, or kernel tasks.
* Use exec probes sparingly and only when no other suitable option is available.
** Do not use exec probes if a CNF uses CPU pinning.
Use other probe implementations, for example, `httpGet` or `tcpSocket`.
** When you need to use exec probes, limit the exec probe frequency and quantity.
The maximum number of exec probes must be kept below 10, and the probe interval must not be set to less than 10 seconds.
** You can use startup probes, because they do not use significant resources at steady-state operation.
This limitation on exec probes applies primarily to liveness and readiness probes.
Exec probes cause much higher CPU usage on management cores compared to other probe types because they require process forking.
--
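The following pod sketch illustrates the guidance above: guaranteed QoS with CPU requests in multiples of 2, CPU isolation annotations, and an `httpGet` probe instead of an exec probe.
The pod name, image, runtime class, and annotation values are assumptions for illustration and depend on the `PerformanceProfile` applied to the node.

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: example-dpdk-cnf # hypothetical workload name
  annotations:
    cpu-load-balancing.crio.io: "disable" # CPU isolation annotations; assumed to match the node tuning
    cpu-quota.crio.io: "disable"
    irq-load-balancing.crio.io: "disable"
spec:
  runtimeClassName: performance-example-profile # assumes a PerformanceProfile named "example-profile"
  containers:
    - name: cnf
      image: registry.example.com/cnf:latest # placeholder image
      resources:
        requests:
          cpu: "4" # multiple of 2 so that whole cores (2 hyper-threads each) are allocated
          memory: 4Gi
        limits:
          cpu: "4" # requests equal limits for guaranteed QoS
          memory: 4Gi
      readinessProbe:
        httpGet: # httpGet rather than an exec probe
          path: /healthz
          port: 8080
        periodSeconds: 10
----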
@@ -0,0 +1,48 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

:_mod-docs-content-type: REFERENCE
[id="telco-core-cluster-common-use-model-engineering-considerations_{context}"]
= Telco core cluster common use model engineering considerations

* Cluster workloads are detailed in "Application workloads".
* Worker nodes should run on either of the following CPUs:
** Intel 3rd Generation Xeon (IceLake) CPUs or better when supported by {product-title}, or CPUs with the silicon security bug (Spectre and similar) mitigations turned off.
Skylake and older CPUs can experience 40% transaction performance drops when Spectre and similar mitigations are enabled.
** AMD EPYC Zen 4 CPUs (Genoa, Bergamo, or newer) or better when supported by {product-title}.
+
[NOTE]
====
Currently, per-pod power management is not available for AMD CPUs.
====
* IRQ balancing is enabled on worker nodes.
The `PerformanceProfile` CR sets `globallyDisableIrqLoadBalancing` to false.
Guaranteed QoS pods are annotated to ensure isolation as described in "CPU partitioning and performance tuning".
* All cluster nodes should have the following characteristics:
** Hyper-Threading is enabled
** The CPU architecture is x86_64
** The stock (non-realtime) kernel is enabled
** The node is not configured for workload partitioning
* The balance between power management and maximum performance varies between machine config pools in the cluster.
The following configurations should be consistent for all nodes in a machine config pool.
* Cluster scaling.
See "Scalability" for more information.
* Clusters should be able to scale to at least 120 nodes.
* CPU partitioning is configured by using a `PerformanceProfile` CR and is applied to nodes on a per `MachineConfigPool` basis, as in the sketch at the end of this section.
See "CPU partitioning and performance tuning" for additional considerations.
* CPU requirements for {product-title} depend on the configured feature set and application workload characteristics.
For a cluster configured according to the reference configuration running a simulated workload of 3000 pods as created by the kube-burner node-density test, the following CPU requirements are validated:
** The minimum number of reserved CPUs for control plane and worker nodes is 2 CPUs (4 hyper-threads) per NUMA node.
** The NICs used for non-DPDK network traffic should be configured to use at least 16 RX/TX queues.
** Nodes with large numbers of pods or other resources might require additional reserved CPUs.
The remaining CPUs are available for user workloads.
+
[NOTE]
====
Variations in {product-title} configuration, workload size, and workload characteristics require additional analysis to determine the effect on the number of required CPUs for the OpenShift platform.
====
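The following `PerformanceProfile` sketch shows the kind of CPU partitioning described above.
The CPU ranges, hugepage counts, and profile name are illustrative assumptions that must be adapted to the actual hardware and workload.

[source,yaml]
----
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-core-profile # hypothetical profile name
spec:
  cpu:
    reserved: "0-1,32-33" # example only: 2 CPUs (4 hyper-threads) per NUMA node reserved for platform use
    isolated: "2-31,34-63" # remaining CPUs available for workloads
  globallyDisableIrqLoadBalancing: false # IRQ balancing stays enabled on worker nodes
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - size: 1G
        count: 32 # placeholder count
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  realTimeKernel:
    enabled: false # stock (non-realtime) kernel
----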
40 changes: 24 additions & 16 deletions modules/telco-core-cluster-network-operator.adoc
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_ref_design_specs/core/telco-core-ref-design-components.adoc
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

:_mod-docs-content-type: REFERENCE
[id="telco-core-cluster-network-operator_{context}"]
@@ -10,27 +10,35 @@ New in this release::
* No reference design updates in this release

Description::
+
--
The Cluster Network Operator (CNO) deploys and manages the cluster network components including the default OVN-Kubernetes network plugin during cluster installation.
The CNO allows for configuring primary interface MTU settings, OVN gateway modes to use node routing tables for pod egress, and additional secondary networks such as MACVLAN.

In support of network traffic separation, multiple network interfaces are configured through the CNO.
Traffic steering to these interfaces is configured through static routes applied by using the NMState Operator.
To ensure that pod traffic is properly routed, OVN-K is configured with the `routingViaHost` option enabled.
This setting uses the kernel routing table and the applied static routes rather than OVN for pod egress traffic.

The Whereabouts CNI plugin is used to provide dynamic IPv4 and IPv6 addressing for additional pod network interfaces without the use of a DHCP server.
--

Limits and requirements::
* OVN-Kubernetes is required for IPv6 support.

* Large MTU cluster support requires connected network equipment to be set to the same or larger value.
MTU size up to 8900 is supported.
//https://issues.redhat.com/browse/CNF-10593
* MACVLAN and IPVLAN cannot co-locate on the same main interface due to their reliance on the same underlying kernel mechanism, specifically the `rx_handler`.
This handler allows a third-party module to process incoming packets before the host processes them, and only one such handler can be registered per network interface.
Since both MACVLAN and IPVLAN need to register their own `rx_handler` to function, they conflict and cannot coexist on the same interface.
Review the source code for more details:
** https://elixir.bootlin.com/linux/v6.10.2/source/drivers/net/ipvlan/ipvlan_main.c#L82[linux/v6.10.2/source/drivers/net/ipvlan/ipvlan_main.c#L82]
** https://elixir.bootlin.com/linux/v6.10.2/source/drivers/net/macvlan.c#L1260[linux/v6.10.2/source/drivers/net/macvlan.c#L1260]
* Alternative NIC configurations include splitting the shared NIC into multiple NICs or using a single dual-port NIC, though they have not been tested and validated.
* Clusters with single-stack IP configuration are not validated.
* The `reachabilityTotalTimeoutSeconds` parameter in the `Network` CR configures the `EgressIP` node reachability check total timeout in seconds.
The recommended value is `1` second.
An example `Network` CR showing this setting is included at the end of this section.

Engineering considerations::
* Pod egress traffic is handled by kernel routing table using the `routingViaHost` option.
Appropriate static routes must be configured in the host.
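The following `Network` CR fragment is a minimal sketch of the gateway and egress settings discussed above.
The MTU value is shown only as an example of a large-MTU configuration and assumes that the connected network equipment is set to the same or a larger value.

[source,yaml]
----
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      mtu: 8900 # example large MTU; must not exceed what the network equipment supports
      gatewayConfig:
        routingViaHost: true # pod egress uses the kernel routing table and applied static routes
      egressIPConfig:
        reachabilityTotalTimeoutSeconds: 1 # recommended EgressIP node reachability check timeout
----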
47 changes: 47 additions & 0 deletions modules/telco-core-common-baseline-model.adoc
@@ -0,0 +1,47 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

:_mod-docs-content-type: REFERENCE
[id="telco-core-common-baseline-model_{context}"]
= Telco core common baseline model

The following configurations and use models are applicable to all telco core use cases.
The telco core use cases build on this common baseline of features.

Cluster topology::
Telco core clusters conform to the following requirements:

* High availability control plane (three or more control plane nodes)
* Non-schedulable control plane nodes
* Multiple machine config pools

Storage::
Telco core use cases require persistent storage as provided by {rh-storage-first}.

Networking::
Telco core cluster networking conforms to the following requirements:

* Dual stack IPv4/IPv6 (IPv4 primary).
* Fully disconnected – clusters do not have access to public networking at any point in their lifecycle.
* Supports multiple networks.
Segmented networking provides isolation between operations, administration and maintenance (OAM), signaling, and storage traffic.
* Cluster network type is OVN-Kubernetes as required for IPv6 support.
* Telco core clusters have multiple layers of networking supported by the underlying RHCOS, the SR-IOV Network Operator, the load balancer, and other components.
These layers include the following:
** Cluster networking layer.
The cluster network configuration is defined and applied through the installation configuration.
Update the configuration during Day 2 operations with the NMState Operator.
Use the initial configuration to establish the following:
*** Host interface configuration.
*** Active/active bonding (LACP).
** Secondary/additional network layer.
Configure the {product-title} CNI through the `additionalNetworks` field or `NetworkAttachmentDefinition` CRs.
Use the initial configuration to configure MACVLAN virtual network interfaces, as in the example at the end of this section.
** Application workload layer.
User plane networking runs in cloud-native network functions (CNFs).

Service Mesh::
Telco CNFs can use Service Mesh.
All telco core clusters require a Service Mesh implementation.
The choice of implementation and configuration is outside the scope of this specification.
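The following `NetworkAttachmentDefinition` sketch shows a MACVLAN secondary network with Whereabouts IPAM of the kind described in the secondary network layer above.
The network name, namespace, master interface, and address range are placeholder assumptions.

[source,yaml]
----
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: oam-macvlan # hypothetical secondary network for OAM traffic
  namespace: example-cnf # placeholder namespace
spec:
  config: |-
    {
      "cniVersion": "0.4.0",
      "name": "oam-macvlan",
      "type": "macvlan",
      "master": "bond0.100",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "10.0.100.0/24"
      }
    }
----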