Conversation

@clarkzinzow
Collaborator

No description provided.

evgenLevin and others added 30 commits September 4, 2024 08:42
PrometheusRules allow recording pre-defined queries. Use the
`sriov_kubepoddevice` metric to add the `pod|namespace` pair
to the SR-IOV metrics.

The feature is enabled via the `METRICS_EXPORTER_PROMETHEUS_DEPLOY_RULE`
environment variable.

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
When the `metricsExporter` feature is turned off, deployed resources
should be removed. These changes fix the error:

```
│ 2024-08-28T14:07:57.699760017Z    ERROR    controller/controller.go:266    Reconciler error    {"controller": "sriovoperatorconfig", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovOperatorConfig", "SriovOperatorConfig": {"name":"default","namespace":"openshift-sriov-network-operator"},  │
│ "namespace": "openshift-sriov-network-operator", "name": "default", "reconcileID": "fa841c50-dbb8-4c4c-9ddd-b98624fd2a24", "error": "failed to delete object &{map[apiVersion:monitoring.coreos.com/v1 kind:ServiceMonitor metadata:map[name:sriov-network-metrics-exporter namespace:openshift-sriov-network-operator]  │
│ spec:map[endpoints:[map[bearerTokenFile:/var/run/secrets/kubernetes.io/serviceaccount/token honorLabels:true interval:30s port:sriov-network-metrics scheme:https tlsConfig:map[caFile:/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt insecureSkipVerify:false serverName:sriov-network-metrics-expor │
│ ter-service.openshift-sriov-network-operator.svc]]] namespaceSelector:map[matchNames:[openshift-sriov-network-operator]] selector:map[matchLabels:map[name:sriov-network-metrics-exporter-service]]]]} with err: could not delete object (monitoring.coreos.com/v1, Kind=ServiceMonitor) openshift-sriov-network-operato │
│ r/sriov-network-metrics-exporter: servicemonitors.monitoring.coreos.com \"sriov-network-metrics-exporter\" is forbidden: User \"system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator\" cannot delete resource \"servicemonitors\" in API group \"monitoring.coreos.com\" in the namespace \"ope │
│ nshift-sriov-network-operator\""}
```

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
[metrics 4/x] Metrics exporter rules
Refactor some conformance tests to use `SRIOV_NODE_AND_DEVICE_NAME_FILTER`
If the current object has annotations and the updated one doesn't, we still want
to add the ones from the current object.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
When a user deletes the default SriovOperatorConfig resource and
tries to recreate it afterwards, the operator webhook returns the error:
```
Error from server (InternalError): error when creating "/tmp/opconfig.yml": Internal error occurred: failed calling webhook "operator-webhook.sriovnetwork.openshift.io": failed to call webhook: Post "https://operator-webhook-service.openshift-sriov-network-operator.svc:443/validating-custom-resource?timeout=10s": service "operator-webhook-service" not found
```

as the webhook configuration is still present, while the Service and the DaemonSet have been deleted.

Delete all the webhook configurations when the user deletes
the default SriovOperatorConfig

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
Fix merge annotation function
Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
The bash syntax was incorrect and yielded:

    hack/env.sh: line 35: ${$RDMA_CNI_IMAGE:-}: bad substitution
Fix syntax for RDMA_CNI_IMAGE var substitution
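For reference, a minimal before/after sketch of the substitution (standard bash parameter expansion, not the literal hack/env.sh context):

```
# Broken: the extra "$" makes bash look for a parameter literally named
# "$RDMA_CNI_IMAGE", which is invalid and fails with "bad substitution".
# RDMA_CNI_IMAGE=${$RDMA_CNI_IMAGE:-}

# Fixed: expand the variable by name, defaulting to the empty string when unset.
RDMA_CNI_IMAGE=${RDMA_CNI_IMAGE:-}
```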
It might happen that two SR-IOV pods, deployed on different nodes, are using devices
with the same PCI address. In such cases, the query suggested [1] by the sriov-network-metrics-exporter produces the error:

```

Error loading values found duplicate series for the match group {pciAddr="0000:3b:02.4"} on the right hand-side of the operation:
    [
        {
            __name__="sriov_kubepoddevice",
            container="test",
            dev_type="openshift.io/intelnetdevice",
            endpoint="sriov-network-metrics",
            instance="10.1.98.60:9110",
            job="sriov-network-metrics-exporter-service",
            namespace="cnf-4916",
            pciAddr="0000:3b:02.4",
            pod="pod-cnfdr22.telco5g.eng.rdu2.redhat.com",
            prometheus="openshift-monitoring/k8s",
            service="sriov-network-metrics-exporter-service"
        }, {
            __name__="sriov_kubepoddevice",
            container="test",
            dev_type="openshift.io/intelnetdevice",
            endpoint="sriov-network-metrics",
            instance="10.1.98.230:9110",
            job="sriov-network-metrics-exporter-service",
            namespace="cnf-4916",
            pciAddr="0000:3b:02.4",
            pod="pod-dhcp-98-230.telco5g.eng.rdu2.redhat.com",
            prometheus="openshift-monitoring/k8s",
            service="sriov-network-metrics-exporter-service"
        }
    ];many-to-many matching not allowed: matching labels must be unique on one side
```

Configure the ServiceMonitor resource to add a `node` label to all metrics.
The correct query to get metrics, as updated in the PrometheusRule, becomes:

```
sriov_vf_tx_packets * on (pciAddr,node) group_left(pod,namespace,dev_type) sriov_kubepoddevice
```

Also remove the `pod`, `namespace`, and `container` labels from the `sriov_vf_*` metrics, as they were
wrongly set to `sriov-network-metrics-exporter-zj2n9`, `openshift-sriov-network-operator`, and `kube-rbac-proxy`.

[1] https://github.com/k8snetworkplumbingwg/sriov-network-metrics-exporter/blob/0f6a784f377ede87b95f31e569116ceb9775b5b9/README.md?plain=1#L38

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
When we want to use config-drive in immutable systems, very often the
config-drive is only used at boot and then unmounted (e.g. Ignition does
this).

Later, when we want to fetch metadata from the config-drive, we actually
have to mount it.

In this PR, I'm adding code similar to coreos/ignition, where we
dynamically mount the config-drive if a device is found with the
right label (config-2 or CONFIG-2, as documented in OpenStack). If the
device is found, we mount it, fetch the data, and unmount it.
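For illustration, a shell-level sketch of the same flow (the actual change is in the operator's Go code; the label names come from the OpenStack documentation, while the commands and the meta_data.json path are assumptions):

```
# Locate a block device labeled config-2 or CONFIG-2, mount it read-only,
# read the metadata, then unmount it again.
dev="$(blkid -L config-2 2>/dev/null || blkid -L CONFIG-2 2>/dev/null || true)"
if [ -n "$dev" ]; then
    mnt="$(mktemp -d)"
    mount -o ro "$dev" "$mnt"
    cat "$mnt/openstack/latest/meta_data.json"
    umount "$mnt"
    rmdir "$mnt"
fi
```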
Fixes the following shellcheck error:

    SC2068 (error): Double quote array expansions to avoid re-splitting elements.

https://www.shellcheck.net/wiki/SC2068
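An illustrative before/after (the `run_all` helper and array are hypothetical, not code from this repo):

```
args=("first arg" "second arg")
run_all() { printf '%s\n' "$@"; }

run_all ${args[@]}      # SC2068: unquoted expansion re-splits "first arg" into two words
run_all "${args[@]}"    # quoted expansion preserves each element as one argument
```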
Fixes the following shellcheck error:

    SC2148 (error): Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

https://www.shellcheck.net/wiki/SC2148
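The fix is to declare the target shell at the top of the file, for example:

```
#!/usr/bin/env bash
# ...rest of the script; shellcheck can now apply bash-specific checks.
```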
Fixes the following shellcheck errors:

    SC2145 (error): Argument mixes string and array. Use * or separate argument.
    SC2199 (error): Arrays implicitly concatenate in [[ ]]. Use a loop (or explicit * instead of @).

https://www.shellcheck.net/wiki/SC2145
https://www.shellcheck.net/wiki/SC2199

Also fixes a typo in SUPPORTED_INTERFACE_SWITCHER_MODES.
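Illustrative examples of both findings (the array and variable names are placeholders, not the actual script code):

```
MODES=(legacy switchdev)
mode="switchdev"

echo "supported modes: ${MODES[@]}"    # SC2145: string and array mixed in one argument
echo "supported modes: ${MODES[*]}"    # joins the array into a single word

[[ " ${MODES[@]} " == *" $mode "* ]]   # SC2199: array implicitly concatenates in [[ ]]
[[ " ${MODES[*]} " == *" $mode "* ]]   # explicit * (or iterate with a loop)
```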
Fixes the following shellcheck error:

    SC2045 (error): Iterating over ls output is fragile. Use globs.

https://www.shellcheck.net/wiki/SC2045
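A before/after sketch (the directory is just an example):

```
for dev in $(ls /sys/class/net); do echo "$dev"; done    # SC2045: breaks on unusual names
for dev in /sys/class/net/*; do echo "${dev##*/}"; done  # glob expansion is safe
```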
On some kernels, GetDevlinkDeviceParam may return empty values
for some kernel parameters. The netlink library is able to handle this, but
the code in the GetDevlinkDeviceParam function may panic if an unexpected value
is received. Add extra checks to avoid panics.
Delete webhooks when SriovOperatorConfig is deleted
Fix: GetDevlinkDeviceParam to handle edge-cases correctly
[metrics 5/x] Add node label to sriov_* metrics
The `sriov_kubepoddevice` metric might take a while to show up in the Prometheus
database, as the default scrape interval is 30s. This leads
to failures in the end-to-end lane like:

```
[sriov] Metrics Exporter When Prometheus operator is available [It] Metrics should have the correct labels
/root/opr-ocp2-1/data/sriov-network-operator/sriov-network-operator/test/conformance/tests/test_exporter_metrics.go:132

  [FAILED] no value for metric sriov_kubepoddevice
```

Put the metric assertion in an `Eventually` statement

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Fixes the following shellcheck error:

    SC2081 (error): [ .. ] can't match globs. Use a case statement.

https://www.shellcheck.net/wiki/SC2081
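A before/after sketch (the variable and pattern are illustrative):

```
driver="mlx5_core"

if [ "$driver" == mlx5* ]; then echo "Mellanox"; fi   # SC2081: [ .. ] never matches globs

case "$driver" in                                     # case (or [[ .. ]]) does glob matching
    mlx5*) echo "Mellanox" ;;
esac
```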
Warns about shellcheck issues with severity `error`.
metrics: Fix `Metrics should have the correct labels` test
CI: Add a bash linter to pre-submits
openstack: dynamically mount the config-drive
When the operator changes the device-plugin Spec (e.g. .Spec.NodeSelector), it may happen
that there are two device plugin pods for a given node, one that is terminating and one that is
initializing.
If the config-daemon executes `restartDevicePluginPod()` at the same time, it may kill the terminating
pod, while the initializing one will keep running with the old device-plugin configuration. This may cause one or more resources
not to be advertised until a manual device plugin restart occurs.

Make the config-daemon restart all the device-plugin instances it finds for its own node.

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
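The kubectl equivalent of what the config-daemon now does (the namespace, label, and node variable are assumptions for illustration):

```
# Delete every device-plugin pod scheduled on this node, not just the first match;
# the DaemonSet controller recreates them with the new configuration.
kubectl -n sriov-network-operator delete pod \
    -l app=sriov-device-plugin \
    --field-selector "spec.nodeName=${NODE_NAME}"
```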
e0ne and others added 26 commits December 16, 2024 22:36
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
This is needed because, after a reboot on a single node,
the operator webhook may not be ready.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
functest: add retry for rdma functional test
…SET block

When  "$SKIP_VAR_SET" is unset and the environment variables fallback to
the default, the check for valid values should be done. Move the check
out of the $SKIP_VAR_SET block for that.

For the current "hack/env.sh" this maybe not make an actual difference,
because probably the code to assign default values will ensure that
always valid value are set. Note that the openshift variant of the above
code will detect the default via skopeo, which can fail. For that
reason, this change makes more sense for openshift. However, also for
the current code, performing the same error checking after filling out
default values, ensures that the detected values are considered valid
Even if that is in fact always the case, it's not entirely trivial to
see.
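A minimal sketch of the restructuring, with placeholder variable names and default values (not the real hack/env.sh content):

```
if [ -z "${SKIP_VAR_SET:-}" ]; then
    # defaults are only assigned when the user did not opt out
    SRIOV_CNI_IMAGE="${SRIOV_CNI_IMAGE:-example.com/sriov-cni:latest}"
fi

# the validity check now runs outside the block, so detected/default values are verified too
if [ -z "${SRIOV_CNI_IMAGE:-}" ]; then
    echo "SRIOV_CNI_IMAGE must not be empty" >&2
    exit 1
fi
```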
[CVE-2024-45338](GHSA-w32m-9786-jp63)

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
bump `golang.org/x/net` to `v0.33.0`
If we run on a system where the PF is not connected to the network,
we can still use it for tests, but we need the link state not to be
auto.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
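For illustration, the VF administrative link state can be forced with iproute2 (the interface name and VF index are placeholders):

```
# Keep the VF link up even when the PF has no uplink, instead of leaving it on "auto".
ip link set dev "$PF_NAME" vf 0 state enable
```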
This will fix the issue we sometimes see: `<string>: Dump was interrupted and may be
inconsistent.\n`

https://docs.kernel.org/userspace-api/netlink/intro.html#dump-consistency

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
It's enough to configure the ib_core module in /etc/modprobe.d/ on
Ubuntu to change the RDMA subsystem mode.

This commit also adds an OS check to kargs.sh, because 'grubby'
isn't available in official Ubuntu repositories.

Kernel parameter configuration support for Ubuntu should be implemented
in a separate commit.
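A minimal sketch of the idea, assuming `netns_mode` is the ib_core parameter used to select the RDMA subsystem mode (0 = exclusive, 1 = shared):

```
# Persist the RDMA subsystem mode on Ubuntu via a modprobe.d drop-in,
# so it is applied when the ib_core module is loaded.
cat <<'EOF' > /etc/modprobe.d/ib_core.conf
options ib_core netns_mode=0
EOF
```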

Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
feat: Update controller logic to handle stale SriovNetworkNodeState CRs with delay
Skip kernel parameters configuration for Ubuntu
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
To use a non-default MTU, OVS supports the
"mtu_request" field when adding a port to
the bridge.

e.g.:
https://docs.openvswitch.org/en/latest/topics/dpdk/jumbo-frames/

Signed-off-by: Fred Rolland <frolland@nvidia.com>
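For reference, this maps to the ovs-vsctl form documented in the link above (bridge, port, and MTU values here are placeholders):

```
# Request a 9000-byte MTU for the port while adding it to the bridge.
ovs-vsctl add-port br0 pf0vf0 -- set Interface pf0vf0 mtu_request=9000
```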
With the introduction of the RDMA system mode change on bare-metal
systems, the run takes more than 1h, which is the default timeout for
Ginkgo.
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
[th/hack-env-check] hack/env.sh: move checking of environment variables outside SKIP_VAR_SET block
Support mtu_request for OVS
When creating a bridge with ovs-vsctl, an internal
interface is added by default.
The same behavior is added in this commit.

ovs-vsctl code ref:
https://github.com/openvswitch/ovs/blob/main/utilities/ovs-vsctl.c#L1597

Signed-off-by: Fred Rolland <frolland@nvidia.com>
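The ovs-vsctl behavior being mirrored can be seen directly (the bridge name is just an example):

```
# Creating a bridge also creates a local port/interface of type "internal"
# with the same name as the bridge.
ovs-vsctl add-br br-ex
ovs-vsctl get Interface br-ex type    # prints: internal
```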
Do not configure BlueField NICs in DPU mode
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
RDMA functional test improvements
@github-actions

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
Best regards.

@punkerpunker punkerpunker self-requested a review January 31, 2025 02:50

@punkerpunker punkerpunker left a comment

🫡

@clarkzinzow clarkzinzow merged commit 2e94ce4 into togethercomputer:master Jan 31, 2025
12 of 14 checks passed