All images use `latest` when installing v1.4.0 version #784

Closed
jslouisyou opened this issue Oct 7, 2024 · 5 comments

@jslouisyou

Hello,

When I try to deploy the latest sriov-network-operator release (e.g. v1.4.0), the operator creates the sriov-network-config-daemon Pods as expected, but all images within those Pods use the latest tag (the tag is omitted, and as I understand it, latest is used by default when the tag is omitted; please correct me if I'm wrong).
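For reference, this is roughly how the image references on the running Pods can be checked (a quick sketch; the sriov-network-operator namespace is just the one from my setup):

$ kubectl get pods -n sriov-network-operator \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'

In my case every image is printed without a tag, which is what ends up resolving to latest.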

I can see that image tags aren't assigned in either the Helm chart values or the deploy script:

  1. Helm chart

    images:
      operator: ghcr.io/k8snetworkplumbingwg/sriov-network-operator
      sriovConfigDaemon: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon
      sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni
      ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni
      ovsCni: ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin
      rdmaCni: ghcr.io/k8snetworkplumbingwg/rdma-cni
      sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin
      resourcesInjector: ghcr.io/k8snetworkplumbingwg/network-resources-injector
      webhook: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-webhook
      metricsExporter: ghcr.io/k8snetworkplumbingwg/sriov-network-metrics-exporter
      metricsExporterKubeRbacProxy: gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0

  2. Shell script

    if [ -z $SKIP_VAR_SET ]; then
      export SRIOV_CNI_IMAGE=${SRIOV_CNI_IMAGE:-ghcr.io/k8snetworkplumbingwg/sriov-cni}
      export SRIOV_INFINIBAND_CNI_IMAGE=${SRIOV_INFINIBAND_CNI_IMAGE:-ghcr.io/k8snetworkplumbingwg/ib-sriov-cni}
      # OVS_CNI_IMAGE can be explicitly set to empty value, use default only if the var is not set
      export OVS_CNI_IMAGE=${OVS_CNI_IMAGE-ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin}
      # RDMA_CNI_IMAGE can be explicitly set to empty value, use default only if the var is not set
      export RDMA_CNI_IMAGE=${RDMA_CNI_IMAGE-ghcr.io/k8snetworkplumbingwg/rdma-cni}
      export SRIOV_DEVICE_PLUGIN_IMAGE=${SRIOV_DEVICE_PLUGIN_IMAGE:-ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin}
      export NETWORK_RESOURCES_INJECTOR_IMAGE=${NETWORK_RESOURCES_INJECTOR_IMAGE:-ghcr.io/k8snetworkplumbingwg/network-resources-injector}
      export SRIOV_NETWORK_CONFIG_DAEMON_IMAGE=${SRIOV_NETWORK_CONFIG_DAEMON_IMAGE:-ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon}
      export SRIOV_NETWORK_WEBHOOK_IMAGE=${SRIOV_NETWORK_WEBHOOK_IMAGE:-ghcr.io/k8snetworkplumbingwg/sriov-network-operator-webhook}
      export METRICS_EXPORTER_IMAGE=${METRICS_EXPORTER_IMAGE:-ghcr.io/k8snetworkplumbingwg/sriov-network-metrics-exporter}
      export SRIOV_NETWORK_OPERATOR_IMAGE=${SRIOV_NETWORK_OPERATOR_IMAGE:-ghcr.io/k8snetworkplumbingwg/sriov-network-operator}
      export METRICS_EXPORTER_KUBE_RBAC_PROXY_IMAGE=${METRICS_EXPORTER_KUBE_RBAC_PROXY_IMAGE:-gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0}
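Because these defaults only apply when the variables are unset (the ${VAR:-default} form), a workaround sketch is to export fully tagged references before running the deploy scripts; the tags below are the ones published with the v1.4.0 chart and are only an example:

    export SRIOV_NETWORK_OPERATOR_IMAGE=ghcr.io/k8snetworkplumbingwg/sriov-network-operator:v1.4.0
    export SRIOV_NETWORK_CONFIG_DAEMON_IMAGE=ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon:v1.4.0
    export SRIOV_CNI_IMAGE=ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.1
    export SRIOV_DEVICE_PLUGIN_IMAGE=ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.7.0
    # ...and likewise for the remaining *_IMAGE variables, then run the usual deploy entry point.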

But in v1.2.0, the images were pinned by digest:

if [ -z $SKIP_VAR_SET ]; then
  if ! skopeo -v &> /dev/null
  then
    echo "skopeo could not be found"
    exit 1
  fi
  CNI_IMAGE_DIGEST=$(skopeo inspect docker://quay.io/openshift/origin-sriov-cni | jq --raw-output '.Digest')
  export SRIOV_CNI_IMAGE=${SRIOV_CNI_IMAGE:-quay.io/openshift/origin-sriov-cni@${CNI_IMAGE_DIGEST}}
  INFINIBAND_CNI_IMAGE_DIGEST=$(skopeo inspect docker://quay.io/openshift/origin-sriov-infiniband-cni | jq --raw-output '.Digest')
  export SRIOV_INFINIBAND_CNI_IMAGE=${SRIOV_INFINIBAND_CNI_IMAGE:-quay.io/openshift/origin-sriov-infiniband-cni@${INFINIBAND_CNI_IMAGE_DIGEST}}
  DP_IMAGE_DIGEST=$(skopeo inspect docker://quay.io/openshift/origin-sriov-network-device-plugin | jq --raw-output '.Digest')
  export SRIOV_DEVICE_PLUGIN_IMAGE=${SRIOV_DEVICE_PLUGIN_IMAGE:-quay.io/openshift/origin-sriov-network-device-plugin@${DP_IMAGE_DIGEST}}
  INJECTOR_IMAGE_DIGEST=$(skopeo inspect docker://quay.io/openshift/origin-sriov-dp-admission-controller | jq --raw-output '.Digest')
  export NETWORK_RESOURCES_INJECTOR_IMAGE=${NETWORK_RESOURCES_INJECTOR_IMAGE:-quay.io/openshift/origin-sriov-dp-admission-controller@${INJECTOR_IMAGE_DIGEST}}
  DAEMON_IMAGE_DIGEST=$(skopeo inspect docker://quay.io/openshift/origin-sriov-network-config-daemon | jq --raw-output '.Digest')
  export SRIOV_NETWORK_CONFIG_DAEMON_IMAGE=${SRIOV_NETWORK_CONFIG_DAEMON_IMAGE:-quay.io/openshift/origin-sriov-network-config-daemon@${DAEMON_IMAGE_DIGEST}}
  WEBHOOK_IMAGE_DIGEST=$(skopeo inspect docker://quay.io/openshift/origin-sriov-network-webhook | jq --raw-output '.Digest')
  export SRIOV_NETWORK_WEBHOOK_IMAGE=${SRIOV_NETWORK_WEBHOOK_IMAGE:-quay.io/openshift/origin-sriov-network-webhook@${WEBHOOK_IMAGE_DIGEST}}
  OPERATOR_IMAGE_DIGEST=$(skopeo inspect docker://quay.io/openshift/origin-sriov-network-operator | jq --raw-output '.Digest')
  export SRIOV_NETWORK_OPERATOR_IMAGE=${SRIOV_NETWORK_OPERATOR_IMAGE:-quay.io/openshift/origin-sriov-network-operator@${OPERATOR_IMAGE_DIGEST}}
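The digest lookup that the v1.2.0 script performs can still be reproduced by hand against the new ghcr.io images, for example (requires skopeo and jq; the sriov-cni image is just an illustration):

    CNI_IMAGE_DIGEST=$(skopeo inspect docker://ghcr.io/k8snetworkplumbingwg/sriov-cni | jq --raw-output '.Digest')
    echo "ghcr.io/k8snetworkplumbingwg/sriov-cni@${CNI_IMAGE_DIGEST}"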

Given this, is it intended that all containers use the latest image in v1.4.0?
If not, could you please set proper tags for all images?

Thanks.

@jslouisyou
Author

Hi, I could pull the Helm chart like this:
helm pull oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart --version 1.4.0
and it seems that the image tags for all containers are set there.

images:
  operator: ghcr.io/k8snetworkplumbingwg/sriov-network-operator:v1.4.0
  sriovConfigDaemon: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon:v1.4.0
  sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.1
  ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:v1.1.1
  ovsCni: ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin:v0.34.2
  rdmaCni: ghcr.io/k8snetworkplumbingwg/rdma-cni:v1.2.0
  sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.7.0
  resourcesInjector: ghcr.io/k8snetworkplumbingwg/network-resources-injector:v1.6.0
  webhook: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-webhook:v1.4.0
  metricsExporter: ghcr.io/k8snetworkplumbingwg/sriov-network-metrics-exporter:v1.1.0
  metricsExporterKubeRbacProxy: gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0

I think I can use these images for v1.4.0.
Could you please confirm that the above images are set up correctly?
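For completeness, installing directly from that published chart should pull the same pinned images; a minimal sketch (release and namespace names are arbitrary):

$ helm install sriov-network-operator \
    oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart \
    --version 1.4.0 --namespace sriov-network-operator --create-namespace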

@zeeke
Member

zeeke commented Oct 8, 2024

hi @jslouisyou, I think the point here is that we can no longer deploy a tagged release just by checking out the source code.
Since the Helm package is published when tagging, I think the helm pull ... command is enough for the job.

Images look correct to me.

Are you experiencing any other issues during the deployment?

@jslouisyou
Author

Hi @zeeke, I'm facing an issue while creating VFs with v1.4.0: the IB devices disappear at the end of VF creation (it works in v1.3.0, by the way).
First of all, I think the comment below is quite different from this thread's topic, so please let me know if I should create another issue.

I used the same configuration (e.g. SriovNetworkNodePolicy) for creating VFs in both versions.

Here's the SriovNetworkNodePolicy that I used:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-gpu2-ib2
  namespace: sriov-network-operator
spec:
  isRdma: true
  linkType: ib
  nicSelector:
    deviceID: "1021"
    pfNames:
    - ibp157s0
    vendor: 15b3
  nodeSelector:
    node-role.kubernetes.io/gpu: ""
  numVfs: 8
  priority: 10
  resourceName: gpu2_mlnx_ib2
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-gpu2-ib3
  namespace: sriov-network-operator
spec:
  isRdma: true
  linkType: ib
  nicSelector:
    deviceID: "1021"
    pfNames:
    - ibp211s0
    vendor: 15b3
  nodeSelector:
    node-role.kubernetes.io/gpu: ""
  numVfs: 8
  priority: 10
  resourceName: gpu2_mlnx_ib3
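As a side note, the per-node result of these policies can be checked through the SriovNetworkNodeState objects that the operator maintains (a sketch, assuming the sriov-network-operator namespace used above):

$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -n sriov-network-operator
$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io <node-name> -n sriov-network-operator -o yaml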

And I'm using an H100 node with ConnectX-7 IB adapters:

$ mst status -v
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf5      e5:00.0   mlx5_5          net-ibp229s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf4      d3:00.0   mlx5_4          net-ibp211s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf3      c1:00.0   mlx5_3          net-ibp193s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf2      9d:00.0   mlx5_2          net-ibp157s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf1      54:00.0   mlx5_1          net-ibp84s0               0   
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf0      41:00.0   mlx5_0          net-ibp65s0               0

$ lspci -s 41:00.0 -vvn
41:00.0 0207: 15b3:1021
	Subsystem: 15b3:0041
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 18
	NUMA node: 0
	Region 0: Memory at 23e044000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at <ignored> [disabled]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 32GT/s, Width x16, ASPM not supported
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 32GT/s (ok), Width x16 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: ReqEn+
		LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [48] Vital Product Data
		Product Name: Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter
		Read-only fields:
			[PN] Part number: 0RYMTY
			[EC] Engineering changes: A02
			[MN] Manufacture ID: 1028
			[SN] Serial number: IN0RYMTYJBNM43BRJ4KF
			[VA] Vendor specific: DSV1028VPDR.VER2.1
			[VB] Vendor specific: FFV28.39.10.02
			[VC] Vendor specific: NPY1
			[VD] Vendor specific: PMTD
			[VE] Vendor specific: NMVNvidia, Inc.
			[VH] Vendor specific: L1D0
			[VU] Vendor specific: IN0RYMTYJBNM43BRJ4KFMLNXS0D0F0 
			[RV] Reserved: checksum good, 0 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
		AERCap:	First Error Pointer: 04, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [1c0 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [320 v1] Lane Margining at the Receiver <?>
	Capabilities: [370 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [3b0 v1] Extended Capability ID 0x2a
	Capabilities: [420 v1] Data Link Feature <?>
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core

And I pulled the v1.3.0 and v1.4.0 Helm charts from oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart, and the image tags differ:

  1. v1.3.0
images:
  operator: ghcr.io/k8snetworkplumbingwg/sriov-network-operator:v1.3.0
  sriovConfigDaemon: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon:v1.3.0
  sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.0
  ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:v1.1.1
  ovsCni: ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin:v0.34.0
  sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.7.0
  resourcesInjector: ghcr.io/k8snetworkplumbingwg/network-resources-injector:v1.6.0
  webhook: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-webhook:v1.3.0

  2. v1.4.0
images:
  operator: ghcr.io/k8snetworkplumbingwg/sriov-network-operator:v1.4.0
  sriovConfigDaemon: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon:v1.4.0
  sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.1
  ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:v1.1.1
  ovsCni: ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin:v0.34.2
  rdmaCni: ghcr.io/k8snetworkplumbingwg/rdma-cni:v1.2.0
  sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.7.0
  resourcesInjector: ghcr.io/k8snetworkplumbingwg/network-resources-injector:v1.6.0
  webhook: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-webhook:v1.4.0
  metricsExporter: ghcr.io/k8snetworkplumbingwg/sriov-network-metrics-exporter:v1.1.0
  metricsExporterKubeRbacProxy: gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0

As you know, the sriov-device-plugin Pods are created when a SriovNetworkNodePolicy is deployed.
After that, my H100 nodes' status changed from sriovnetwork.openshift.io/state: Idle to sriovnetwork.openshift.io/state: Reboot_Required, and the nodes rebooted after some time.

But in v1.4.0, it seems that the VFs were created but eventually disappeared, and even the PF disappeared. Here are the logs from dmesg:

[  115.692158] pci 0000:41:00.1: [15b3:101e] type 00 class 0x020700
[  115.692321] pci 0000:41:00.1: enabling Extended Tags
[  115.694112] mlx5_core 0000:41:00.1: enabling device (0000 -> 0002)
[  115.694789] mlx5_core 0000:41:00.1: firmware version: 28.39.1002
[  115.867939] mlx5_core 0000:41:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  115.867943] mlx5_core 0000:41:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  115.892812] pci 0000:41:00.2: [15b3:101e] type 00 class 0x020700
[  115.892967] pci 0000:41:00.2: enabling Extended Tags
[  115.894706] mlx5_core 0000:41:00.2: enabling device (0000 -> 0002)
[  115.895344] mlx5_core 0000:41:00.2: firmware version: 28.39.1002
[  115.895423] mlx5_core 0000:41:00.1 ibp65s0v0: renamed from ib0
[  116.065557] mlx5_core 0000:41:00.2: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.065561] mlx5_core 0000:41:00.2: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.090478] pci 0000:41:00.3: [15b3:101e] type 00 class 0x020700
[  116.090634] pci 0000:41:00.3: enabling Extended Tags
[  116.093559] mlx5_core 0000:41:00.3: enabling device (0000 -> 0002)
[  116.093993] mlx5_core 0000:41:00.2 ibp65s0v1: renamed from ib0
[  116.094189] mlx5_core 0000:41:00.3: firmware version: 28.39.1002
[  116.293582] mlx5_core 0000:41:00.3: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.293587] mlx5_core 0000:41:00.3: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.318209] pci 0000:41:00.4: [15b3:101e] type 00 class 0x020700
[  116.318368] pci 0000:41:00.4: enabling Extended Tags
[  116.320079] mlx5_core 0000:41:00.4: enabling device (0000 -> 0002)
[  116.320712] mlx5_core 0000:41:00.4: firmware version: 28.39.1002
[  116.320871] mlx5_core 0000:41:00.3 ibp65s0v2: renamed from ib0
.....
[  446.036867] mlx5_core 0000:41:01.0 ibp65s0v7: renamed from ib0
[  446.464555] mlx5_core 0000:41:00.0: mlx5_wait_for_pages:898:(pid 6868): Skipping wait for vf pages stage
[  448.848149] mlx5_core 0000:41:00.0: driver left SR-IOV enabled after remove                                               <----------- weird
[  449.108562] mlx5_core 0000:41:00.2: poll_health:955:(pid 0): Fatal error 3 detected
[  449.108602] mlx5_core 0000:41:00.4: poll_health:955:(pid 0): Fatal error 3 detected
[  449.108620] mlx5_core 0000:41:00.2: mlx5_health_try_recover:375:(pid 1478): handling bad device here
[  449.108627] mlx5_core 0000:41:00.2: mlx5_handle_bad_state:326:(pid 1478): starting teardown
[  449.108629] mlx5_core 0000:41:00.2: mlx5_error_sw_reset:277:(pid 1478): start
[  449.108646] mlx5_core 0000:41:00.4: mlx5_health_try_recover:375:(pid 2283): handling bad device here
[  449.108660] mlx5_core 0000:41:00.4: mlx5_handle_bad_state:326:(pid 2283): starting teardown
[  449.108661] mlx5_core 0000:41:00.4: mlx5_error_sw_reset:277:(pid 2283): start
[  449.108672] mlx5_core 0000:41:00.2: mlx5_error_sw_reset:310:(pid 1478): end
[  449.108694] mlx5_core 0000:41:00.4: mlx5_error_sw_reset:310:(pid 2283): end
[  449.876577] mlx5_core 0000:41:00.5: poll_health:955:(pid 0): Fatal error 3 detected
[  449.876642] mlx5_core 0000:41:00.5: mlx5_health_try_recover:375:(pid 1000): handling bad device here
[  449.876649] mlx5_core 0000:41:00.5: mlx5_handle_bad_state:326:(pid 1000): starting teardown
[  449.876651] mlx5_core 0000:41:00.5: mlx5_error_sw_reset:277:(pid 1000): start
[  449.877266] mlx5_core 0000:41:00.5: mlx5_error_sw_reset:310:(pid 1000): end
[  450.381036] mlx5_core 0000:41:00.2: mlx5_health_try_recover:381:(pid 1478): starting health recovery flow

After that, when I ran mst status -v, the node could not even find the PF itself:

$ mst status -v
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf5      e5:00.0   mlx5_5          net-ibp229s0              1     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf4      d3:00.0                             1                                  <---- it goes empty
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf3      c1:00.0   mlx5_3          net-ibp193s0              1     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf2      9d:00.0                             1                                  <---- it goes empty     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf1      54:00.0   mlx5_1          net-ibp84s0               0     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf0      41:00.0   mlx5_0          net-ibp65s0               0 
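For additional context, the SR-IOV state of an affected PF can also be inspected from the host side (illustrative commands only; the interface name is one of the PFs configured above, and these would presumably fail or come back empty once the PF has disappeared, consistent with the mst output):

$ cat /sys/class/net/ibp157s0/device/sriov_numvfs
$ lspci -d 15b3:
$ ip link show ibp157s0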

Do you know anything about this situation? Anything would be very helpful.

Thanks.

@zeeke
Member

zeeke commented Oct 8, 2024

Yes, please pack all this information into a new issue. It will help other users find the information more easily.

@jslouisyou
Author

Thanks @zeeke. I'll wrap this up and raise a new issue then.
