
Re-try to get node object to determine platform type if we fail to get the object the first time. #459

Closed

Conversation

@bmorettodama commented Jun 16, 2023

With the current configuration, the config-daemon panics and exits if it fails to get the node object used to determine the platform type. We should instead retry before giving up. In managed environments, the apiserver may be undergoing maintenance or other events that make it temporarily unavailable.
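For illustration, a minimal sketch of the kind of retry being proposed, using apimachinery's wait package with a client-go clientset. The function name, backoff values, and surrounding wiring here are assumptions for the example, not the PR's actual code:

    import (
        "context"
        "time"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // getNodeWithRetry (hypothetical helper) retries the Get a few times before
    // giving up, so a brief apiserver outage (e.g. control-plane maintenance in
    // a managed environment) does not immediately crash the config daemon.
    func getNodeWithRetry(clientset kubernetes.Interface, name string) (*corev1.Node, error) {
        var node *corev1.Node
        backoff := wait.Backoff{Duration: time.Second, Factor: 2, Jitter: 0.1, Steps: 5}
        err := wait.ExponentialBackoff(backoff, func() (bool, error) {
            var getErr error
            node, getErr = clientset.CoreV1().Nodes().Get(context.TODO(), name, metav1.GetOptions{})
            if getErr != nil {
                // Treat the failure as transient and retry; after Steps attempts
                // wait.ExponentialBackoff returns wait.ErrWaitTimeout.
                return false, nil
            }
            return true, nil
        })
        return node, err
    }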

@github-actions bot commented

Thanks for your PR.
To run the vendors' CIs, use one of:

  • /test-all: run all tests for all vendors.
  • /test-e2e-all: run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: run all E2E tests for the NVIDIA vendor.

To skip the vendors' CIs, use one of:

  • /skip-all: skip all tests for all vendors.
  • /skip-e2e-all: skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: skip all E2E tests for the NVIDIA vendor.

Best regards.

@bmorettodama (Author) commented

/test-all

@SchSeba (Collaborator) left a comment

/LGTM

@@ -150,17 +152,34 @@ func runStartCmd(cmd *cobra.Command, args []string) {
destdir = "/host/tmp"
}

platformType := utils.Baremetal
backoff := wait.Backoff{
@adrianchiris (Collaborator) commented Jul 4, 2023

Not assuming the platform type makes sense.

I don't like adding retries if they're not needed. This Get call is generally expected to work; we should panic otherwise, as is done in other places here.

@bmorettodama (Author) replied

We would like to retry if possible, maybe with a shorter timeout if that works better for you. In managed environments, the apiserver may be undergoing maintenance or other events that make it temporarily unavailable, and it would be a better experience to retry instead of panicking right away.

A collaborator replied

I still don't understand why this can't just panic; since it's deployed as a DaemonSet, it would restart.

A collaborator replied

The only reason I can think of is to avoid hitting the exponential backoff on daemon restarts (kubelet's CrashLoopBackOff): the API server may be back up, but it can take a long time until kubelet starts the config daemon again because it has already failed many times.

A collaborator replied

In production I would run the k8s API in an HA configuration, so I don't expect this to happen very often.

@e0ne (Collaborator) left a comment

Please update the commit message to match the change. With this PR we still have the logic that assumes the baremetal platform by default.

@github-actions bot commented Jul 5, 2023

Thanks for your PR.
To run the vendors' CIs, use one of:

  • /test-all: run all tests for all vendors.
  • /test-e2e-all: run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: run all E2E tests for the NVIDIA vendor.

To skip the vendors' CIs, use one of:

  • /skip-all: skip all tests for all vendors.
  • /skip-e2e-all: skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: skip all E2E tests for the NVIDIA vendor.

Best regards.

@bmorettodama changed the title from "Modifying config-daemon to exit if platform type cannot be determined." to "Re-try to get node object to determine platform type if we fail to get the object the first time." on Jul 5, 2023
@bmorettodama (Author) replied

> Please update the commit message to match the change. With this PR we still have the logic that assumes the baremetal platform by default.

PTAL

@adrianchiris (Collaborator) commented

Added this PR to next week's meeting agenda to see if we want to go forward with it.

My take is that it's not needed, as the config daemon would just restart if such a scenario happens. Let's see what the other maintainers think.

@adrianchiris (Collaborator) commented

Feedback from today's community meeting:

The preference is to avoid the proposed retry logic and to panic if getting the node object is mandatory. This will trigger the config daemon to restart, so it is expected to eventually recover.

In production the API server is usually deployed in an HA configuration, so long periods of API server unavailability should not happen. We would need a better reason to add a retry for the node object.
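For contrast, a sketch of the shape the maintainers preferred: no retry, just fail and let the DaemonSet restart the pod. The variable names and the use of the standard library logger are assumptions for the example:

    node, err := clientset.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
    if err != nil {
        // No retry: crash and let the DaemonSet restart the pod. kubelet applies
        // CrashLoopBackOff between restarts, so the daemon eventually recovers
        // once the apiserver is reachable again.
        log.Panicf("failed to get node %s: %v", nodeName, err)
    }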

@bmorettodama (Author) commented

Closing this, as we can rely on the config daemon restarting if it fails.
