
Re-try to get node object to determine platform type if we fail to get the object the first time. #459

Closed

Conversation

@bmorettodama commented Jun 16, 2023

With the current configuration, the config-daemon panics and exits if it fails to get the node object used to determine the platform type. We should instead retry before giving up. In managed environments, the apiserver may be undergoing maintenance or other events that make it temporarily unavailable.
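For illustration, a minimal sketch of the kind of retry being proposed, using apimachinery's wait package with a client-go clientset. The function name, backoff values, and surrounding wiring here are assumptions for the example, not the PR's actual code:

    import (
        "context"
        "time"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // getNodeWithRetry (hypothetical helper) retries the Get a few times before
    // giving up, so a brief apiserver outage (e.g. control-plane maintenance in
    // a managed environment) does not immediately crash the config daemon.
    func getNodeWithRetry(clientset kubernetes.Interface, name string) (*corev1.Node, error) {
        var node *corev1.Node
        backoff := wait.Backoff{Duration: time.Second, Factor: 2, Jitter: 0.1, Steps: 5}
        err := wait.ExponentialBackoff(backoff, func() (bool, error) {
            var getErr error
            node, getErr = clientset.CoreV1().Nodes().Get(context.TODO(), name, metav1.GetOptions{})
            if getErr != nil {
                // Treat the failure as transient and retry; after Steps attempts
                // wait.ExponentialBackoff returns wait.ErrWaitTimeout.
                return false, nil
            }
            return true, nil
        })
        return node, err
    }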

@github-actions bot commented

Thanks for your PR.
To run the vendors' CIs, use one of:

  • /test-all: run all tests for all vendors.
  • /test-e2e-all: run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: run all E2E tests for the NVIDIA vendor.

To skip the vendors' CIs, use one of:

  • /skip-all: skip all tests for all vendors.
  • /skip-e2e-all: skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: skip all E2E tests for the NVIDIA vendor.

Best regards.

@bmorettodama (Author) commented

/test-all

@SchSeba (Collaborator) left a comment

/LGTM

@@ -150,17 +152,34 @@ func runStartCmd(cmd *cobra.Command, args []string) {
destdir = "/host/tmp"
}

platformType := utils.Baremetal
backoff := wait.Backoff{
@adrianchiris (Collaborator) commented Jul 4, 2023

Not assuming the platform type makes sense.

I don't like adding retries if they're not needed. This Get call is generally expected to work; we should panic otherwise, as is done in other places here.

@bmorettodama (Author) replied

We would like to retry if possible, maybe with a shorter timeout if that works better for you. In managed environments, the apiserver may be undergoing maintenance or other events that make it temporarily unavailable, and it would be a better experience to retry instead of panicking right away.

A collaborator replied

I still don't understand why this can't just panic; since it's deployed as a DaemonSet, it would restart.

A collaborator replied

The only reason I can think of is to avoid hitting the exponential backoff on daemon restarts (kubelet's CrashLoopBackOff): the API server may be back up, but it can take a long time until kubelet starts the config daemon again because it has already failed many times.

A collaborator replied

In production I would run the k8s API in an HA configuration, so I don't expect this to happen very often.

@e0ne (Collaborator) left a comment

Please update the commit message to match the change. With this PR we still have the logic that assumes the baremetal platform by default.

@github-actions bot commented Jul 5, 2023

Thanks for your PR.
To run the vendors' CIs, use one of:

  • /test-all: run all tests for all vendors.
  • /test-e2e-all: run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: run all E2E tests for the NVIDIA vendor.

To skip the vendors' CIs, use one of:

  • /skip-all: skip all tests for all vendors.
  • /skip-e2e-all: skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: skip all E2E tests for the NVIDIA vendor.

Best regards.

@bmorettodama changed the title from "Modifying config-daemon to exit if platform type cannot be determined." to "Re-try to get node object to determine platform type if we fail to get the object the first time." on Jul 5, 2023
@bmorettodama (Author) replied

> Please update the commit message to match the change. With this PR we still have the logic that assumes the baremetal platform by default.

PTAL

@adrianchiris (Collaborator) commented

Added this PR to next week's meeting agenda to see if we want to go forward with it.

My take is that it's not needed, as the config daemon would just restart if such a scenario happens. Let's see what the other maintainers think.

@adrianchiris (Collaborator) commented

Feedback from today's community meeting:

The preference is to avoid the proposed retry logic and to panic if getting the node object is mandatory. This will trigger the config daemon to restart, so it is expected to eventually recover.

In production the API server is usually deployed in an HA configuration, so long periods of API server unavailability should not happen. We would need a better reason to add a retry for the node object.
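For contrast, a sketch of the shape the maintainers preferred: no retry, just fail and let the DaemonSet restart the pod. The variable names and the use of the standard library logger are assumptions for the example:

    node, err := clientset.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
    if err != nil {
        // No retry: crash and let the DaemonSet restart the pod. kubelet applies
        // CrashLoopBackOff between restarts, so the daemon eventually recovers
        // once the apiserver is reachable again.
        log.Panicf("failed to get node %s: %v", nodeName, err)
    }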

@bmorettodama (Author) commented

Closing this, as we can rely on the config daemon restarting if it fails.
