Re-try to get node object to determine platform type if we fail to get the object the first time. #459
Conversation
Thanks for your PR.
To skip the vendors CIs, use one of:
/test-all
/LGTM
@@ -150,17 +152,34 @@ func runStartCmd(cmd *cobra.Command, args []string) {
		destdir = "/host/tmp"
	}

+	platformType := utils.Baremetal
+	backoff := wait.Backoff{
Not assuming the platform type makes sense.
I don't like doing retries if they're not needed. This Get call is generally expected to work; we should panic otherwise, as is done in other places here.
We would like to retry if possible, maybe with a shorter timeout if that works better for you. In managed environments, the apiserver may be undergoing maintenance or other events that make it temporarily unavailable, and it would be a better experience if we retried instead of panicking right away.
I still don't understand why this can't panic then; since it's deployed as a DaemonSet, it would restart.
The only reason I can think of is to avoid hitting exponential backoff on the daemon restart from the k8s API: the API will be back up, but it will take a long time until kubelet starts the config daemon again because it has failed many times.
In production I would run the k8s API with HA, so I don't expect this to happen often.
Please update the commit message to correspond with the change. With this PR we still have the logic to assume the baremetal platform by default.
PTAL
Added this PR to next week's meeting agenda to see if we want to go forward with it. My take is that it's not needed, as the config daemon would just restart if such a scenario happens.
Feedback from today's community meeting: the preference is to avoid the proposed retry logic and panic if getting the node object is mandatory. The API server in production is usually deployed in an HA configuration, so long periods of the API server being unavailable should not happen. We would need a better reason for adding a retry for the node object.
Closing this as we can rely on the config daemon restarting if it fails.
With the current code, the config daemon exits and panics if it fails to get the node object used to determine the platform type. We should instead retry before giving up. In managed environments, the apiserver may be undergoing maintenance or other events that make it temporarily unavailable.