-
Notifications
You must be signed in to change notification settings - Fork 999
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase default lookup_timeout
from 250ms to 5s
#1077
Conversation
Thanks for the contribution! Before we can merge this, we need @maggie44 to sign the Salesforce Inc. Contributor License Agreement. |
Hi @maggie44 - Thanks for the contribution. We discussed this a bit and the reason that this timeout is set lower is so that one bad resolver / address can't hold up the entire DNS sub-routine. You can see here that each DNS address is processed in serial: https://github.com/slackhq/nebula/blob/master/remote_list.go#L124-L135 While it doesn't block the hot path, it could result in a longer period to establish connections to the Lighthouse, if an early address is slow to resolve. Ultimately, this is configurable so that you can increase the timeout if necessary. We're going to leave the default as-is for the time being - that said, if others continue to run into this, we are open to revisiting the default. Cheers! |
Hi @johnmaguire My suggestion is to report an "error" only when all static hosts are unresolvable/unreachable. Otherwise, a "warning" should be enough. I believe the relevant code is around this line: Line 122 in 50b24c1
What are your thoughts on this? |
@SamSirry Hey there, sorry for the delay. For future reference, it's difficult to keep track of questions/feedback in comments of closed issues - it's probably best to file your own issue and we can follow up and close it if necessary. I think you raise a good point that the log level here may be too high. I plan on raising this with the rest of the team next week to get their thoughts. |
I think the proposal by @SamSirry was to not log an error if at least 1 lighthouse is reached. Rather than to change the log level. I don't see how either changing the log level or only reporting errors if all lighthouses are unreachable would make any sense. A timeout is an unexpected failure of a part of the system. There is no doubt it is an error and errors should be made visible and actioned rather than hidden. Just because another lighthouse is working doesn't mean it will forever, an error should signify a lack of a redundancy. |
@maggie44 I understood the proposal to be changing the log level to a warning unless all DNS names for a given static host were unresolvable, in which case an error should be logged. (e.g. an internal DNS name only available on LAN and an external public DNS name) I agree that any single unreachable static host is an error - it is not only used for lighthouses. It is possible to run a Lighthouse-less network through the static_host_map, for example. |
I get a lot of errors in the logs like these, particularly when starting the service for the first time:
When using multiple lighthouses some of them resolve ok, others timeout, and eventually they all resolve on future loops. The DNS resolution is working but the timeout is sometimes being reached before it has a time to finish.
The timeout for these requests is current set at
250ms
. This is extremely low and can't see any reason why.Here are some example defaults from elsewhere for some precedent:
https://github.com/istio/istio/blob/ac901c3ed1a2455705709bd5e81df781d7a63083/pilot/pkg/util/network/ip.go#L145
https://github.com/tailscale/tailscale/blob/a4a909a20b0f868de4870294e200e803f61589f7/ipn/localapi/debugderp.go#L161
This PR raises the default timeout to 5s.