Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase default lookup_timeout from 250ms to 5s #1077

Closed
wants to merge 1 commit into from

Conversation

maggie44
Copy link

I get a lot of errors in the logs like these, particularly when starting the service for the first time:

ERRO[0000] DNS resolution failed for static_map host     error="lookup example.com: i/o timeout" hostname=example.com network=ip4

When using multiple lighthouses some of them resolve ok, others timeout, and eventually they all resolve on future loops. The DNS resolution is working but the timeout is sometimes being reached before it has a time to finish.

The timeout for these requests is current set at 250ms. This is extremely low and can't see any reason why.

Here are some example defaults from elsewhere for some precedent:

https://github.com/istio/istio/blob/ac901c3ed1a2455705709bd5e81df781d7a63083/pilot/pkg/util/network/ip.go#L145
https://github.com/tailscale/tailscale/blob/a4a909a20b0f868de4870294e200e803f61589f7/ipn/localapi/debugderp.go#L161

This PR raises the default timeout to 5s.

Copy link

Thanks for the contribution! Before we can merge this, we need @maggie44 to sign the Salesforce Inc. Contributor License Agreement.

@wadey wadey added this to the v1.9.0 milestone Mar 18, 2024
@wadey wadey added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Mar 18, 2024
@johnmaguire
Copy link
Collaborator

Hi @maggie44 -

Thanks for the contribution. We discussed this a bit and the reason that this timeout is set lower is so that one bad resolver / address can't hold up the entire DNS sub-routine.

You can see here that each DNS address is processed in serial: https://github.com/slackhq/nebula/blob/master/remote_list.go#L124-L135

While it doesn't block the hot path, it could result in a longer period to establish connections to the Lighthouse, if an early address is slow to resolve.

Ultimately, this is configurable so that you can increase the timeout if necessary. We're going to leave the default as-is for the time being - that said, if others continue to run into this, we are open to revisiting the default.

Cheers!

@johnmaguire johnmaguire closed this Apr 1, 2024
@wadey wadey removed this from the v1.9.0 milestone Apr 30, 2024
@SamSirry
Copy link

Hi @johnmaguire
I came here after noticing numerous error entries in the journal for "DNS resolution failed for static_map host".
While I can adjust the timeout value in the config file (static_map.lookup_timeout), my concern is that an unharmful timeout event still appears in the log as an "error", and it's still cluttering the log while the logging level is set to "error", which draws unnecessary attention.

My suggestion is to report an "error" only when all static hosts are unresolvable/unreachable. Otherwise, a "warning" should be enough.

I believe the relevant code is around this line:

for {

What are your thoughts on this?

@johnmaguire
Copy link
Collaborator

@SamSirry Hey there, sorry for the delay. For future reference, it's difficult to keep track of questions/feedback in comments of closed issues - it's probably best to file your own issue and we can follow up and close it if necessary.

I think you raise a good point that the log level here may be too high. I plan on raising this with the rest of the team next week to get their thoughts.

@maggie44
Copy link
Author

maggie44 commented May 31, 2024

I think the proposal by @SamSirry was to not log an error if at least 1 lighthouse is reached. Rather than to change the log level.

I don't see how either changing the log level or only reporting errors if all lighthouses are unreachable would make any sense. A timeout is an unexpected failure of a part of the system. There is no doubt it is an error and errors should be made visible and actioned rather than hidden. Just because another lighthouse is working doesn't mean it will forever, an error should signify a lack of a redundancy.

@johnmaguire
Copy link
Collaborator

@maggie44 I understood the proposal to be changing the log level to a warning unless all DNS names for a given static host were unresolvable, in which case an error should be logged. (e.g. an internal DNS name only available on LAN and an external public DNS name)

I agree that any single unreachable static host is an error - it is not only used for lighthouses. It is possible to run a Lighthouse-less network through the static_host_map, for example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:signed NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants