Increase default `lookup_timeout` from 250ms to 5s #1077

maggie44 · 2024-02-19T19:57:03Z

I get a lot of errors in the logs like these, particularly when starting the service for the first time:

ERRO[0000] DNS resolution failed for static_map host     error="lookup example.com: i/o timeout" hostname=example.com network=ip4

When using multiple lighthouses some of them resolve ok, others timeout, and eventually they all resolve on future loops. The DNS resolution is working but the timeout is sometimes being reached before it has a time to finish.

The timeout for these requests is current set at 250ms. This is extremely low and can't see any reason why.

Here are some example defaults from elsewhere for some precedent:

https://github.com/istio/istio/blob/ac901c3ed1a2455705709bd5e81df781d7a63083/pilot/pkg/util/network/ip.go#L145
https://github.com/tailscale/tailscale/blob/a4a909a20b0f868de4870294e200e803f61589f7/ipn/localapi/debugderp.go#L161

This PR raises the default timeout to 5s.

salesforce-cla · 2024-02-19T19:57:07Z

Thanks for the contribution! Before we can merge this, we need @maggie44 to sign the Salesforce Inc. Contributor License Agreement.

johnmaguire · 2024-04-01T18:22:21Z

Hi @maggie44 -

Thanks for the contribution. We discussed this a bit and the reason that this timeout is set lower is so that one bad resolver / address can't hold up the entire DNS sub-routine.

You can see here that each DNS address is processed in serial: https://github.com/slackhq/nebula/blob/master/remote_list.go#L124-L135

While it doesn't block the hot path, it could result in a longer period to establish connections to the Lighthouse, if an early address is slow to resolve.

Ultimately, this is configurable so that you can increase the timeout if necessary. We're going to leave the default as-is for the time being - that said, if others continue to run into this, we are open to revisiting the default.

Cheers!

SamSirry · 2024-05-24T02:06:49Z

Hi @johnmaguire
I came here after noticing numerous error entries in the journal for "DNS resolution failed for static_map host".
While I can adjust the timeout value in the config file (static_map.lookup_timeout), my concern is that an unharmful timeout event still appears in the log as an "error", and it's still cluttering the log while the logging level is set to "error", which draws unnecessary attention.

My suggestion is to report an "error" only when all static hosts are unresolvable/unreachable. Otherwise, a "warning" should be enough.

I believe the relevant code is around this line:

nebula/remote_list.go

Line 122 in 50b24c1

for {

What are your thoughts on this?

johnmaguire · 2024-05-31T19:13:20Z

@SamSirry Hey there, sorry for the delay. For future reference, it's difficult to keep track of questions/feedback in comments of closed issues - it's probably best to file your own issue and we can follow up and close it if necessary.

I think you raise a good point that the log level here may be too high. I plan on raising this with the rest of the team next week to get their thoughts.

maggie44 · 2024-05-31T21:48:15Z

I think the proposal by @SamSirry was to not log an error if at least 1 lighthouse is reached. Rather than to change the log level.

I don't see how either changing the log level or only reporting errors if all lighthouses are unreachable would make any sense. A timeout is an unexpected failure of a part of the system. There is no doubt it is an error and errors should be made visible and actioned rather than hidden. Just because another lighthouse is working doesn't mean it will forever, an error should signify a lack of a redundancy.

johnmaguire · 2024-05-31T21:52:08Z

@maggie44 I understood the proposal to be changing the log level to a warning unless all DNS names for a given static host were unresolvable, in which case an error should be logged. (e.g. an internal DNS name only available on LAN and an external public DNS name)

I agree that any single unreachable static host is an error - it is not only used for lighthouses. It is possible to run a Lighthouse-less network through the static_host_map, for example.

Increase default lookup_timeout from 250ms to 5s

dce8888

salesforce-cla bot added the cla:missing label Feb 19, 2024

salesforce-cla bot added cla:signed and removed cla:missing labels Feb 19, 2024

wadey added this to the v1.9.0 milestone Mar 18, 2024

wadey added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Mar 18, 2024

johnmaguire closed this Apr 1, 2024

wadey removed this from the v1.9.0 milestone Apr 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase default `lookup_timeout` from 250ms to 5s #1077

Increase default `lookup_timeout` from 250ms to 5s #1077

maggie44 commented Feb 19, 2024

salesforce-cla bot commented Feb 19, 2024

johnmaguire commented Apr 1, 2024

SamSirry commented May 24, 2024

johnmaguire commented May 31, 2024

maggie44 commented May 31, 2024 •

edited

Loading

johnmaguire commented May 31, 2024

Increase default lookup_timeout from 250ms to 5s #1077

Increase default lookup_timeout from 250ms to 5s #1077

Conversation

maggie44 commented Feb 19, 2024

salesforce-cla bot commented Feb 19, 2024

johnmaguire commented Apr 1, 2024

SamSirry commented May 24, 2024

johnmaguire commented May 31, 2024

maggie44 commented May 31, 2024 • edited Loading

johnmaguire commented May 31, 2024

Increase default `lookup_timeout` from 250ms to 5s #1077

Increase default `lookup_timeout` from 250ms to 5s #1077

maggie44 commented May 31, 2024 •

edited

Loading