Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sotw][linear] Fix missing watch cleanup in linear cache for sotw watches subscribing to multiple resources #854

Closed

Conversation

valerian-roche
Copy link
Contributor

When using the linear cache in sotw mode, there are multiple issues when using non-wildcard requests
This PR addresses two issues potentially impactful:

  • the request provided to callbacks and the servers is a fake one not including the actual data. This is confusing and can generate issues (e.g. the node section is nil while user might expect it set when using the simple cache)
  • if the request is not wildcard, the cache stores the watch for each requested resource, but only clear it from the updated resources, leaving stale channels potentially full. If two updates come prior to the server consuming the channel or closing it fully, the code will block on the channel under the mutex and the server calling CreateWatch will also deadlock. As this mutex is common to all nodes the entire cache would then be deadlocked. The current implementation of sotw server makes it unlikely except if multiple updates are sent to the cache (e.g. iterating on UpdateResource instead of using UpdateResources). Any server implementation which would consider a watch properly cleaned if triggered (which is the initial design of the cache interface) would leak the channel and potentially trigger a deadlock later on

Other issues are not yet tackled in this PR as they depend for some on changes in the cache interface proposed in other PRs

  • properly trigger watches when a new resource is requested in the request but the version is unchanged
  • support for explicit wildcard request
    Other issues to be addressed in other PRs
  • resource ordering when using Mux Cache in ADS mode
  • full updates or partial updates in sotw depending on resource type (lds/cds with full update, partial for rds/eds) and not on cache/protocol type (current implementation)

…ches subscribing to multiple resources

Properly return the request in sotw responses to allow proper handling in callbacks

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
@@ -140,18 +140,18 @@ func (cache *LinearCache) respond(value chan Response, staleResources []string)

func (cache *LinearCache) notifyAll(modified map[string]struct{}) {
// de-duplicate watches that need to be responded
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment threw me off a bit when reading the code. It's not de-duplicating iiuc (there are no duplicates), it is constructing a mapping of watches -> list of resources names, no?

For linear cache/SOTW, here the list of strings in the mapping should simply be request.ResourceNames, no? Maybe it'd be clearer to just use this, since you have the request at heand in the ResponseWatch?

Copy link
Contributor Author

@valerian-roche valerian-roche Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is deduplicating through the map to send a simple reply with n resources rather than n replies with one
On whether to use request.ResourceNames, this would yield a different behavior as:

  • this is not normalized for wildcard parts for instance
  • can target resources which don't exist in the cache, and the respond implementation might not expect that
  • for eds/rds, we don't want to return all resources but only the ones matching modified

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes a difference in theory, the request contains all the resources you may need to respond to, so

  1. for CDS/LDS and wildcard you need to send all resources for this type
  2. for RDS/EDS/* you need the intersection of resource names in modified and what is in request.ResourceNames

}
for value := range cache.watchAll {
cache.respond(value, nil)
for watch := range cache.watchAll {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So watchAll are wildcards, right? I think it would be more clear to simply call that watchWildcards

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update that in a separate PR

@atollena
Copy link

atollena commented Jan 9, 2024

Other issues are not yet tackled in this PR as they depend for some on changes in the cache interface proposed in other PRs

properly trigger watches when a new resource is requested in the request but the version is unchanged
support for explicit wildcard request
Other issues to be addressed in other PRs
resource ordering when using Mux Cache in ADS mode
full updates or partial updates in sotw depending on resource type (lds/cds with full update, partial for rds/eds) and not on cache/protocol type (current implementation)

It would be nice to have tracking issues for all those things that you know don't actually work.

@valerian-roche
Copy link
Contributor Author

valerian-roche commented Jan 10, 2024

Other issues are not yet tackled in this PR as they depend for some on changes in the cache interface proposed in other PRs

properly trigger watches when a new resource is requested in the request but the version is unchanged
support for explicit wildcard request
Other issues to be addressed in other PRs
resource ordering when using Mux Cache in ADS mode
full updates or partial updates in sotw depending on resource type (lds/cds with full update, partial for rds/eds) and not on cache/protocol type (current implementation)

It would be nice to have tracking issues for all those things that you know don't actually work.

properly trigger watches when a new resource is requested in the request but the version is unchanged

#608

support for explicit wildcard request

#855

resource ordering when using Mux Cache in ADS mode

No issues for now. We need to review if we really want to support this as this would require a major rework of the mux cache

full updates or partial updates in sotw depending on resource type (lds/cds with full update, partial for rds/eds) and not on cache/protocol type (current implementation)

#540

@atollena
Copy link

atollena commented Jan 10, 2024

Thank you for creating the issues.

resource ordering when using Mux Cache in ADS mode

No issues for now. We need to review if we really want to support this as this would require a major rework of the mux cache

This should be documented alongside MuxCache at https://github.com/valerian-roche/go-control-plane/blob/a36df793fffbbbedb91c00b29f03fc4aa5b43f5e/pkg/cache/v3/mux.go#L22-L27.

For users it's possible to use a single cache if they need ordering, as iiuc a single linear cache can handle multiple resource types. Actually linear documents that it is for a single resource type, and at least the wildcard handling isn't per type so. So it's not possible to use linear cache and benefit from ADS resource ordering guarantees. I don't think this causes issues with gRPC, not sure about Envoy.

@valerian-roche
Copy link
Contributor Author

Actually linear documents that it is for a single resource type, and at least the wildcard handling isn't per type so. So it's not possible to use linear cache and benefit from ADS resource ordering guarantees. I don't think this causes issues with gRPC, not sure about Envoy.

Currently when using Linear users will set resources in the different caches at different times, so we can hardly take ordering decisions in watches as they'll trigger based on the user Update ordering. Using a Mux with multiple linear caches with envoy ADS will likely create issues if the user does not insert data in specific order, but if the order is respected it should work

Copy link

github-actions bot commented Feb 9, 2024

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale label Feb 9, 2024
Copy link

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Feb 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants