This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

Use go-control-plane's NewServer() and SnapshotCache system to improve ADS #2683

Closed
Tracked by #3702
draychev opened this issue Mar 3, 2021 · 15 comments
Labels: area/envoy-control-plane, kind/refactor, priority/P1, size/XL (20 days / 4 weeks), sprint/1

@draychev (Contributor) commented Mar 3, 2021

Leverage go-control-plane's SnapshotCache.

The snapshot cache lets us set a single, consistently versioned snapshot of all xDS responses (per proxy) in one place, while the snapshot cache implementation maintains all the state and handles responding to the actual gRPC requests. It means we don't have to worry about implementation details such as delta vs. aggregated updates, streams vs. fetch, etc.
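To make the model concrete, here is a minimal plain-Go sketch of the idea: a per-proxy, single-versioned snapshot store. All names in this sketch (Snapshot, SnapshotCache, SetSnapshot) are illustrative stand-ins and not the real go-control-plane API, which additionally tracks open watches and answers the xDS gRPC streams itself.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot is a single-versioned bundle of xDS resources for one proxy.
// (Illustrative only: the real library carries typed Cluster/Listener/
// Route/Endpoint resources rather than plain strings.)
type Snapshot struct {
	Version   string
	Resources map[string][]string // type URL -> resource names, simplified
}

// SnapshotCache maps a node (proxy) ID to its latest snapshot. The real
// cache also tracks open watches and serves the gRPC streams itself.
type SnapshotCache struct {
	mu        sync.RWMutex
	snapshots map[string]Snapshot
}

func NewSnapshotCache() *SnapshotCache {
	return &SnapshotCache{snapshots: map[string]Snapshot{}}
}

// SetSnapshot atomically replaces everything the proxy should see.
func (c *SnapshotCache) SetSnapshot(nodeID string, s Snapshot) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.snapshots[nodeID] = s
}

func (c *SnapshotCache) GetSnapshot(nodeID string) (Snapshot, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.snapshots[nodeID]
	return s, ok
}

func main() {
	cache := NewSnapshotCache()
	cache.SetSnapshot("proxy-1", Snapshot{
		Version:   "v1",
		Resources: map[string][]string{"envoy.config.cluster.v3.Cluster": {"local-cluster"}},
	})
	s, _ := cache.GetSnapshot("proxy-1")
	fmt.Println(s.Version) // prints "v1"
}
```

In the real library, the server created by NewServer() reads from such a cache, so after config generation the controller's only job is one SetSnapshot call per proxy.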

From @shashankram: we have had issues in the past with the state logic for responding to proxy stream resource requests, and it is very difficult to debug. Offloading this to an upstream implementation should lighten our load substantially.

@eduser25 (Contributor)

Incoming WIP PR this afternoon!

@draychev draychev modified the milestones: v0.9.0, v0.10.0 May 5, 2021
@eduser25 (Contributor)

By the way, a working branch for this has been available since time immemorial (eduser25@f4de97b); development and exploration have been put on hold because there are some shortcomings in the ADS cache implementation of go-control-plane that have to be properly addressed first.

@draychev (Contributor, Author)

@eduser25 could you update this issue with details on the "shortcomings on the ADS cache implementation of go-control-plane"?

@eduser25 (Contributor)

context: envoyproxy/go-control-plane#399

@eduser25 (Contributor)

Update:
Initial SnapshotCache implementation support is merged: #3530
The issue with go-control-plane has been acknowledged and is being worked on: envoyproxy/go-control-plane#431

@allenlsy (Contributor) commented Aug 4, 2022

@steeling Can we use this issue as the root of subtasks of snapshot cache?

@steeling (Contributor)

@shashankram moving the conversation about the snapshot cache here. Looking at the github description https://github.com/envoyproxy/go-control-plane, I'm not entirely sure the snapshot cache is smart enough to do internal diffs to determine when a change is needed. Instead, we set a new version on each update (even if configs haven't changed), and I believe the snapshot cache will push out the requested changes.

So in that sense the snapshot cache will not reduce the load on the osm-controller. Happy to be wrong about this, but that's my understanding upon some brief reading
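One way to avoid pushing identical config would be to derive the snapshot version from the content itself, so regenerating an unchanged config yields an unchanged version and the cache has nothing new to send. A sketch of that idea follows; the scheme and names are hypothetical and not what OSM or go-control-plane actually do.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// versionFromContent derives the snapshot version from a content hash,
// so identical config produces an identical version string.
// (Hypothetical versioning scheme, shown only to illustrate the idea.)
func versionFromContent(resources map[string]string) string {
	keys := make([]string, 0, len(resources))
	for k := range resources {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic ordering before hashing
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0}) // separator so key/value boundaries are unambiguous
		h.Write([]byte(resources[k]))
	}
	return hex.EncodeToString(h.Sum(nil))[:12]
}

func main() {
	a := map[string]string{"cluster/local": "cfg-A"}
	b := map[string]string{"cluster/local": "cfg-A"} // same content
	c := map[string]string{"cluster/local": "cfg-B"} // changed content
	fmt.Println(versionFromContent(a) == versionFromContent(b)) // true
	fmt.Println(versionFromContent(a) == versionFromContent(c)) // false
}
```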

@shashankram (Member)

> @shashankram moving the conversation about the snapshot cache here. Looking at the github description https://github.com/envoyproxy/go-control-plane, I'm not entirely sure the snapshot cache is smart enough to do internal diffs to determine when a change is needed. Instead, we set a new version on each update (even if configs haven't changed), and I believe the snapshot cache will push out the requested changes.
>
> So in that sense the snapshot cache will not reduce the load on the osm-controller. Happy to be wrong about this, but that's my understanding upon some brief reading

Yes, I believe that's fine because the Envoy instance will compute the diff anyway. In that regard, the value of using the snapshot cache is to avoid using our controller implementation of the XDS state machine and let the cache deal with it. We've seen quite a few bugs in the past with the XDS state machine implementation in the controller, and using the snapshot cache will simplify what we need to worry about after generating the config. Moreover, I see value in simply writing the config to the cache and forgetting about what happens after that, which isn't the case currently.

@allenlsy allenlsy removed their assignment Aug 18, 2022
@steeling (Contributor)

> @shashankram moving the conversation about the snapshot cache here. Looking at the github description https://github.com/envoyproxy/go-control-plane, I'm not entirely sure the snapshot cache is smart enough to do internal diffs to determine when a change is needed. Instead, we set a new version on each update (even if configs haven't changed), and I believe the snapshot cache will push out the requested changes.
> So in that sense the snapshot cache will not reduce the load on the osm-controller. Happy to be wrong about this, but that's my understanding upon some brief reading
>
> Yes, I believe that's fine because the Envoy instance will compute the diff anyway. In that regard, the value of using the snapshot cache is to avoid using our controller implementation of the XDS state machine and let the cache deal with it. We've seen quite a few bugs in the past with the XDS state machine implementation in the controller, and using the snapshot cache will simplify what we need to worry about after generating the config. Moreover, I see value in simply writing the config to the cache and forgetting about what happens after that, which isn't the case currently.

Agreed! Curious for your opinion: once we migrate to the snapshot cache, should we keep the messaging.Broker to know when to trigger config generation on changes, or rely purely on a continuous loop that constantly triggers config generation?

@shashankram (Member)

> @shashankram moving the conversation about the snapshot cache here. Looking at the github description https://github.com/envoyproxy/go-control-plane, I'm not entirely sure the snapshot cache is smart enough to do internal diffs to determine when a change is needed. Instead, we set a new version on each update (even if configs haven't changed), and I believe the snapshot cache will push out the requested changes.
> So in that sense the snapshot cache will not reduce the load on the osm-controller. Happy to be wrong about this, but that's my understanding upon some brief reading
>
> Yes, I believe that's fine because the Envoy instance will compute the diff anyway. In that regard, the value of using the snapshot cache is to avoid using our controller implementation of the XDS state machine and let the cache deal with it. We've seen quite a few bugs in the past with the XDS state machine implementation in the controller, and using the snapshot cache will simplify what we need to worry about after generating the config. Moreover, I see value in simply writing the config to the cache and forgetting about what happens after that, which isn't the case currently.
>
> agreed! curious on your opinion on once we migrate to snapshot cache, whether we should keep the messaging.Broker, to know to trigger config generation on changes, or to purely rely on a continuous loop to constantly trigger config generation

I don't think we should have a continuous loop to trigger config generation, but rather a combination of an event-driven approach with coalescing, plus a periodic reconciler such as a Ticker if necessary to recover from transient inconsistencies in the system. Constantly generating config in a tight loop when there's no change in the system has the downside of both the controller and Envoy operating at 100% CPU all the time. This is especially evident in scale environments with 1000s of pods.

The broker is really meant to provide a streamlined mechanism for event pub-sub, with additional metrics and checks in place regarding which events are broadcast, multicast, or unicast. The broker is still necessary in that regard, given the snapshot cache isn't smart enough to drop events it doesn't need.
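A minimal sketch of the broadcast/unicast filtering role described for the broker; the Event shape and Broker API here are invented for illustration and do not mirror OSM's messaging package.

```go
package main

import (
	"fmt"
	"sync"
)

// Event kinds mirror the broadcast/multicast/unicast distinction the
// broker makes; the names are illustrative, not OSM's actual types.
type Event struct {
	Kind   string // "broadcast" or "unicast"
	Target string // subscriber ID for unicast events
}

// Broker fans events out to subscribers, dropping unicast events that
// are not addressed to a given subscriber.
type Broker struct {
	mu   sync.Mutex
	subs map[string]chan Event
}

func NewBroker() *Broker { return &Broker{subs: map[string]chan Event{}} }

func (b *Broker) Subscribe(id string) <-chan Event {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan Event, 8)
	b.subs[id] = ch
	return ch
}

func (b *Broker) Publish(e Event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for id, ch := range b.subs {
		if e.Kind == "unicast" && e.Target != id {
			continue // drop events this subscriber doesn't need
		}
		ch <- e
	}
}

func main() {
	b := NewBroker()
	p1 := b.Subscribe("proxy-1")
	p2 := b.Subscribe("proxy-2")
	b.Publish(Event{Kind: "broadcast"})
	b.Publish(Event{Kind: "unicast", Target: "proxy-1"})
	fmt.Println(len(p1), len(p2)) // prints "2 1": proxy-2 never saw the unicast
}
```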

@keithmattix (Contributor)

@steeling Does #5056 close this issue?

@steeling (Contributor) commented Sep 7, 2022

Yes it does!

@keithmattix (Contributor)

Fixed in #5056

Repository owner moved this from In Progress to Done in V1.3 Refactoring Sep 7, 2022