This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

Use go-control-plane's NewServer() and SnapshotCache system to improve ADS #2683

Closed
Tracked by #3702
draychev opened this issue Mar 3, 2021 · 15 comments
Labels: area/envoy-control-plane, kind/refactor, priority/P1, size/XL (20 days / 4 weeks), sprint/1

@draychev (Contributor) commented Mar 3, 2021

Leverage go-control-plane's SnapshotCache.

The snapshot cache lets us set a single, consistently versioned snapshot of all xDS responses (per proxy) in one place, while the snapshot cache implementation maintains all the state and handles responding to the actual gRPC requests. It means we don't have to worry about implementation details such as delta vs. aggregated updates, streams vs. fetch, etc.
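To make the model concrete, here is a minimal plain-Go sketch of the idea: a per-proxy, single-versioned snapshot store. All names in this sketch (Snapshot, SnapshotCache, SetSnapshot) are illustrative stand-ins and not the real go-control-plane API, which additionally tracks open watches and answers the xDS gRPC streams itself.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot is a single-versioned bundle of xDS resources for one proxy.
// (Illustrative only: the real library carries typed Cluster/Listener/
// Route/Endpoint resources rather than plain strings.)
type Snapshot struct {
	Version   string
	Resources map[string][]string // type URL -> resource names, simplified
}

// SnapshotCache maps a node (proxy) ID to its latest snapshot. The real
// cache also tracks open watches and serves the gRPC streams itself.
type SnapshotCache struct {
	mu        sync.RWMutex
	snapshots map[string]Snapshot
}

func NewSnapshotCache() *SnapshotCache {
	return &SnapshotCache{snapshots: map[string]Snapshot{}}
}

// SetSnapshot atomically replaces everything the proxy should see.
func (c *SnapshotCache) SetSnapshot(nodeID string, s Snapshot) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.snapshots[nodeID] = s
}

func (c *SnapshotCache) GetSnapshot(nodeID string) (Snapshot, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.snapshots[nodeID]
	return s, ok
}

func main() {
	cache := NewSnapshotCache()
	cache.SetSnapshot("proxy-1", Snapshot{
		Version:   "v1",
		Resources: map[string][]string{"envoy.config.cluster.v3.Cluster": {"local-cluster"}},
	})
	s, _ := cache.GetSnapshot("proxy-1")
	fmt.Println(s.Version) // prints "v1"
}
```

In the real library, the server created by NewServer() reads from such a cache, so after config generation the controller's only job is one SetSnapshot call per proxy.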

From @shashankram: we have had issues in the past with the state logic for responding to proxy stream resource requests, and it is very difficult to debug. Offloading this to an upstream implementation should lighten our load substantially.

@eduser25 (Contributor)

Incoming WIP PR this afternoon!

@draychev draychev modified the milestones: v0.9.0, v0.10.0 May 5, 2021
@eduser25 (Contributor)

By the way, a working branch for this has been available since time immemorial (eduser25@f4de97b); development and exploration have been put on hold because there are some shortcomings in the ADS cache implementation of go-control-plane that have to be properly addressed first.

@draychev (Contributor, Author)

@eduser25 could you update this issue with details on the "shortcomings on the ADS cache implementation of go-control-plane"?

@eduser25 (Contributor)

context: envoyproxy/go-control-plane#399

@eduser25 (Contributor)

Update:
Initial SnapshotCache implementation support is merged: #3530
The issue with go-control-plane has been acknowledged and is being worked on: envoyproxy/go-control-plane#431

@allenlsy (Contributor) commented Aug 4, 2022

@steeling Can we use this issue as the root of subtasks of snapshot cache?

@steeling (Contributor)

@shashankram moving the conversation about the snapshot cache here. Looking at the github description https://github.com/envoyproxy/go-control-plane, I'm not entirely sure the snapshot cache is smart enough to do internal diffs to determine when a change is needed. Instead, we set a new version on each update (even if configs haven't changed), and I believe the snapshot cache will push out the requested changes.

So in that sense the snapshot cache will not reduce the load on the osm-controller. Happy to be wrong about this, but that's my understanding upon some brief reading
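One way to avoid pushing identical config would be to derive the snapshot version from the content itself, so regenerating an unchanged config yields an unchanged version and the cache has nothing new to send. A sketch of that idea follows; the scheme and names are hypothetical and not what OSM or go-control-plane actually do.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// versionFromContent derives the snapshot version from a content hash,
// so identical config produces an identical version string.
// (Hypothetical versioning scheme, shown only to illustrate the idea.)
func versionFromContent(resources map[string]string) string {
	keys := make([]string, 0, len(resources))
	for k := range resources {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic ordering before hashing
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0}) // separator so key/value boundaries are unambiguous
		h.Write([]byte(resources[k]))
	}
	return hex.EncodeToString(h.Sum(nil))[:12]
}

func main() {
	a := map[string]string{"cluster/local": "cfg-A"}
	b := map[string]string{"cluster/local": "cfg-A"} // same content
	c := map[string]string{"cluster/local": "cfg-B"} // changed content
	fmt.Println(versionFromContent(a) == versionFromContent(b)) // true
	fmt.Println(versionFromContent(a) == versionFromContent(c)) // false
}
```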

@shashankram (Member)

> @shashankram moving the conversation about the snapshot cache here. Looking at the github description https://github.com/envoyproxy/go-control-plane, I'm not entirely sure the snapshot cache is smart enough to do internal diffs to determine when a change is needed. Instead, we set a new version on each update (even if configs haven't changed), and I believe the snapshot cache will push out the requested changes.
>
> So in that sense the snapshot cache will not reduce the load on the osm-controller. Happy to be wrong about this, but that's my understanding upon some brief reading

Yes, I believe that's fine because the Envoy instance will compute the diff anyway. In that regard, the value of using the snapshot cache is to avoid using our controller implementation of the XDS state machine and let the cache deal with it. We've seen quite a few bugs in the past with the XDS state machine implementation in the controller, and using the snapshot cache will simplify what we need to worry about after generating the config. Moreover, I see value in simply writing the config to the cache and forgetting about what happens after that, which isn't the case currently.

@allenlsy allenlsy removed their assignment Aug 18, 2022
@steeling (Contributor)

> @shashankram moving the conversation about the snapshot cache here. Looking at the github description https://github.com/envoyproxy/go-control-plane, I'm not entirely sure the snapshot cache is smart enough to do internal diffs to determine when a change is needed. Instead, we set a new version on each update (even if configs haven't changed), and I believe the snapshot cache will push out the requested changes.
> So in that sense the snapshot cache will not reduce the load on the osm-controller. Happy to be wrong about this, but that's my understanding upon some brief reading
>
> Yes, I believe that's fine because the Envoy instance will compute the diff anyway. In that regard, the value of using the snapshot cache is to avoid using our controller implementation of the XDS state machine and let the cache deal with it. We've seen quite a few bugs in the past with the XDS state machine implementation in the controller, and using the snapshot cache will simplify what we need to worry about after generating the config. Moreover, I see value in simply writing the config to the cache and forgetting about what happens after that, which isn't the case currently.

Agreed! Curious for your opinion: once we migrate to the snapshot cache, should we keep the messaging.Broker to know when to trigger config generation on changes, or rely purely on a continuous loop that constantly triggers config generation?

@shashankram (Member)

> @shashankram moving the conversation about the snapshot cache here. Looking at the github description https://github.com/envoyproxy/go-control-plane, I'm not entirely sure the snapshot cache is smart enough to do internal diffs to determine when a change is needed. Instead, we set a new version on each update (even if configs haven't changed), and I believe the snapshot cache will push out the requested changes.
> So in that sense the snapshot cache will not reduce the load on the osm-controller. Happy to be wrong about this, but that's my understanding upon some brief reading
>
> Yes, I believe that's fine because the Envoy instance will compute the diff anyway. In that regard, the value of using the snapshot cache is to avoid using our controller implementation of the XDS state machine and let the cache deal with it. We've seen quite a few bugs in the past with the XDS state machine implementation in the controller, and using the snapshot cache will simplify what we need to worry about after generating the config. Moreover, I see value in simply writing the config to the cache and forgetting about what happens after that, which isn't the case currently.
>
> agreed! curious on your opinion on once we migrate to snapshot cache, whether we should keep the messaging.Broker, to know to trigger config generation on changes, or to purely rely on a continuous loop to constantly trigger config generation

I don't think we should have a continuous loop to trigger config generation, but rather a combination of an event-driven approach with coalescing, plus a periodic reconciler such as a Ticker if necessary to recover from transient inconsistencies in the system. Constantly generating config in a tight loop when there's no change in the system has the downside of both the controller and Envoy operating at 100% CPU all the time. This is especially evident in scale environments with 1000s of pods.

The broker is really meant to provide a streamlined mechanism for event pub-sub, with additional metrics and checks in place regarding which events are broadcast, multicast, or unicast. The broker is still necessary in that regard, given the snapshot cache isn't smart enough to drop events it doesn't need.
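A minimal sketch of the broadcast/unicast filtering role described for the broker; the Event shape and Broker API here are invented for illustration and do not mirror OSM's messaging package.

```go
package main

import (
	"fmt"
	"sync"
)

// Event kinds mirror the broadcast/multicast/unicast distinction the
// broker makes; the names are illustrative, not OSM's actual types.
type Event struct {
	Kind   string // "broadcast" or "unicast"
	Target string // subscriber ID for unicast events
}

// Broker fans events out to subscribers, dropping unicast events that
// are not addressed to a given subscriber.
type Broker struct {
	mu   sync.Mutex
	subs map[string]chan Event
}

func NewBroker() *Broker { return &Broker{subs: map[string]chan Event{}} }

func (b *Broker) Subscribe(id string) <-chan Event {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan Event, 8)
	b.subs[id] = ch
	return ch
}

func (b *Broker) Publish(e Event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for id, ch := range b.subs {
		if e.Kind == "unicast" && e.Target != id {
			continue // drop events this subscriber doesn't need
		}
		ch <- e
	}
}

func main() {
	b := NewBroker()
	p1 := b.Subscribe("proxy-1")
	p2 := b.Subscribe("proxy-2")
	b.Publish(Event{Kind: "broadcast"})
	b.Publish(Event{Kind: "unicast", Target: "proxy-1"})
	fmt.Println(len(p1), len(p2)) // prints "2 1": proxy-2 never saw the unicast
}
```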

@keithmattix (Contributor)

@steeling Does #5056 close this issue?

@steeling (Contributor) commented Sep 7, 2022

Yes it does!

@keithmattix (Contributor)

Fixed in #5056

Repository owner moved this from In Progress to Done in V1.3 Refactoring Sep 7, 2022