
Watcher infrastructure for remote clusters and arbitrary target kinds #201

Merged: 7 commits merged into main from watcher-infrastructure, Nov 1, 2023

Conversation

@squaremo (Contributor) commented Oct 11, 2023

Description

This PR fills in the gaps deliberately left in #200. That is, it adds support for targets in remote clusters, and targets with arbitrary APIVersion and Kind. After this change, the controller maintains a set of Kubernetes API client caches -- one for each {cluster, type} (where type is the group, version, and kind of the target) -- which serve both for querying target status and for being informed of updates to targets.

Most of the substance of this change is behind watchTargetAndGetReader, which is the entry point to using the caches. Its contract is that it will both 1. supply a client which can read the given target; and 2. make sure the target has a watch looking at it, so that when the target changes, the pipeline using it can be re-examined.
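
As a minimal sketch of the shape this implies (the names and signatures here are illustrative, not the PR's actual code):

```go
package watcher

import (
	"context"
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cacheKey identifies one cache per {cluster, type}; an empty cluster name
// can stand for the local cluster.
type cacheKey struct {
	cluster types.NamespacedName
	gvk     schema.GroupVersionKind
}

// caches keeps track of every cache created so far, so the same
// {cluster, type} pair never gets a second cache.
type caches struct {
	cachesMu  sync.Mutex
	cachesMap map[cacheKey]cache.Cache
}

// watchTargetAndGetReader makes sure there is a running cache (and therefore
// a watch) for the target's cluster and type, and returns a reader that can
// fetch the target object. The body is elided here; see the PR for the real
// logic.
func (c *caches) watchTargetAndGetReader(ctx context.Context, cluster types.NamespacedName, target client.Object) (client.Reader, error) {
	// 1. compute the cacheKey from the cluster and the target's GVK
	// 2. find, or create and start, the cache for that key
	// 3. call GetInformer(ctx, target) on the cache so the type is watched
	// 4. return the cache, which implements client.Reader
	return nil, nil
}
```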

This involves a few new pieces of machinery:

  • keeping track of all the caches that have been created, so we don't create a new one each time we're asked;
  • running caches (they need their own goroutines) and gracefully shutting them down;
  • dismantling caches that are no longer needed;
  • figuring out which pipeline a target change pertains to, so it can be queued for reconciliation.

Individual commit messages (especially e738f65 and f980554), and comments in the code, explain how these work in detail.

The changes to the reconciliation code are small -- basically,

  • use caches.watchTargetAndGetReader to obtain a client for the right cluster, when examining a target (this replaces getClusterClient, which was introduced as a stub in #200 (Minimal level-triggered controller) to make room for this change)
  • fetch unstructured.Unstructured (dynamic client) values rather than API package typed values to represent target objects, so we can target arbitrary types without needing their API packages at compile time
  • calculate the target status from those same unstructured.Unstructured values, which is a bit more laborious but supports any type with the right fields (see the sketch below)
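
For illustration, fetching a target as an `unstructured.Unstructured` and reading a Ready-type condition out of it might look roughly like this (a sketch only; the PR's status calculation may differ in detail):

```go
import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// targetReady fetches an object of an arbitrary GVK and reports whether it
// has a Ready condition with status "True".
func targetReady(ctx context.Context, reader client.Reader, gvk schema.GroupVersionKind, key client.ObjectKey) (bool, error) {
	// No typed API package needed: ask for whatever GVK the pipeline names.
	obj := &unstructured.Unstructured{}
	obj.SetGroupVersionKind(gvk)
	if err := reader.Get(ctx, key, obj); err != nil {
		return false, err
	}

	// Dig the conditions out of .status by hand -- the "more laborious" part
	// compared to using typed structs.
	conditions, found, err := unstructured.NestedSlice(obj.Object, "status", "conditions")
	if err != nil || !found {
		return false, err
	}
	for _, item := range conditions {
		cond, ok := item.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "Ready" {
			return cond["status"] == "True", nil
		}
	}
	return false, nil
}
```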

There are a few incidental changes that are a consequence of updating modules so that I can import Cluster API, and extra files in config/testdata/crds so as to be able to test Kustomization objects and arbitrary types as targets.

Verifying it works

I added tests that demonstrate that

  • you can specify a target in a remote cluster and it all still works
  • when no pipeline needs a cache any more, it is torn down
  • pipelines can target Kustomization objects (a target kind other than HelmRelease)

Once this and the promotion algorithm (#203) have been merged, or perhaps when one is rebased on the other, we will be ready to actually try it.

Effect of this PR

Closes #196.

Checklist (from #196)

@squaremo self-assigned this Oct 11, 2023
@squaremo force-pushed the watcher-infrastructure branch 3 times, most recently from 5a5c11a to a4b3af3 on October 18, 2023 16:21
@squaremo force-pushed the watcher-infrastructure branch 3 times, most recently from 4a08cad to 8a4f108 on October 24, 2023 15:55
@squaremo marked this pull request as ready for review October 24, 2023 16:14

Review thread on this code in the PR diff:

// having done all that, did we really need it?
c.cachesMu.Lock()
if cacheEntry, cacheFound = c.cachesMap[cacheKey]; !cacheFound {
Contributor (@luizbafilho):

is it your question about getting the cache entry again? It's not clear why you are doing that.

Contributor Author (@squaremo):
Here I'm trying to avoid holding the lock while doing I/O. So, it checks whether the new cache is needed, and if so, releases the lock before going off to find cluster secrets with which to construct a client. The lock is reacquired in this line, and since this might be interleaved with some other process constructing caches, the presence of a cache is checked again before the map is written to.

Alternatives are

  1. hold the lock for an indefinite period while tracking down a client config;
  2. use an RWLock and "upgrade" from a read lock to a write lock if necessary.

Of these, 2 is pretty close to what's implemented, and might be a useful refinement (it would mean there's no exclusive lock for the "hot" path, where there's already a cache). 1 is dubious because it would mean locking the whole map exclusively while doing network I/O, during which no other work can make progress. I think anything other than those two has more complication for no particular benefit, but I can be persuaded :-)
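
In code, the pattern described is essentially double-checked locking around the cache map. A rough sketch, reusing the field names from the snippet under review (buildCacheFor is a hypothetical helper standing in for the secret lookup and cache construction):

```go
func (c *caches) getOrCreateCache(ctx context.Context, key cacheKey) (cache.Cache, error) {
	// Fast path: is there already a cache for this {cluster, type}?
	c.cachesMu.Lock()
	if entry, found := c.cachesMap[key]; found {
		c.cachesMu.Unlock()
		return entry, nil
	}
	c.cachesMu.Unlock()

	// Slow path: release the lock while doing I/O (fetching cluster secrets,
	// building a REST config, constructing the cache).
	newCache, err := c.buildCacheFor(ctx, key) // hypothetical; network I/O happens here
	if err != nil {
		return nil, err
	}

	// Having done all that, did we really need it? Another goroutine may have
	// created a cache for the same key while the lock was released.
	c.cachesMu.Lock()
	defer c.cachesMu.Unlock()
	if entry, found := c.cachesMap[key]; found {
		// Lost the race: use the existing cache and discard ours.
		return entry, nil
	}
	c.cachesMap[key] = newCache
	return newCache, nil
}
```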

Contributor Author (@squaremo):
Do you think a clearer comment is needed, @luizbafilho?

Contributor (@luizbafilho):
That's fine. Those calls can potentially take a long time.

@luizbafilho (Contributor) left a comment:

That looks great. There is a lot of new stuff I've never used, but it was good to learn new tricks.

No asks, just minor questions

@squaremo force-pushed the watcher-infrastructure branch 3 times, most recently from e3641cb to 4af195c on November 1, 2023 12:06
This is not quite the most recent version of controller-runtime, but
it's recent enough, and it's the version compatible with
sigs.k8s.io/cluster-api, which I would like to introduce. In newer
releases, there are some problems with kubeyaml's openapi package that
I don't want to deal with right now.

The changes to patch up are:

 - the builder package uses `Watch(client.Object, ...)` rather than
   `Watch(source.Source, ...)`
 - `handler.MapFunc` now takes a `context.Context` as the first
   argument, so you don't have to use `context.Background`
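
For example, a map function under the newer signature looks something like this (illustrative; assumes a controller-runtime release in the v0.15/v0.16 range):

```go
import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// mapToRequest receives the context as its first argument, so there is no
// need to reach for context.Background inside the mapping function.
var mapToRequest handler.MapFunc = func(ctx context.Context, obj client.Object) []reconcile.Request {
	return []reconcile.Request{{
		NamespacedName: types.NamespacedName{Namespace: obj.GetNamespace(), Name: obj.GetName()},
	}}
}
```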

There's also some change to how Kubernetes events get stringified,
which broke a test (in notification_test.go).

Signed-off-by: Michael Bridgen <michael.bridgen@weave.works>
The tests make their own `PipelineReconciler` struct, but there's a
constructor `NewPipelineReconciler` which could do some
initialisation, and that will be missed. So: use the constructor in
suite_test.go.

However! This now fails to initialise the eventRecorder, leaving it to
default in `SetupWithManager`, and capturing events is part of the
tests.

I don't want to disturb existing behaviour*, so I've kept this change
to the level-triggered controller.

*How does it change? The event recorder is constructed using a
notification-controller URL and used _only_ for promotion
notifications. The event recorder for things that happen in the
reconciler is left to default, which means it won't send those to the
notification-controller. In practice it's unlikely to matter, since
any alerts set up should filter for the promotion events.

Signed-off-by: Michael Bridgen <michael.bridgen@weave.works>
If you use the Gomega or *testing.T from the outer scope, passes and
fails are not reported against individual tests correctly.
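
In Gomega terms, that means binding a fresh assertion wrapper to each subtest's *testing.T rather than closing over the outer one; for instance:

```go
import (
	"testing"

	. "github.com/onsi/gomega"
)

func TestExample(t *testing.T) {
	t.Run("per-subtest Gomega", func(t *testing.T) {
		// Bind Gomega to this subtest's *testing.T, so passes and failures
		// are reported against this subtest rather than the outer test.
		g := NewWithT(t)
		g.Expect(1 + 1).To(Equal(2))
	})
}
```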

Signed-off-by: Michael Bridgen <michael.bridgen@weave.works>
This commit adds the infrastructure for watching and querying
arbitrarily-typed pipeline targets, in remote clusters as well as the
local cluster.

The basic shape is this: for each target that needs to be examined,
the reconciler uses `watchTargetAndGetReader(..., target)`. This
procedure encapsulates the detail of making sure there's a cache for
the target's cluster and type, and supplies the client.Reader needed
for fetching the target object.

A `cache.Cache` is kept for each {cluster, type}. `cache.Cache` is the
smallest piece of machinery that can be torn down, because the next
layer down, `Informer` objects, can't be removed once created. This is
important for being able to stop watching targets when they are no
longer targets.

Target object updates will come from all the caches, which come and
(in principle) go; but, the handler must be statically installed in
SetupWithManager(). So, targets are looked up in an index to get the
corresponding pipeline (if there is one), and that pipeline is put
into a `source.Channel`. The channel source multiplexes the dynamic
event handlers into a static pipeline requeue handler.
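
A sketch of that wiring in SetupWithManager (the names and the Pipeline API import are illustrative; this assumes a controller-runtime version where source.Channel is a struct fed from a GenericEvent channel):

```go
import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"

	pipelinesv1alpha1 "example.invalid/pipelines/api/v1alpha1" // placeholder import path
)

func (r *PipelineReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Channel fed by the dynamic, per-cache event handlers; each value names
	// the Pipeline (found via the index) whose target changed.
	events := make(chan event.GenericEvent)
	r.targetEvents = events // hypothetical field the handlers send into

	return ctrl.NewControllerManagedBy(mgr).
		For(&pipelinesv1alpha1.Pipeline{}).
		// The static half of the plumbing: requeue whichever Pipeline
		// arrives on the channel.
		WatchesRawSource(&source.Channel{Source: events}, &handler.EnqueueRequestForObject{}).
		Complete(r)
}
```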

NB:

 * I've put the remote cluster test in its own Test* wrapper, because
   it needs to start another testenv to be the remote cluster.

 * Supporting arbitrary types means using `unstructured.Unstructured`
   when querying for target objects, and this complicates checking
   their status. Since the caches are per-type, in theory there could
   be code for querying known types (HelmRelease and Kustomization), with
   `Unstructured` as a fallback. So long as the object passed to
   `watchTargetAndGetReader(...)` is the same one used with
   `client.Get(...)`, it should all work.

 * A cache per {cluster, type} is not the only possible scheme. The
   watching could be more precise -- meaning fewer spurious events,
   and narrower permissions needed -- by having a cache per {cluster,
   namespace, type}, with the trade-off being managing more
   goroutines, and other overheads. I've chosen the chunkier scheme
   based on an informed guess that it'll be more efficient for low
   numbers of clusters and targets.
This makes room for indexing targets for a different purpose --
garbage collection of caches.

Signed-off-by: Michael Bridgen <michael.bridgen@weave.works>
It's possible that due to pipelines disappearing, or being updated,
some caches will no longer be needed. If these are not shut down, the
number of caches will only grow, which constitutes a leak of resources
(though not necessarily a serious one, since it will max out at
`clusters x types`).

To be able to shut down caches that are no longer needed, we need to
be able to do a few things:

 1. detect when they aren't needed
 2. stop them running when not needed
 3. stop them when the controller is shutting down

To do the first, I index the cache keys used by each pipeline. The
garbage collector regularly checks to see if each cache has entries in
the index; and if not, it's not used by any pipeline and can be shut
down.

To keep track of caches to consider for collection, the GC uses a
rate-limiting work queue. When the cache is created, it's put on the
queue; and each time it's considered and is still needed, it's
requeued with a longer retry, up to about eight minutes. This avoids
the question of finding an appropriate event to hook into, with the
downside of being a bit eventual.
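
That requeue-with-backoff behaviour maps naturally onto client-go's workqueue package; roughly (the base delay is an assumption, the eight-minute cap is from the text above):

```go
import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newCacheGCQueue returns a rate-limited queue of cache keys awaiting a
// garbage-collection check: each requeue backs off exponentially, capped at
// about eight minutes.
func newCacheGCQueue() workqueue.RateLimitingInterface {
	limiter := workqueue.NewItemExponentialFailureRateLimiter(2*time.Second, 8*time.Minute)
	return workqueue.NewRateLimitingQueue(limiter)
}

// The GC loop then looks something like: take a key, check the index, and
// either tear the cache down or put the key back with a longer delay.
//
//	key, _ := queue.Get()
//	if stillNeeded(key) {
//		queue.AddRateLimited(key) // still used: check again later, with backoff
//	} else {
//		shutDownCache(key) // hypothetical teardown
//		queue.Forget(key)
//	}
//	queue.Done(key)
```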

The second and third things can be arranged by deriving contexts from
the manager's context. I have introduced `runner` (in runner.go) which
can be Start()ed by the manager and thus gain access to its context,
and which can then construct a context for each cache. Each cache gets
its own cancel func that can be used to shut it down, but will also be
shut down by the manager when it's shutting down itself.
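
One way to arrange that is a manager Runnable that keeps hold of the manager's context once started and derives child contexts from it; a minimal sketch (not the PR's actual runner.go):

```go
import (
	"context"
	"sync"
)

// runner is added to the manager (mgr.Add(r)); Start hands it the manager's
// context, from which every cache context is derived. Stopping the manager
// therefore stops all caches, and each cache can also be stopped on its own
// via the cancel func returned from newChildContext.
type runner struct {
	mu      sync.Mutex
	rootCtx context.Context
	started chan struct{}
}

func newRunner() *runner {
	return &runner{started: make(chan struct{})}
}

// Start implements manager.Runnable.
func (r *runner) Start(ctx context.Context) error {
	r.mu.Lock()
	r.rootCtx = ctx
	r.mu.Unlock()
	close(r.started)
	<-ctx.Done() // block until the manager shuts down
	return nil
}

// newChildContext returns a context derived from the manager's context, with
// a cancel func for shutting down just one cache.
func (r *runner) newChildContext() (context.Context, context.CancelFunc) {
	<-r.started // wait until the manager has started us
	r.mu.Lock()
	defer r.mu.Unlock()
	return context.WithCancel(r.rootCtx)
}
```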

Signed-off-by: Michael Bridgen <michael.bridgen@weave.works>
All the bits to do with client caches can go together, since (after
being constructed) they only interact with the reconciler through
`watchTargetAndGetReader` and via events.

This is mainly a case of relocating the relevant fields and changing
some variable names.

Signed-off-by: Michael Bridgen <michael.bridgen@weave.works>
@squaremo merged commit 0d01a61 into main on Nov 1, 2023
6 checks passed
@squaremo deleted the watcher-infrastructure branch on November 1, 2023 15:34
@squaremo (Contributor Author) commented Nov 1, 2023

Thanks for reviewing, Luiz ⭐

Successfully merging this pull request may close these issues.

Extend application object watching to remote clusters and arbitrary types
2 participants