Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dogswatch: fix Agent handler and policy check #573

Merged
merged 33 commits into from
Dec 9, 2019
Merged

Conversation

jahkeup
Copy link
Member

@jahkeup jahkeup commented Dec 2, 2019

Issue #, if available:

indirectly #505, continues #239

Description of changes:

This fixes some misbehaving posting of intents that cause the Agent to otherwise loop over its own emitted events. This change splits the checks out and executes them on sitrep requests (during stabilization) as well as stabilize the handling of duplicate messages to the same end.

README is updated to include the steps needed to run dogswatch in a cluster to test and run - to test the container image will need to be updated to an ECR repository that's had the development image pushed to it.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@jahkeup
Copy link
Member Author

jahkeup commented Dec 2, 2019

This bug highlights the need to atomically emit, filter, and coordinate the events even from a single source, I don't yet have a greater solution in this way and I'd like to explore options that replace the strategy here in favor of async jobs scheduled with Kubernetes' native resources.

@jahkeup jahkeup changed the title dogswatch: Split update and resource annotation check in Agent dogswatch: fix Agent handler and policy check Dec 5, 2019
@jahkeup
Copy link
Member Author

jahkeup commented Dec 6, 2019

There's a fair amount of change here - mostly revolving around the tests and the supporting bug fixes protecting the codepaths against duplicate messaging which in turn causes greater issues with race conditions due to externalized data (annotations).

I'm just about ready to call this good - there's one case where the controller allows simultaneous updates incorrectly and I'm trying to chase down why. This issue should be resolved prior to pushing an image as this could/would cause clusters to down themselves in short order.

@jahkeup jahkeup force-pushed the dogswatch-sitrep branch 2 times, most recently from 1d9d615 to fd107d0 Compare December 7, 2019 00:27
This also adds additional logging to diagnose issues during run.

Signed-off-by: Jacob Vallejo <jakeev@amazon.com>
Signed-off-by: Jacob Vallejo <jakeev@amazon.com>
Signed-off-by: Jacob Vallejo <jakeev@amazon.com>
The preflight should startup before the workers' event loops begin to
catch wind of node action.

Signed-off-by: Jacob Vallejo <jakeev@amazon.com>
The early check could conflict with the first set of events sent to the
node and cause it to re-emit events that it shouldn't.

Signed-off-by: Jacob Vallejo <jakeev@amazon.com>
The logic was written against an "is" state, not the "is next" state.
Check is retuned to consider this.
The number of active nodes is dependent on their perceived and posted
Intent; the policy check and its context needs to consider the activity
based on these perceived Intents. Tests were also added for this
extracted predicate.
Queue protections against growing event submission will cause some
messages to be dropped or contextually denied due to policy at time of
evaluation. To allow for this, the policy checks must be at the time
that its beginning to take flight in a "single threaded" manner.
@jahkeup jahkeup force-pushed the dogswatch-sitrep branch 3 times, most recently from d38884e to 8b633c9 Compare December 7, 2019 02:43
@jahkeup jahkeup marked this pull request as ready for review December 7, 2019 02:44
Copy link
Contributor

@zmrow zmrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial few comments

extras/dogswatch/README.md Outdated Show resolved Hide resolved
extras/dogswatch/README.md Outdated Show resolved Hide resolved
extras/dogswatch/README.md Show resolved Hide resolved
extras/dogswatch/README.md Outdated Show resolved Hide resolved
extras/dogswatch/README.md Outdated Show resolved Hide resolved
extras/dogswatch/pkg/controller/manager.go Outdated Show resolved Hide resolved
Copy link
Contributor

@etungsten etungsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐶 ⌚

@bottlerocket-os bottlerocket-os deleted a comment from jahkeup Dec 9, 2019
Copy link
Contributor

@zmrow zmrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📦

Copy link
Contributor

@patraw patraw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on my cluster and LGTM!

@jahkeup
Copy link
Member Author

jahkeup commented Dec 9, 2019

Force-pushed to fixup + squash commits - no delta here!

@jahkeup jahkeup merged commit 391dfa0 into develop Dec 9, 2019
@jahkeup jahkeup deleted the dogswatch-sitrep branch December 9, 2019 22:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants