Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate alarm configuration to slo-alerts repo #1149

Merged
merged 1 commit into from
Aug 15, 2023

Conversation

jacobwinch
Copy link
Contributor

@jacobwinch jacobwinch commented Aug 10, 2023

What does this change?

These alarms (which were first introduced via #855) are going to be provisioned via a different repository.

There will be a few small changes to the alarms as part of this migration - most notably:

  • We no longer provision these alarms for CODE, since having alarms for CODE generally creates noise that doesn't require an action (i.e. we expect to break CODE more frequently than PROD since 'real' users do not rely on it!). We could add a CODE alarm via slo-alerts if the team think it's really useful, but I don't think it's necessary now that we're no longer testing this new approach.
  • We now use a calculation for valid events, rather than relying on the RequestCount metric, as we found that the latter produced some unexpected/undesirable results (most notably, it wasn't incremented in certain 5XX error scenarios!)

See https://github.com/guardian/slo-alerts/pull/19 for more details.

How to test

I've deployed this to CODE and confirmed that the only deletions are related to alarms (which is desirable in this case):

image
N.B. launch config deletion at 10:51:49 relates to a previous deployment (AMI update)

How can we measure success?

Our (PROD) alarm coverage will be the same once this PR and https://github.com/guardian/slo-alerts/pull/19 have been merged, but future updates/improvements (implemented by @guardian/devx-reliability) will be picked up automatically rather than requiring a GuCDK update.

Have we considered potential risks?

There is a small risk that we've made a mistake during the migration, but I have double checked that the config provisioned via https://github.com/guardian/slo-alerts/pull/19 is correct now that it has been deployed.

templateFile: yamlTemplateFilePath,
});

const registrationLoadBalancer = cfnTemplate.getResource(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might look a little worrying, but the snapshots confirm that we're not actually deleting the load balancer 😅

This is just a reference to the load balancer which is defined via YAML.

@@ -17,43 +14,8 @@ export class Registration extends GuStack {
// Until this project has been fully migrated to GuCDK you should update the 'old' infrastructure by modifying
// the YAML file and then re-running the snapshot tests to confirm that the changes are being pulled through by
// CDK
const cfnTemplate = new CfnInclude(this, 'YamlTemplate', {
new CfnInclude(this, 'YamlTemplate', {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit pointless now that we're not actually provisioning any resources via GuCDK, but I don't think it's worth rolling back this wiring (introduced via #831) as it will slightly reduce the amount of work required when we decide to migrate this to a pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree!

const budgetAlarm = new GuErrorBudgetAlarmExperimental(this, {
sloName: `RegistrationAvailability${this.stage}Slo`,
sloTarget: 0.995,
badEvents: new MathExpression({
Copy link
Contributor Author

@jacobwinch jacobwinch Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can generalise the badEvents and validEvents needed for a 'typical' Availability SLI so that it works for every load balancer - so this now happens via some code that @guardian/devx-reliability will maintain: https://github.com/guardian/slo-alerts/blob/f715cb3c6cb48d5885d2ebe5d987f85acae4547b/cdk/lib/slo-alerts.ts#L28-L92

@jacobwinch jacobwinch marked this pull request as ready for review August 14, 2023 12:40
Copy link
Contributor

@DavidLawes DavidLawes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jacob - all looks very sensible!

Copy link
Contributor

@waisingyiu waisingyiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Jacob for the work!

@jacobwinch jacobwinch merged commit debb499 into main Aug 15, 2023
1 check passed
@jacobwinch jacobwinch deleted the jw-error-budget-alert-migration branch August 15, 2023 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants