-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate alarm configuration to slo-alerts repo #1149
Conversation
templateFile: yamlTemplateFilePath, | ||
}); | ||
|
||
const registrationLoadBalancer = cfnTemplate.getResource( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might look a little worrying, but the snapshots confirm that we're not actually deleting the load balancer 😅
This is just a reference to the load balancer which is defined via YAML.
@@ -17,43 +14,8 @@ export class Registration extends GuStack { | |||
// Until this project has been fully migrated to GuCDK you should update the 'old' infrastructure by modifying | |||
// the YAML file and then re-running the snapshot tests to confirm that the changes are being pulled through by | |||
// CDK | |||
const cfnTemplate = new CfnInclude(this, 'YamlTemplate', { | |||
new CfnInclude(this, 'YamlTemplate', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit pointless now that we're not actually provisioning any resources via GuCDK, but I don't think it's worth rolling back this wiring (introduced via #831) as it will slightly reduce the amount of work required when we decide to migrate this to a pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree!
const budgetAlarm = new GuErrorBudgetAlarmExperimental(this, { | ||
sloName: `RegistrationAvailability${this.stage}Slo`, | ||
sloTarget: 0.995, | ||
badEvents: new MathExpression({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can generalise the badEvents
and validEvents
needed for a 'typical' Availability SLI so that it works for every load balancer - so this now happens via some code that @guardian/devx-reliability will maintain: https://github.com/guardian/slo-alerts/blob/f715cb3c6cb48d5885d2ebe5d987f85acae4547b/cdk/lib/slo-alerts.ts#L28-L92
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Jacob - all looks very sensible!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Jacob for the work!
What does this change?
These alarms (which were first introduced via #855) are going to be provisioned via a different repository.
There will be a few small changes to the alarms as part of this migration - most notably:
CODE
, since having alarms forCODE
generally creates noise that doesn't require an action (i.e. we expect to breakCODE
more frequently thanPROD
since 'real' users do not rely on it!). We could add aCODE
alarm viaslo-alerts
if the team think it's really useful, but I don't think it's necessary now that we're no longer testing this new approach.valid events
, rather than relying on theRequestCount
metric, as we found that the latter produced some unexpected/undesirable results (most notably, it wasn't incremented in certain 5XX error scenarios!)See https://github.com/guardian/slo-alerts/pull/19 for more details.
How to test
I've deployed this to
CODE
and confirmed that the only deletions are related to alarms (which is desirable in this case):N.B. launch config deletion at 10:51:49 relates to a previous deployment (AMI update)
How can we measure success?
Our (
PROD
) alarm coverage will be the same once this PR and https://github.com/guardian/slo-alerts/pull/19 have been merged, but future updates/improvements (implemented by @guardian/devx-reliability) will be picked up automatically rather than requiring a GuCDK update.Have we considered potential risks?
There is a small risk that we've made a mistake during the migration, but I have double checked that the config provisioned via https://github.com/guardian/slo-alerts/pull/19 is correct now that it has been deployed.