-
Notifications
You must be signed in to change notification settings - Fork 9
VYGR-391: Add opsgenie integrations to notifications #120
base: master
Are you sure you want to change the base?
Conversation
f6f5e25
to
04e1d7d
Compare
0b5b2f0
to
005de8c
Compare
Remove opsgenie secret work-in-progress
@@ -459,41 +465,65 @@ func (c *Controller) getServiceData(user auth.OptionalUser, name voyager.Service | |||
return c.ServiceCentral.GetService(context.Background(), user, servicecentral.ServiceName(name)) | |||
} | |||
|
|||
func (c *Controller) buildNotifications(spec creator_v1.ServiceSpec) (*orch_meta.Notifications, error) { | |||
func (c *Controller) buildNotifications(spec creator_v1.ServiceSpec) (*orch_meta.Notifications, bool /* retriable */, error) { | |||
// Default pagerduty values re-used from Micros config.js |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of return this at the end we now instantiate the notifications with defaults and override it if the pagerduty or opsgenie notifications creation is successful.
pkg/synchronization/controller.go
Outdated
}, nil | ||
} | ||
} | ||
integrations, retriable, err := c.getOpsgenieIntegrations(spec.Metadata.Opsgenie) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if team is not specified in Service Central? Will we get an empty list of integrations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do a check at the start of the called function to see if spec.Metadata.Opsgenie
is nil
func (c *Controller) getOpsgenieIntegrations(metadata *creator_v1.OpsgenieMetadata) ([]opsgenie.Integration, bool /* retriable */, error) {
// Opsgenie is optional
if metadata == nil {
return nil, true, nil
}
pkg/synchronization/controller.go
Outdated
} | ||
|
||
return &orch_meta.Notifications{ | ||
notifications = orch_meta.Notifications{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Email remains the same, we could set notifications.PagerdutyEndpoint
and notifications.LowPriorityPagerdutyEndpoint
to make it more explicit which parts do we want to modify.
return errors.Wrap(err, "unable to parse Deployinator URL") | ||
if err != nil { | ||
return errors.Wrap(err, "unable to parse Deployinator URL") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why this always returned an error and I cannot identify where this function is called.
I think its called in readAndValidateOptions
when yaml.UnmarshalStrict
is called, but if thats the case then I'm not sure why we always fail.
https://github.com/atlassian/voyager/blob/master/cmd/synchronization/app/options.go#L67
@ash2k Can you help me understand why we returned an error before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like a bug, not sure how did it work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be implicitly called in https://github.com/atlassian/voyager/blob/master/cmd/synchronization/app/options.go#L67
EnvTypeDev EnvType = "dev" | ||
EnvTypeStaging EnvType = "staging" | ||
EnvTypeProd EnvType = "prod" | ||
EnvTypeGlobal EnvType = "null" // Intentionally a string called "null" as this is the expected result from opsgenie int manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @jokeyrhyme
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not global
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why its not called "global". @jokeyrhyme Do you know why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fcobb Well, it's not that things are global, it's just that they aren't explicitly environment-specific
Maybe this should actually mean Global, and maybe we should actually emit "global"
instead of "null"
When we put this together, the most important thing was making sure our CloudWatch alarms (which are environment-specific) were covered, everything else was secondary
I'm wondering whether we should call this Null or something so that we can meaningfully transition to Global in future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jokeyrhyme I went for Global as if its not environment specific then its essentially global.
As you own this API I will defer these decisions to you and your team. My preference is to return a specific value for these non-environment specific values rather than null.
|
||
if found { | ||
service.Attributes = append(service.Attributes, ogTeamAttr) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, we filtered the service attributes before adding them to the service object. We now do that filtering later on.
@@ -393,34 +394,6 @@ func TestGetServiceWithFailedAttributesCall(t *testing.T) { | |||
require.Equal(t, 2, handler.RequestSnapshots.Calls()) | |||
} | |||
|
|||
func TestGetServiceWithMultipleOpsGenieAttribute(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test does not make sense anymore as it was making sure we errored on multiple OG attributes, this is no longer an error case as we handle multiple attributes later on.
maxSyncWorkers = 10 | ||
maxSyncWorkers = 10 | ||
defaultPagerdutyGeneric = "5d11612f25b840faaf77422edeff9c76" | ||
defaultPagerdutyCloudwatch = "https://events.pagerduty.com/adapter/cloudwatch_sns/v1/124e0f010f214a9b9f30b768e7b18e69" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These defaults were in the buildNotifcations
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These values should come from configuration, not be constants. Please move them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do this in a follow up PR as this PR is big enough.
@@ -459,41 +467,73 @@ func (c *Controller) getServiceData(user auth.OptionalUser, name voyager.Service | |||
return c.ServiceCentral.GetService(context.Background(), user, servicecentral.ServiceName(name)) | |||
} | |||
|
|||
func (c *Controller) buildNotifications(spec creator_v1.ServiceSpec) (*orch_meta.Notifications, error) { | |||
func (c *Controller) buildNotifications(spec creator_v1.ServiceSpec) (*orch_meta.Notifications, bool /* retriable */, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is easier to understand the changes to this function if you do a side-by-side diff.
pkg/servicecentral/store.go
Outdated
@@ -306,6 +306,11 @@ func serviceDataToService(data *ServiceDataRead) (*creator_v1.Service, error) { | |||
service.Spec.Metadata.PagerDuty = pagerDutyMetadata | |||
} | |||
|
|||
ogMetadata, _ := GetOpsGenieAttribute(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we log an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we log an error?
Considering this is optional, Should this be an error or info message?
ogMetadata, err := GetOpsGenieAttribute(data) // Error ignored as Opsgenie team is optional
if err != nil {
c.logger.Info("unable to get Opsgenie attribute - ignoring", zap.Error(err))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that failure to make a call to Opsgenie integration manager is still an error, so I would log it at the error level. We should react to being unable to fetch Opsgenie integrations rather than completely ignoring them.
@ash2k WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the path of calling OG int manager, this call will attempt to find any opsgenie attributes within the service object.
We call opsgenie int manager during the createOrUpdateServiceMetadata
function when building the notifications. I agree that we should know when calls to OG int manager fail, and we return an error in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
attempt to find any opsgenie attributes within the service object
If attribute not found, there won't be an error, but just a nil
result AFAICT?
So then the error means that either the attribute format is invalid, or there is some bug in Voyager code, and we should also log with error, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep you are correct, an error here is when:
- we found an issue with an Opsgenie attribute
- we found more than one Opsgenie attribute.
EDIT: As we have found a configured Opsgenie team but failed to process it, I am currently returning an error rather than logging it.
EnvTypeDev EnvType = "dev" | ||
EnvTypeStaging EnvType = "staging" | ||
EnvTypeProd EnvType = "prod" | ||
EnvTypeGlobal EnvType = "null" // Intentionally a string called "null" as this is the expected result from opsgenie int manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
return nil, nil // Not an error as Opsgenie is optional | ||
} | ||
|
||
filtered := make([]opsgenie.Integration, 0, 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually do not pre-allocate if we don't exactly know the number of items. Just a style thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to change this to make our style if you would like. We expect there to be exactly 4 but it is possible that they add more integrations in the future. What would you like to see in this scenario?
maxSyncWorkers = 10 | ||
maxSyncWorkers = 10 | ||
defaultPagerdutyGeneric = "5d11612f25b840faaf77422edeff9c76" | ||
defaultPagerdutyCloudwatch = "https://events.pagerduty.com/adapter/cloudwatch_sns/v1/124e0f010f214a9b9f30b768e7b18e69" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These values should come from configuration, not be constants. Please move them.
EnvTypeDev EnvType = "dev" | ||
EnvTypeStaging EnvType = "staging" | ||
EnvTypeProd EnvType = "prod" | ||
EnvTypeGlobal EnvType = "null" // Intentionally a string called "null" as this is the expected result from opsgenie int manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not global
?
@@ -17,7 +17,7 @@ import ( | |||
) | |||
|
|||
const ( | |||
opsGenieIntManURL = "https://micros.prod.atl-paas.net" | |||
opsgenieIntManURL = "https://micros.prod.atl-paas.net" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use real data in tests please. Use fake data =)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a manual integration test, I don't think fake data is applicable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine to be merged. Please address the issues/nits in a follow up PR. I think the manual opsgenie integration test should be moved to the internal repo.
fcf889a
to
434198e
Compare
As discussed, please add a test for Opsgenie integrations present in |
Done |
Make sure you block the PR in stash updating dependencies after you merge this change, and fix config. |
What: This PR uses the Opsgenie integration manager client to add Opsgenie integrations into the service-metadata configmap, allowing autowiring functions to make use of the integrations.
Opsgenie integration manager client PR: #119
Manually tested the following scenarios in the playground: