Synchronization issue when context times out

Last week, a few bad destination nodes caused the `token-push` run to time out. We saw the following (edited) log lines, newest first:

```
2025-02-20 09:08:50	{"account":"sbndci","destPath":"/tmp/vt_****","level":"error","msg":"Could not copy source file to destination node","node":"NODE","rsyncOpts":"--perms --chmod=u=r,go=","sourcePath":"PATH","time":"2025-02-20T09:08:50-06:00"}
2025-02-20 09:08:50	{"account":"sbndci","destPath":"/tmp/vt_***-sbnd_ci","level":"error","msg":"Could not copy source file to destination node","node":"NODE","rsyncOpts":"--perms --chmod=u=r,go=","sourcePath":"PATH","time":"2025-02-20T09:08:50-06:00"}
2025-02-20 09:08:01	{"caller":"runAdminNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01	{"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"sbnd-data-globus_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01	{"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"hypot-gwms-test_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01	{"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"dune-ci_ci","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01	{"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"annie_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:07:52	{"experiment":"sbnd","level":"error","msg":"Error pushing vault tokens to destination node","node":"NODE","role":"production","time":"2025-02-20T09:07:52-06:00"}
```

So a bunch of notification handlers seemed to be waiting on a single bad service. We need to refactor the notification handler so that a timeout with one service doesn't hold up the others: we should go ahead and send the rest of the messages (I suspect the service-level messages are fine and it's only the admin one that isn't), and then clean up the hanging goroutines properly.

Synchronization issue when context times out #119
