Synchronization issue when context times out #119

@shreyb

Last week, a few bad destination nodes caused the token-push run to time out. We saw the following (edited) log lines:

2025-02-20 09:08:50	{"account":"sbndci","destPath":"/tmp/vt_****","level":"error","msg":"Could not copy source file to destination node","node":"NODE","rsyncOpts":"--perms --chmod=u=r,go=","sourcePath":"PATH","time":"2025-02-20T09:08:50-06:00"}
2025-02-20 09:08:50	{"account":"sbndci","destPath":"/tmp/vt_***-sbnd_ci","level":"error","msg":"Could not copy source file to destination node","node":"NODE","rsyncOpts":"--perms --chmod=u=r,go=","sourcePath":"PATH","time":"2025-02-20T09:08:50-06:00"}
2025-02-20 09:08:01	{"caller":"runAdminNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01	{"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"sbnd-data-globus_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01	{"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"hypot-gwms-test_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01	{"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"dune-ci_ci","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:08:01	{"caller":"notifications.runServiceNotificationHandler","level":"error","msg":"Timeout exceeded in notification Manager","service":"annie_production","time":"2025-02-20T09:08:01-06:00"}
2025-02-20 09:07:52	{"experiment":"sbnd","level":"error","msg":"Error pushing vault tokens to destination node","node":"NODE","role":"production","time":"2025-02-20T09:07:52-06:00"}

So several notification handlers appear to have been blocked waiting on a single bad service. We need to refactor the notification handler so that a timeout on one service does not hold up the others: we should go ahead and send the rest of the messages (I suspect the service-level messages are fine and only the admin message is affected), and then clean up the hanging goroutines properly.

Labels: bug