Improve handling of iRODS/taskflow failures with connection timeouts #2028

Open
mikkonie opened this issue Oct 14, 2024 · 0 comments
Labels
app: irodsbackend (Issue in the irodsbackend app), app: taskflowbackend (Issue in the taskflowbackend app), environment (Issues of dependencies, CI, deployment etc.), feature (Requested feature or enhancement), tbd (Comments wanted, spec/schedule/prioritization to be decided, etc.)

Comments


mikkonie commented Oct 14, 2024

Something I noticed in one of the projects on our production server: multiple landing zones have failed to be deleted with the error "failed to remove collection". The iRODS log displays no errors during this time period.

Looking into the Docker Compose logs, it seems connections from celeryd to iRODS were timing out during this period:

sodar-celeryd-default_1  | [2024-10-13 16:11:55,944: WARNING/ForkPoolWorker-1] Exception ignored in: <function Connection.__del__ at 0x7fa18c02e8b0>
sodar-celeryd-default_1  | [2024-10-13 16:11:55,945: WARNING/ForkPoolWorker-1] Traceback (most recent call last):
sodar-celeryd-default_1  | [2024-10-13 16:11:55,945: WARNING/ForkPoolWorker-1]   File "/usr/local/lib/python3.8/site-packages/irods/connection.py", line 90, in __del__
sodar-celeryd-default_1  | [2024-10-13 16:11:55,945: WARNING/ForkPoolWorker-1]     self.disconnect()
sodar-celeryd-default_1  | [2024-10-13 16:11:55,946: WARNING/ForkPoolWorker-1]   File "/usr/local/lib/python3.8/site-packages/irods/connection.py", line 306, in disconnect
sodar-celeryd-default_1  | [2024-10-13 16:11:55,946: WARNING/ForkPoolWorker-1]     self.socket = self.socket.unwrap()
sodar-celeryd-default_1  | [2024-10-13 16:11:55,946: WARNING/ForkPoolWorker-1]   File "/usr/local/lib/python3.8/ssl.py", line 1285, in unwrap
sodar-celeryd-default_1  | [2024-10-13 16:11:55,947: WARNING/ForkPoolWorker-1]     s = self._sslobj.shutdown()
sodar-celeryd-default_1  | [2024-10-13 16:11:55,948: WARNING/ForkPoolWorker-1] socket.timeout: The read operation timed out
sodar-celeryd-default_1  | 2024-10-13 16:11:55,990 [ERROR] taskflowbackend.flows: Exception in run_flow(): Failed to remove collection
sodar-celeryd-default_1  | 2024-10-13 16:11:55,995 [ERROR] taskflowbackend.api: Error running flow: Failed to remove collection

It's not the first time I've seen something like this, but I'd like us to try to handle these failures better somehow. At the very least we should report the timeout instead of a generic "failed to remove collection", if at all possible. Catching the timeout exception and reporting it back in the timeline/zone status would be a start.
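As a rough illustration (not the actual SODAR taskflow code), the task removing the collection could catch the timeout separately and pass the real cause on to the zone status. The `set_zone_status()` hook below is a hypothetical stand-in for whatever the timeline/zone status update ends up looking like:

```python
import socket

from irods.exception import NetworkException


def remove_collection(session, coll_path, set_zone_status):
    """Remove an iRODS collection, reporting connection timeouts explicitly."""
    try:
        session.collections.remove(coll_path, recurse=True, force=True)
    except (socket.timeout, NetworkException) as ex:
        # Surface the underlying timeout instead of only the generic message
        set_zone_status(
            'FAILED',
            'iRODS connection timed out while removing collection '
            '{}: {}'.format(coll_path, ex),
        )
        raise
    except Exception as ex:
        set_zone_status(
            'FAILED',
            'Failed to remove collection {}: {}'.format(coll_path, ex),
        )
        raise
```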

As for the failure itself, it seems to be some kind of temporary network error. The iRODS server itself appears to have been up and running just fine at that point, and afterwards everything seems to have recovered without any changes. The servers run as Docker containers in the same Docker Compose network, but each server is accessed by its FQDN. Could this just be a temporary DNS glitch?
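If it really is a transient network/DNS hiccup, one idea (purely a sketch; the helper name and retry counts are made up) would be to retry the iRODS operation a bounded number of times before failing the whole flow:

```python
import socket
import time

from irods.exception import NetworkException


def retry_irods_op(func, retries=3, delay=5.0):
    """Run an iRODS operation, retrying on connection timeouts."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except (socket.timeout, NetworkException):
            if attempt == retries:
                raise  # Give up and let the caller report the timeout
            time.sleep(delay)  # Back off before retrying the transient failure


# Usage sketch:
# retry_irods_op(lambda: session.collections.remove(coll_path, recurse=True))
```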

Ideas are welcome.

@mikkonie added the environment, tbd, app: irodsbackend, app: taskflowbackend and feature labels and removed the bug label on Oct 14, 2024