Improve handling of iRODS/taskflow failures with connection timeouts #2028
Labels: app: irodsbackend, app: taskflowbackend, environment, feature, tbd
Something I noticed in one of the projects on our production server: multiple landing zones in the project have failed to delete with the error "failed to remove collection". The iRODS log shows no errors during this time period.
Looking into Docker Compose logs, it seems connections from celeryd to iRODS have been timing out during this period:
It's not the first time I've seen something like this, but I'd like us to try to handle these cases better. At the very least we should report the timeout instead of a generic "unable to remove collection", if at all possible. Catching the timeout exception and reporting it back in the timeline/zone status would be a start.
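A minimal sketch of what catching the timeout separately could look like. All names here (`remove_collection`, `delete_zone`, the status strings) are hypothetical placeholders and not the actual taskflowbackend code, and the exception class actually raised by the iRODS client may differ and would need to be checked:

```python
import socket


def remove_collection(path):
    """Hypothetical stand-in for the iRODS call removing a collection.

    Always times out here, to simulate the failure described above.
    """
    raise socket.timeout('timed out')


def delete_zone(path, remove_func=remove_collection):
    """Try to remove a collection, return a (status, message) tuple.

    The status strings and message wording are assumptions for
    illustration only.
    """
    try:
        remove_func(path)
    except (TimeoutError, socket.timeout) as ex:
        # Report the timeout explicitly instead of a generic failure
        return (
            'FAILED',
            'Connection to iRODS timed out while removing '
            'collection "{}": {}'.format(path, ex),
        )
    except Exception as ex:
        # Any other error keeps the generic message
        return 'FAILED', 'Failed to remove collection "{}": {}'.format(
            path, ex
        )
    return 'DELETED', 'Collection "{}" removed'.format(path)
```

The returned message could then be stored in the zone status and the timeline event, so the user sees that the failure was a timeout and may be worth retrying.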
As for the failure itself, it would seem this is some kind of temporary network error. The iRODS server itself appears to have been up and running just fine at that point, and afterwards everything appears to have recovered without any changes. The servers are running as Docker containers in the same Docker Compose network, but each server is accessed by its FQDN. Could this just be a temporary DNS glitch?
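One way to test the DNS theory would be to probe name resolution from inside the celeryd container when the problem occurs. The following is only a rough diagnostic sketch using the standard library. The function name `check_dns` and its parameters are hypothetical, and the hostname in the usage note is a placeholder:

```python
import socket
import time


def check_dns(hostname, attempts=3, delay=1.0, resolver=socket.getaddrinfo):
    """Resolve hostname repeatedly, return a list of (seconds, error).

    The error is None if the lookup succeeded. The resolver argument
    exists only to allow substituting a fake resolver in tests.
    """
    results = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            resolver(hostname, None)
            error = None
        except socket.gaierror as ex:
            # Name resolution failed: record the error message
            error = str(ex)
        results.append((time.monotonic() - start, error))
        if i < attempts - 1:
            time.sleep(delay)
    return results
```

Calling e.g. `check_dns('irods.example.com')` and logging the result would show whether lookups fail or take unusually long during the period when connections time out.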
Ideas are welcome.