TorchServe crashes in production with `WorkerThread - IllegalStateException` error #3087
Comments
@MaelitoP The following log shows that the gRPC client closed the connection, so the server was unable to send the response and threw an error. Could you please check your gRPC client side?
@lxning I'm checking, but I see nothing at the moment. When I look at the client logs, I can see that the server responds with an error:
As far as I know, if the connection is lost, we get an error like:
But why does the server crash when a client closes the connection? This behaviour looks weird to me, but maybe it's common, I don't know.
Hello,
When I did this timeout test, the
What can I do on the
Are you sure that it's not a bug on
We try to do
@lxning, you opened a PR some time ago to fix this problem (#2420). This regression was introduced in #2513. Am I right, or do I just not understand the logic?
I'm having the same problem. Even if the client closes the gRPC channel when it times out, TorchServe shouldn't crash; or, if it does crash, it should boot a new worker rather than leave the old one unusable.
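To make the expectation above concrete, here is a minimal Python sketch of the behaviour being asked for: a response destined for a client that has already gone away is dropped, and the worker keeps serving other requests. This is a simulation only. `ClientStream`, `respond`, and `worker_loop` are hypothetical names invented for illustration; they are not TorchServe or gRPC APIs, and this is not the fix proposed in any of the linked PRs.

```python
class ClientStream:
    """Hypothetical stand-in for a gRPC response stream (not a real gRPC API)."""

    def __init__(self):
        self.cancelled = False
        self.sent = []

    def cancel(self):
        # Simulates the client timing out and closing the channel.
        self.cancelled = True

    def send(self, payload):
        # Writing to a closed call raises, loosely analogous to the
        # IllegalStateException reported in this issue.
        if self.cancelled:
            raise RuntimeError("call already closed")
        self.sent.append(payload)


def respond(stream, payload):
    """Send a response, dropping it quietly if the client has gone away.

    Returns True if the payload was sent, False if it was dropped.
    """
    if stream.cancelled:
        return False
    try:
        stream.send(payload)
    except RuntimeError:
        # The client may cancel between the check and the write.
        return False
    return True


def worker_loop(jobs):
    """Process (stream, payload) jobs; a cancelled client must not stop the loop."""
    return [respond(stream, payload) for stream, payload in jobs]
```

In this sketch, a job whose client has cancelled yields `False` while the jobs before and after it are still answered, which is the "don't leave the worker unusable" property described above.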
@lxning
We have the same issue on our side. I was wondering why our CI process was unstable and unable to handle our tests with only one worker and one instance. As soon as a request times out, our worker is no longer available. In production we only saw the problem during unexpected spikes of requests, because we have multiple instances with multiple workers, and thanks to daily deployments we got our stale workers back.
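For anyone wanting to spot stale workers before the next deployment, the following Python sketch scans a describe-model response from the TorchServe management API and lists workers that are not reported as `READY`. The sample payload, the model name `my_model`, and the helper `stale_workers` are assumptions made for illustration; the field names follow the usual shape of that response but should be checked against your TorchServe version. It also assumes a stuck worker is reported with a status other than `READY`, which this thread does not confirm.

```python
import json


def stale_workers(describe_response):
    """Return (model name, worker id, status) for every worker not marked READY.

    `describe_response` is the JSON text of a describe-model response,
    assumed to be a list of model entries that each carry a "workers" list.
    """
    stale = []
    for model in json.loads(describe_response):
        for worker in model.get("workers", []):
            if worker.get("status") != "READY":
                stale.append(
                    (model.get("modelName"), worker.get("id"), worker.get("status"))
                )
    return stale


# Hypothetical sample payload, trimmed to the fields used above.
sample = """
[
  {
    "modelName": "my_model",
    "workers": [
      {"id": "9000", "status": "READY"},
      {"id": "9001", "status": "UNLOADING"}
    ]
  }
]
"""

print(stale_workers(sample))
```

A check like this could run in CI or as a periodic probe, so that a worker lost to a timed-out request is noticed rather than silently reducing capacity.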
I made an attempt to fix the issue here: #3267. Being new to this repository, I've left it as a draft for now.
Downgrading to torchserve=0.9.0 helped solve this issue.
Indeed, this version predates the refactoring that reintroduced the bug. Unfortunately, at least in my case, we also need the gRPC configurations for client-side load-balancing (connection max age and grace time) that were introduced in version
Hello,
I have a problem with TorchServe in production. When I start the server there is no problem and everything works perfectly; I can use the model and make requests on it. But after around 3 days the server crashes, and I don't understand why.
If you look at the monitoring metrics, you can see that `ts_inference_requests_total` is ~1 req/s, and the `PredictionTime` is between 40 and 600 ms.
My attempt to trigger the error locally was unsuccessful 😞
Server error logs (TorchServe)
Client error logs (Go client)
Dockerfile
Configuration (`config.properties`)
Thank you so much for your time