
Concurrent request handling #1062

Open
khanjandharaiya opened this issue Jan 4, 2024 · 6 comments
Labels
question Further information is requested

Comments

@khanjandharaiya

Hey there!! 🙏

I am currently working on a project that sends requests to the model through a Flask API, and when users send requests concurrently the model is not able to handle them. Is there any way I can handle multiple concurrent requests to the model and serve multiple users at the same time?

Please help! @abetlen

@AayushSameerShah

AayushSameerShah commented Jan 4, 2024

I am not sure if your case is similar, but I am facing the same issue:

  1. I have created a Flask API endpoint /request to which the user makes a POST request.
  2. Based on the information received from the user, I build the prompt and the model returns a result.
  3. I have set threaded=True in Flask's app.run(threaded=True), which means each new request can be processed in its own thread.
  4. But with 2 concurrent users the server crashes, because it tries to load 2 models at once and that doesn't work.

I am probably looking for the same solution; I hope we find one. (A minimal sketch of the shared-model-plus-lock workaround is below.)
Thanks for opening the issue 🙏🏻


PS: I was also looking for #771, #897 👀
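
A minimal sketch of that workaround (not from this thread; the model path and endpoint are examples, assuming llama-cpp-python's Llama class and Flask): load the model once at startup and guard every generation with a threading.Lock, so threaded=True can accept requests in parallel without loading a second model or running two generations at the same time.

import threading

from flask import Flask, jsonify, request
from llama_cpp import Llama

app = Flask(__name__)

# Load a single shared model instance once; never construct one per request.
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")  # example path
llm_lock = threading.Lock()

@app.route("/request", methods=["POST"])
def handle_request():
    prompt = request.json["prompt"]
    # Only one thread may run inference at a time; other requests wait here.
    with llm_lock:
        result = llm.create_completion(prompt, max_tokens=256)
    return jsonify(result)

if __name__ == "__main__":
    # threaded=True still helps: waiting requests queue instead of being refused.
    app.run(threaded=True)

Requests are served one after another, so latency grows with the queue, but the process only ever holds one copy of the model in memory.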

This library also provides a server:

interrupt_requests: bool = Field(
default=True,
description="Whether to interrupt requests when a new request is received.",
)

Perhaps that will help.
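
For reference, a hedged sketch of trying the bundled server out. The model path is an example, and it is an assumption that the CLI exposes one flag per field of the settings model shown above, so interrupt_requests would map to --interrupt_requests.

import subprocess

# Launch the bundled OpenAI-compatible server; with request interruption
# disabled, an ongoing generation is not cancelled when a new request arrives.
subprocess.run([
    "python", "-m", "llama_cpp.server",
    "--model", "./models/llama-2-7b.Q4_K_M.gguf",  # example path
    "--interrupt_requests", "False",               # assumed flag name, mirroring the field above
])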

@littlebai3618

If the hardware's compute is insufficient, the benefit of parallel inference is low. I implemented simple parallel inference on top of this project and tested it on a V100S; with a concurrency of 2, it was no more efficient than serving a single request at a time.

Supporting parallel inference (batch processing) is a very complex task, involving issues such as the KV cache and logits. As an alternative, you can use the api_like_OAI.py provided by llama.cpp. That service supports parallel inference, although per-request performance is slightly lower during parallel execution.

@abetlen abetlen added the question Further information is requested label Jan 12, 2024
@sergey-zinchenko

Hi! I just made such a solution for myself. Here is the code: https://github.com/sergey-zinchenko/llama-cpp-python/tree/model_lock_per_request

I introduced async locking around all model access for all kinds of requests, streaming and non-streaming. All requests are handled one by one, so it is not truly concurrent, but at least the server no longer crashes or interrupts the request it is currently handling.
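
A minimal sketch of that idea (this is not the linked patch; FastAPI, the endpoint, and the model path are illustrative): a single asyncio.Lock serializes every generation, and the blocking call runs off the event loop so the server stays responsive while requests wait their turn.

import asyncio

from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")  # loaded once, shared
model_lock = asyncio.Lock()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    # Requests queue on the lock and run one at a time instead of racing on the model.
    async with model_lock:
        return await asyncio.to_thread(
            llm.create_completion, req.prompt, max_tokens=req.max_tokens
        )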

@malik-787

@sergey-zinchenko can you provide more details, like what changes you made and how to adopt them in our llama-cpp-python?

@sergey-zinchenko

@sergey-zinchenko can you provide more details, like what changes you made and how to adopt them in our llama-cpp-python?

#1550

@sergey-zinchenko

@malik-787 In short, I added a global async lock mechanism to handle requests one by one, while limiting the maximum number of waiting requests at the uvicorn level. The server stops crashing and no longer cancels ongoing inference; in my PR, incoming requests simply wait for the ongoing one to finish. IMHO this approach is better for multi-user scenarios and for k8s deployment.
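
A hedged sketch of the uvicorn part of that description (the module name "main", the port, and the limit are illustrative, not taken from the PR): uvicorn's limit_concurrency option returns HTTP 503 once too many requests are in flight, so requests queued behind the model lock cannot pile up without bound.

import uvicorn

if __name__ == "__main__":
    # "main:app" would be the ASGI app that guards the model with the async lock.
    uvicorn.run("main:app", host="0.0.0.0", port=8000, limit_concurrency=8)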
