
Concurrent request handling #1062

Open
khanjandharaiya opened this issue Jan 4, 2024 · 6 comments
Labels
question Further information is requested

Comments

@khanjandharaiya

Hey there!! 🙏

I am currently working on a project that sends requests to the model through a Flask API, and when users send requests concurrently the model is not able to handle them. Is there any way I can handle multiple concurrent requests to the model and serve multiple users at the same time?

Please help! @abetlen

@AayushSameerShah

AayushSameerShah commented Jan 4, 2024

I am not sure if your case is similar, but I am facing the same issue:

  1. I have created a Flask API endpoint /request to which the user makes a POST request.
  2. Based on the information received from the user, I build the prompt and the model returns a result.
  3. I have set threaded=True in Flask's app.run(threaded=True), which means each new request can be processed in its own thread.
  4. But with 2 concurrent users the server crashes, because it tries to load 2 models at once and that doesn't work.

I am probably looking for the same solution; I hope we find one. (A minimal sketch of the shared-model-plus-lock workaround is below.)
Thanks for opening the issue 🙏🏻


PS: I was also looking for #771, #897 👀
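
A minimal sketch of that workaround (not from this thread; the model path and endpoint are examples, assuming llama-cpp-python's Llama class and Flask): load the model once at startup and guard every generation with a threading.Lock, so threaded=True can accept requests in parallel without loading a second model or running two generations at the same time.

import threading

from flask import Flask, jsonify, request
from llama_cpp import Llama

app = Flask(__name__)

# Load a single shared model instance once; never construct one per request.
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")  # example path
llm_lock = threading.Lock()

@app.route("/request", methods=["POST"])
def handle_request():
    prompt = request.json["prompt"]
    # Only one thread may run inference at a time; other requests wait here.
    with llm_lock:
        result = llm.create_completion(prompt, max_tokens=256)
    return jsonify(result)

if __name__ == "__main__":
    # threaded=True still helps: waiting requests queue instead of being refused.
    app.run(threaded=True)

Requests are served one after another, so latency grows with the queue, but the process only ever holds one copy of the model in memory.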

This library also provides a server:

interrupt_requests: bool = Field(
default=True,
description="Whether to interrupt requests when a new request is received.",
)

Perhaps that will help.
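
For reference, a hedged sketch of trying the bundled server out. The model path is an example, and it is an assumption that the CLI exposes one flag per field of the settings model shown above, so interrupt_requests would map to --interrupt_requests.

import subprocess

# Launch the bundled OpenAI-compatible server; with request interruption
# disabled, an ongoing generation is not cancelled when a new request arrives.
subprocess.run([
    "python", "-m", "llama_cpp.server",
    "--model", "./models/llama-2-7b.Q4_K_M.gguf",  # example path
    "--interrupt_requests", "False",               # assumed flag name, mirroring the field above
])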

@littlebai3618

If the hardware's compute is insufficient, the benefit of parallel inference is low. I implemented simple parallel inference on top of this project and tested it on a V100S; with a concurrency of 2, it was no more efficient than serving a single request at a time.

Supporting parallel inference (batch processing) is a very complex task, involving issues such as the KV cache and logits. As an alternative, you can use the api_like_OAI.py provided by llama.cpp. That service supports parallel inference, although per-request performance is slightly lower during parallel execution.

@abetlen abetlen added the question Further information is requested label Jan 12, 2024
@sergey-zinchenko

Hi! I just made such a solution for myself. Here is the code: https://github.com/sergey-zinchenko/llama-cpp-python/tree/model_lock_per_request

I introduced async locking around all model access for all kinds of requests, streaming and non-streaming. All requests are handled one by one, so it is not truly concurrent, but at least the server no longer crashes or interrupts the request it is currently handling.
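
A minimal sketch of that idea (this is not the linked patch; FastAPI, the endpoint, and the model path are illustrative): a single asyncio.Lock serializes every generation, and the blocking call runs off the event loop so the server stays responsive while requests wait their turn.

import asyncio

from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")  # loaded once, shared
model_lock = asyncio.Lock()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    # Requests queue on the lock and run one at a time instead of racing on the model.
    async with model_lock:
        return await asyncio.to_thread(
            llm.create_completion, req.prompt, max_tokens=req.max_tokens
        )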

@malik-787

@sergey-zinchenko can you provide more details, like what changes you made and how to adopt them in our llama-cpp-python?

@sergey-zinchenko

@sergey-zinchenko can you provide more details, like what changes you made and how to adopt them in our llama-cpp-python?

#1550

@sergey-zinchenko

@malik-787 In short, I added a global async lock mechanism to handle requests one by one, while limiting the maximum number of waiting requests at the uvicorn level. The server stops crashing and no longer cancels ongoing inference; in my PR, incoming requests simply wait for the ongoing one to finish. IMHO this approach is better for multi-user scenarios and for k8s deployment.
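
A hedged sketch of the uvicorn part of that description (the module name "main", the port, and the limit are illustrative, not taken from the PR): uvicorn's limit_concurrency option returns HTTP 503 once too many requests are in flight, so requests queued behind the model lock cannot pile up without bound.

import uvicorn

if __name__ == "__main__":
    # "main:app" would be the ASGI app that guards the model with the async lock.
    uvicorn.run("main:app", host="0.0.0.0", port=8000, limit_concurrency=8)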
