
Change server approach to handle parallel requests #1550

Open · wants to merge 2 commits into main
Conversation

sergey-zinchenko

I have changed the way the server handles concurrent requests. With this PR, arriving requests wait on the model's global async lock, so they are effectively organized into a queue. On top of that, I added a uvicorn configuration that allows at most ten concurrent requests. So up to ten parallel requests will wait "in a queue" for the model lock, and the request currently being processed will not be interrupted. If an eleventh request arrives, the server immediately responds with 503. This approach suits the common scenarios of a multi-user chatbot UI and API access.
I also changed a few other things to fix PEP warnings reported by the linter in my IDE.
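
Roughly, the idea looks like the sketch below. This is illustrative only, not the actual diff in this PR; the endpoint, the lock helper, and the placeholder response are made up:

```python
import asyncio
from typing import Optional

import uvicorn
from fastapi import FastAPI

app = FastAPI()

_model_lock: Optional[asyncio.Lock] = None


def get_model_lock() -> asyncio.Lock:
    # Create the lock lazily inside the running event loop; on Python < 3.10
    # a lock created at import time binds to whatever loop exists at that
    # moment, which may not be the loop uvicorn serves requests on.
    global _model_lock
    if _model_lock is None:
        _model_lock = asyncio.Lock()
    return _model_lock


@app.post("/v1/chat/completions")
async def create_chat_completion(body: dict):
    # Requests line up here in arrival order; only one at a time reaches
    # the model, so an in-flight generation is never interrupted.
    async with get_model_lock():
        # the real server would run the llama.cpp model here
        return {"choices": []}


if __name__ == "__main__":
    # limit_concurrency caps in-flight requests at 10; uvicorn answers the
    # eleventh concurrent request with 503 Service Unavailable before it
    # ever reaches the endpoint.
    uvicorn.run(app, host="0.0.0.0", port=8000, limit_concurrency=10)
```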

sergey-zinchenko changed the title from "Change server approach to handle parallel request" to "Change server approach to handle parallel requests" on Jun 24, 2024
@sergey-zinchenko
Author

@abetlen What do you think about these changes?

@gerdemann

Hey, thanks for this PR. Is it possible to get it merged? 😄

@sergey-zinchenko
Author

@gerdemann @Smartappli Hi! I authored this PR two months ago, and it looks like it has some conflicts now. I can fix them today if somebody can merge it right afterwards.

@sergey-zinchenko
Author

@gerdemann @Smartappli I also see some activity on the main branch over the last two months related to how the server handles parallel requests. Is this still an issue?

@gerdemann

gerdemann commented Aug 19, 2024

I still get this error when two requests are made at the same time:

disconnected
Disconnected from client (via refresh/close) Address(host='10.32.20.82', port=58506)
ERROR:    ASGI callable returned without completing response.
Llama.generate: 64 prefix-match hit, remaining 45 prompt tokens to eval

I tried to install your branch directly and test it, but I get this error:

Exception: Task <Task pending name='Task-7' coro=<RequestResponseCycle.run_asgi() running at /llama/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py:406> cb=[set.discard()]> got Future <Future pending> attached to a different loop
Traceback (most recent call last):
  File "/llama/lib/python3.9/site-packages/llama_cpp/server/errors.py", line 170, in custom_route_handler
    response = await original_route_handler(request)
  File "/llama/lib/python3.9/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/llama/lib/python3.9/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/llama/lib/python3.9/site-packages/llama_cpp/server/app.py", line 482, in create_chat_completion
    return await handle_completion_request(request, body,
  File "/llama/lib/python3.9/site-packages/llama_cpp/server/app.py", line 250, in handle_completion_request
    async for response in completion_iter:
  File "/llama/lib/python3.9/site-packages/llama_cpp/server/app.py", line 203, in completion_async_generator
    async with llama_proxy_context_manager as llama_proxy:
  File "/llama/lib/python3.9/site-packages/llama_cpp/server/app.py", line 79, in __aenter__
    await self._lock.acquire()
  File "/llama/lib/python3.9/asyncio/locks.py", line 120, in acquire
    await fut
RuntimeError: Task <Task pending name='Task-7' coro=<RequestResponseCycle.run_asgi() running at /llama/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py:406> cb=[set.discard()]> got Future <Future pending> attached to a different loop
INFO:     10.32.20.82:37322 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

Do you have any idea what I am doing wrong?

@Fu-Cheng

RuntimeError: Task <Task pending name='Task-7' coro=<RequestResponseCycle.run_asgi() running at /llama/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py:406> cb=[set.discard()]> got Future attached to a different loop
INFO: 10.32.20.82:37322 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

@sergey-zinchenko,

Hi, I encountered the same issue. The service is still not handling concurrent requests properly. When I send a second request while the LLM is still generating a response for the first request, I receive this error.
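
The RuntimeError above is the classic symptom of an asyncio.Lock created outside the loop that uvicorn actually serves requests on: on Python 3.9, asyncio.Lock() binds to get_event_loop() at construction time, and a contended acquire then awaits a Future belonging to that original loop. That would also explain why the error only appears on a second, concurrent request. Below is a minimal sketch of that failure mode; this is an assumption about the cause, not a confirmed reading of the branch, and the endpoint is made up:

```python
import asyncio

import uvicorn
from fastapi import FastAPI

app = FastAPI()

# On Python 3.9 this lock binds to the default event loop that exists at
# import time, not the loop uvicorn later runs the application on.
lock = asyncio.Lock()


@app.get("/generate")
async def generate():
    # The first request acquires the uncontended lock without awaiting
    # anything, so it works. A second, concurrent request has to await a
    # Future created on the old loop, which raises
    # "Task ... got Future ... attached to a different loop".
    async with lock:
        await asyncio.sleep(5)  # stand-in for a long generation
    return {"ok": True}


if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
```

Creating the lock lazily inside the running loop (or running on Python 3.10+, where asyncio primitives bind to the running loop on first use) avoids this particular error.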
