diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index 2e71aed..211d789 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -20,7 +20,7 @@
 - [llama.cpp](./deployments/llama.cpp/README.md)
   - [Installing on AWS EC2 with CUDA](./deployments/llama.cpp/aws-ec2-cuda/README.md)
   - [Installing with AWS Image Builder](./deployments/llama.cpp/aws-image-builder/README.md)
-  - [Kubernetes]()
+  - [Kubernetes]()
 - [Ollama](./deployments/ollama/README.md)
 - [Paddler](./deployments/paddler/README.md)
 - [VLLM]()
@@ -40,7 +40,7 @@
   - [Long-Running](./application-layer/architecture/long-running/README.md)
   - [Serverless]()
 - [Optimization]()
-  - [Asynchronous Programming]()
+  - [Asynchronous Programming](./application-layer/optimization/asynchronous-programming/README.md)
   - [Input/Output Bottlenecks]()
 - [Tutorials]()
   - [LLM WebSocket chat with llama.cpp]()
diff --git a/src/application-layer/README.md b/src/application-layer/README.md
index 349c75f..b3d809a 100644
--- a/src/application-layer/README.md
+++ b/src/application-layer/README.md
@@ -6,7 +6,7 @@ Those applications have to deal with some issues that are not typically met in t
 Up until [Large Language Models](/general-concepts/large-language-model) became mainstream and in demand by a variety of applications, the issue of dealing with long-running requests was much less prevalent. Typically, due to functional requirements, the microservice requests would take 10ms or less, while waiting for a [Large Language Model](/general-concepts/large-language-model) to complete the inference can take multiple seconds.
 
-That calls for some adjustments in the application architecture, non-blocking [Input/Output](/general-concepts/input-output) and asynchronous programming.
+That calls for some adjustments in the application architecture: non-blocking [Input/Output](/general-concepts/input-output) and [asynchronous programming](/application-layer/optimization/asynchronous-programming).
 
 This is where asynchronous programming languages shine, like Python with its `asyncio` library or Rust with its `tokio` library, Go with its goroutines, etc.
diff --git a/src/application-layer/optimization/asynchronous-programming/README.md b/src/application-layer/optimization/asynchronous-programming/README.md
new file mode 100644
index 0000000..80217e3
--- /dev/null
+++ b/src/application-layer/optimization/asynchronous-programming/README.md
@@ -0,0 +1,31 @@
+# Asynchronous Programming
+
+By asynchronous programming, we mean the ability to execute multiple tasks concurrently without blocking the main thread; that does not necessarily involve threads or processes. A good example is the JavaScript execution model, which is single-threaded by default, yet asynchronous. It does not offer parallelism (without worker threads), but it can still issue concurrent network requests, database queries, etc.
+
+Considering that most bottlenecks when working with Large Language Models stem from [Input/Output](/general-concepts/input-output) issues (primarily LLM API response times and the time it takes to generate all the completion tokens) rather than from the CPU itself, asynchronous programming techniques are often a necessity when architecting such applications.
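+
+For example, even without a dedicated async runtime, PHP can overlap several slow network requests using the bundled `curl_multi` API. The following is a minimal sketch; the endpoint URLs are hypothetical placeholders for slow LLM API calls:
+
+```php
+<?php
+
+// Hypothetical LLM API endpoints; each may take many seconds to respond.
+$urls = [
+    'https://llm-api.example.com/v1/completions',
+    'https://llm-api.example.com/v1/embeddings',
+];
+
+$multiHandle = curl_multi_init();
+$handles = [];
+
+foreach ($urls as $url) {
+    $handle = curl_init($url);
+    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
+    curl_multi_add_handle($multiHandle, $handle);
+    $handles[] = $handle;
+}
+
+// Drive all transfers concurrently; the total wall time is close to the
+// slowest single request rather than the sum of all of them.
+do {
+    $status = curl_multi_exec($multiHandle, $stillRunning);
+
+    if ($stillRunning) {
+        // Sleep until there is activity on any transfer instead of busy-looping.
+        curl_multi_select($multiHandle);
+    }
+} while ($stillRunning && $status === CURLM_OK);
+
+foreach ($handles as $handle) {
+    $response = curl_multi_getcontent($handle);
+    // ... process $response ...
+    curl_multi_remove_handle($multiHandle, $handle);
+    curl_close($handle);
+}
+
+curl_multi_close($multiHandle);
+```
+
+Async runtimes like `Swoole` wrap the same idea in coroutines, but the underlying principle is identical: wait on many sockets at once instead of one after another.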
+
+When it comes to network requests, large language models pose a different challenge than most web applications. While most [REST](https://en.wikipedia.org/wiki/REST) APIs tend to have consistent response times below 100ms, the response times of large language model web APIs can easily reach 20-30 seconds before all the requested tokens are generated and streamed.
+
+## Affected Runtimes
+
+Scripting languages like PHP and Ruby are affected the most because they are synchronous by default. That is especially cumbersome with PHP, which is commonly hosted through an [FPM](https://www.php.net/manual/en/install.fpm.php) pool of workers. For example, Debian's default pool has five workers. If each of them is busy handling a 30-second request, a sixth request has to wait for one of them to finish. That means you can easily run into a situation where your server's CPU is idling, yet it cannot accept more requests.
+
+## Coroutines and Promises to the Rescue
+
+To mitigate the issue, you can use any programming language with async support, which usually manifests as Promises (Futures) or Coroutines. That includes JavaScript, Golang, Python (with `asyncio`), and PHP (with `Swoole`).
+
+## Preemptive vs Cooperative Scheduling
+
+It is also important to understand how async tasks are scheduled. Although preemptiveness is primarily a threading concern, it also plays a role in the scheduling of promises and coroutines. For example, PHP natively implements [Fibers](https://www.php.net/manual/en/language.fibers.php), which grant it some degree of asynchronicity, but they are cooperatively scheduled, not preemptive. This means that if you do something silly in your code, for example, run a loop like the one sketched below that never suspends its fiber:
+
+```php
+<?php
+
+$fiber = new Fiber(function (): void {
+    // This loop never calls Fiber::suspend(), so control is never
+    // handed back to the code that started the fiber.
+    while (true) {
+        usleep(1000);
+    }
+});
+
+$fiber->start(); // Never returns; nothing can preempt the fiber.
+```
+
+then nothing will ever interrupt that loop: `$fiber->start()` never returns, and the whole process hangs.
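+
+For contrast, here is a minimal sketch of the cooperative alternative, assuming PHP 8.1+ where Fibers are available. The fiber explicitly suspends itself, so the caller regains control between steps:
+
+```php
+<?php
+
+$fiber = new Fiber(function (): void {
+    for ($step = 1; $step <= 3; $step++) {
+        echo "fiber step {$step}\n";
+
+        // Cooperatively hand control back to the caller.
+        Fiber::suspend();
+    }
+});
+
+$fiber->start(); // Runs the fiber until its first suspend.
+
+while (!$fiber->isTerminated()) {
+    echo "caller doing other work\n";
+
+    // Runs the fiber until its next suspend, or until it finishes.
+    $fiber->resume();
+}
+```
+
+This is the contract of cooperative scheduling: progress happens only because each side voluntarily yields, which is what runtimes like `Swoole` automate around I/O operations.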