TooManyCooks is a runtime for C++20 coroutines. Its objectives are:
- seamless intermingling of CPU-bound, I/O-bound, and heterogeneous (GPU/TPU/NPU, etc.) execution in the same code path
- maximum performance
- minimum boilerplate and clean interface
- simple upgrade path for existing libraries
It provides:
- a blazing fast lock-free work-stealing thread pool (`ex_cpu`) that supports both coroutines and regular functors
- automatic, hardware-optimized thread configuration via hwloc
- a global executor instance so you can submit work from anywhere
- support for multiple priority levels
- building blocks:
  - `task<Result>` is TMC's native lazy coroutine type
  - tasks can spawn child tasks, which may be:
    - awaited immediately
    - eagerly spawned, and lazily awaited
    - eagerly spawned, and not awaited (detached)
  - the prior submit/await operations can also all be done in bulk
  - `ex_braid` is an async mutex / serializing executor
  - `post_waitable()` for an external thread to wait (block) on thread pool work
- convenience functions:
  - `yield()` / `yield_if_requested()` to implement fiber-like cooperative multitasking based on priority
  - `resume_on()` to move the coroutine to a different executor, as either a free function or an awaitable customization
  - `async_main()` quickstart function
Integrations with other libraries:
- Asio (via tmc-asio) - provides network I/O, file I/O, and timers
TooManyCooks is a header-only library. You can either include the specific headers that you need in each file, or `#include "tmc/all_headers.hpp"`, which contains all of the other headers.
In order to reduce compile times, some files have separated declarations and definitions. The definitions will appear in whichever compilation unit defines `TMC_IMPL`. Since each function must be defined exactly once, you must define `TMC_IMPL` in exactly one compilation unit. The simplest way to accomplish this is to put it in your `main.cpp`:

    #define TMC_IMPL
    #include "tmc/all_headers.hpp"
    int main() {
      return tmc::async_main([]() -> tmc::task<int> {
        // Hello, world!
        co_return 0;
      }());
    }
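The `TMC_IMPL` requirement follows a common idiom for header-only libraries that split declarations from definitions. The sketch below collapses the idiom into a single file so that it is self-contained. `MYLIB_IMPL`, `mylib.hpp`, and `mylib::answer` are hypothetical names chosen for illustration and are not part of TMC.

```cpp
#include <cassert>

// In exactly one .cpp file, the implementation macro is defined before the
// include. MYLIB_IMPL is a hypothetical stand-in for a macro like TMC_IMPL.
#define MYLIB_IMPL

// ---- contents of a hypothetical "mylib.hpp" ----
#ifndef MYLIB_HPP
#define MYLIB_HPP

namespace mylib {
int answer(); // declaration: seen by every file that includes the header
} // namespace mylib

#ifdef MYLIB_IMPL
namespace mylib {
// definition: compiled only in the file that defined MYLIB_IMPL
int answer() { return 42; }
} // namespace mylib
#endif

#endif // MYLIB_HPP
// ---- end of "mylib.hpp" ----

// Any file that includes the header can call the function.
inline void self_check() { assert(mylib::answer() == 42); }
```

If no compilation unit defines the macro, the linker reports an undefined symbol. If two of them define it, the linker reports a duplicate definition. That is why the macro belongs in exactly one place.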
https://github.com/tzcnt/tmc-examples
In order to keep this repository bloat-free, the examples are in a separate repository. The examples CMake config will automatically download this repository, and other TMC ecosystem projects, as dependencies.
TooManyCooks supports the following configuration parameters, supplied as preprocessor definitions:
- `TMC_USE_HWLOC` (default `OFF`) enables hwloc integration, allowing TMC to automatically create optimized thread layouts and work-stealing groups. This requires that you add the directory containing `hwloc.h` to your include path, and the `hwloc` library path to your linker path. It is highly recommended to use this.
- `TMC_PRIORITY_COUNT=` (default unset) allows you to set the number of priority levels at compile time, rather than at runtime. The main use case for this is to set the value to 1, which will remove all priority-specific code, making things slightly faster.
- `TMC_WORK_ITEM=` (default `CORO`) controls the type used to store work items in the work-stealing queue. Any of these types can store either a coroutine or a functor, but the performance characteristics are different. There are 4 options:
Value | Type | sizeof(type) | Comments |
---|---|---|---|
CORO | std::coroutine_handle<> | 8 | Functors will be wrapped in a coroutine trampoline. |
FUNC | std::function<void()> | 32 | Coroutines will be stored inline using small buffer optimization. This has substantially worse performance than coro_functor when used for coroutines. |
FUNCORO | tmc::coro_functor | 16 | Stores either a coroutine or a functor using pointer tagging. Does not support small-object optimization. Supports move-only functors, or references to functors. Typed deleter is implemented with a shim. |
FUNCORO32 | tmc::coro_functor32 | 32 | Stores either a coroutine or a functor using pointer tagging. Does not support small-object optimization. Supports move-only functors, or references to functors. |
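The `FUNCORO` rows mention pointer tagging. As a rough illustration of the general technique: pointers to suitably aligned objects always have zero low bits, so one of those bits can record which of two kinds of object is stored. `TaggedPtr`, `Kind`, and `round_trip` are hypothetical names for this sketch; it makes no claim about how `tmc::coro_functor` is actually laid out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration of pointer tagging (not tmc::coro_functor).
enum class Kind : std::uintptr_t { Coroutine = 0, Functor = 1 };

class TaggedPtr {
  std::uintptr_t bits_ = 0;
  static constexpr std::uintptr_t kTagMask = 1;

public:
  TaggedPtr(void* p, Kind k) {
    auto raw = reinterpret_cast<std::uintptr_t>(p);
    // The pointee must be at least 2-byte aligned so the low bit is unused.
    assert((raw & kTagMask) == 0);
    bits_ = raw | static_cast<std::uintptr_t>(k);
  }

  // Reads the tag from the low bit.
  Kind kind() const { return static_cast<Kind>(bits_ & kTagMask); }

  // Clears the tag bit to recover the original pointer.
  void* pointer() const { return reinterpret_cast<void*>(bits_ & ~kTagMask); }
};

// Stores a tagged pointer to an aligned int, then recovers both parts.
inline bool round_trip() {
  alignas(8) static int value = 7;
  TaggedPtr t(&value, Kind::Functor);
  return t.kind() == Kind::Functor && t.pointer() == &value;
}
```

The tag costs no extra storage, which is the appeal of the technique for a work item that must stay small. The trade-off is that every access has to mask the tag off before the pointer is used.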
Planned features:
- documentation
- cancellation
- simultaneously await multiple awaitables with different types
- algorithms that depend on the prior 2 (select)
Planned integrations:
- CUDA (tmc-cuda) - a CUDA Graph can be made into an awaitable by adding a callback to the end of the graph with `cudaGraphAddHostNode`, which will resume the awaiting coroutine
- gRPC (tmc-grpc) - via the callback interface if it is sufficiently stable and well documented; otherwise via the completion queue thread
- blosc2 (tmc-blosc2) - port to C++. Use tmc-asio + io_uring for file I/O, and `ex_cpu` to replace the built-in pthreads. Break down operations into smaller vertical slices to exploit dynamic parallelism.
Supported compilers:
Linux:
- Clang 17 or newer
- GCC 12.3 or newer
Windows:
- Clang 17 or newer (via clang-cl.exe)
Clang 16 will compile TMC, and things mostly work; however, there are a number of subtle coroutine code generation issues, such as llvm/llvm-project#63022, which were only fixed in Clang 17.
MSVC on Windows currently compiles TMC, but crashes at runtime, likely due to this code generation bug.
Supported hardware:
- x86_64 with support for POPCNT / TZCNT
- AArch64
TooManyCooks has been tested on the following physical devices:
- Intel i7 4770k
- AMD Ryzen 5950X
- AMD EPYC 7742
- Rockchip RK3588S (in a Khadas Edge2)
TooManyCooks has not been tested on an M1+ Mac, or any Intel Hybrid (12th gen Core or newer) architecture. These platforms represent unique optimization challenges, and I am interested in purchasing one of these parts, for the right price. Contact me if you want to support the project :)