Description
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Use Cases
Vector exposes per-component utilization and memory allocation metrics labeled by component ID. However, CPU profiling tools like perf and bpftrace can only distinguish code paths by component type (e.g. remap, log_to_metric), so they cannot determine which specific component is a bottleneck.
Attempted Solutions
1. Uprobe on on_enter() + on_exit(): A tool like bpftrace can record uprobe events annotated with runtime values (in this case, component IDs) and use them to attribute each moment of execution to whichever component is active. However, each uprobe fires a kernel trap per Tokio poll, adding ~4-6% total overhead.
2. Uprobe on on_enter() + on_close(): Unlike on_enter() and on_exit(), which fire once per Tokio poll, on_close() fires only once at the end of an object's lifetime. This reduces total overhead to ~2-3%, but attributes Tokio scheduling time between tasks to the preceding component - although this misattribution appears to be negligible.
3. Shared BPF map: Maintain a map in bpftrace and update it from Vector via bpf_map_update_elem() on each event. While this avoids the overhead of a kernel trap, each update still involves a relatively slow syscall.
4. Fixed-length in-memory storage: Store the currently active component_id per thread in a fixed-length in-memory array, which bpftrace can poll frequently. This requires only one trap (at component registration), yielding far fewer probes and negligible total overhead. A fixed-length array is needed because bpftrace's bpf_probe_read_user can only read from an address known ahead of time. Since we don't know the exact number of Tokio threads (i.e. the required array size), we must either allocate a comfortably large array (e.g. 2^22 entries) or index by thread_id modulo the array size, since collisions are unlikely. The consequences of a collision are not severe, but this is not ideal.
5. Thread-local component_id storage + bpftrace map: Rather than using a fixed-length array or a map in Vector, have each thread store a single atomic value, register its address with bpftrace during initialization, and maintain the map in bpftrace.
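Option 5 can be sketched roughly as follows. All names here are hypothetical (not Vector's actual API): each thread owns one atomic slot holding the ID of the component it is currently polling, the slot's address is announced to bpftrace exactly once per thread via a uprobe on a registration function, and every subsequent update is a plain atomic store with no trap or syscall - bpftrace reads the slot directly with bpf_probe_read_user.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};

thread_local! {
    // One slot per thread: the ID of the component this thread is polling.
    static ACTIVE_COMPONENT: AtomicU64 = AtomicU64::new(0);
    // Whether this thread's slot address has been announced to the tracer.
    static SLOT_REGISTERED: Cell<bool> = Cell::new(false);
}

// Uprobe attach point (hypothetical name): #[no_mangle] + #[inline(never)]
// keep a stable symbol so bpftrace can record (tid -> slot address) in a
// BPF map. This traps into the kernel only once per thread.
#[no_mangle]
#[inline(never)]
pub extern "C" fn register_component_slot(slot_addr: usize) {
    // Intentionally empty: the tracer captures `slot_addr` via the uprobe.
    std::hint::black_box(slot_addr);
}

/// Record which component the current thread is about to poll.
pub fn set_active_component(component_id: u64) {
    ACTIVE_COMPONENT.with(|slot| {
        SLOT_REGISTERED.with(|reg| {
            if !reg.get() {
                register_component_slot(slot as *const AtomicU64 as usize);
                reg.set(true);
            }
        });
        slot.store(component_id, Ordering::Relaxed);
    });
}

/// Read back the current thread's slot (useful for tests/debugging).
pub fn active_component() -> u64 {
    ACTIVE_COMPONENT.with(|slot| slot.load(Ordering::Relaxed))
}
```

Because the hot path is a single relaxed store to thread-local memory, the steady-state cost is negligible, and unlike the fixed-length array there is no indexing by thread ID and therefore no collision risk.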
Reusing an existing tracing layer would minimize the total amount of code changes. For example, AllocationLayer already exposes the on_enter() and on_exit() events. The main downside is that these features are tightly coupled, so reusing it would also incur the overhead of allocation tracking.
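As a sketch of what a dedicated layer's hooks could look like, decoupled from allocation tracking (all names hypothetical), the per-poll enter/exit events only need to be stable, non-inlined symbols for bpftrace to attach uprobes to - e.g. `uprobe:/path/to/vector:component_on_enter { @active[tid] = arg0 }`. Each armed uprobe costs one kernel trap per call, i.e. per Tokio poll, which is where the ~4-6% overhead in approach 1 comes from.

```rust
use std::cell::Cell;

thread_local! {
    // The component the current thread is polling (0 = none).
    static CURRENT: Cell<u64> = Cell::new(0);
}

// Hypothetical per-poll hooks in the style of a tracing layer's
// on_enter()/on_exit(). #[no_mangle] + #[inline(never)] keep stable,
// probe-able symbols in the binary.
#[no_mangle]
#[inline(never)]
pub extern "C" fn component_on_enter(component_id: u64) {
    CURRENT.with(|c| c.set(component_id));
}

#[no_mangle]
#[inline(never)]
pub extern "C" fn component_on_exit(_component_id: u64) {
    CURRENT.with(|c| c.set(0));
}

/// Read the current thread's active component (for tests/debugging).
pub fn current_component() -> u64 {
    CURRENT.with(|c| c.get())
}
```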
Proposal
Option 5 (thread-local component_id storage + bpftrace map) seems to be the best approach due to its low overhead and lack of collision risk.
A separate tracing layer is preferred since the overhead from enabling allocation tracing is high.
PR: #24860
References
No response
Version
No response