Improve startup time #1408
tavianator added a commit to tavianator/fd that referenced this issue on Oct 30, 2023:

> We originally switched to bounded channels for backpressure to fix sharkdp#918. However, bounded channels have a significant initialization overhead as they pre-allocate a fixed-size buffer for the messages.
>
> This implementation uses a different backpressure strategy: each thread gets a limited-size pool of `WorkerResult`s. When the size limit is hit, the sender thread has to wait for the receiver thread to handle a result from that pool and recycle it.
>
> Inspired by [snmalloc], results are recycled by sending the boxed result over a channel back to the thread that allocated it. By allocating and freeing each `WorkerResult` from the same thread, allocator contention is reduced dramatically. And since we now pass results by pointer instead of by value, message passing overhead is reduced as well.
>
> Fixes sharkdp#1408.
>
> [snmalloc]: https://github.com/microsoft/snmalloc
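The recycling scheme described in the commit message could be sketched roughly like this, using only `std::sync::mpsc` unbounded channels. All names here (`WorkerResult`'s field, `POOL_SIZE`, `run`) are illustrative stand-ins, not fd's actual implementation: the worker allocates up to a fixed number of boxed results, and once the pool is exhausted it blocks until the receiver hands a box back over a second channel.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for fd's WorkerResult.
struct WorkerResult {
    path: String,
}

// Per-thread pool limit (illustrative; the real limit is larger).
const POOL_SIZE: usize = 4;

// Runs one worker/receiver pair; returns how many results the receiver saw.
fn run(total: usize) -> usize {
    // Results flow worker -> receiver; empty boxes flow back receiver -> worker.
    let (result_tx, result_rx) = mpsc::channel::<Box<WorkerResult>>();
    let (recycle_tx, recycle_rx) = mpsc::channel::<Box<WorkerResult>>();

    let worker = thread::spawn(move || {
        let mut allocated = 0;
        for i in 0..total {
            let mut result = if allocated < POOL_SIZE {
                // Pool not exhausted yet: allocate a fresh box on this thread.
                allocated += 1;
                Box::new(WorkerResult { path: String::new() })
            } else {
                // Backpressure: block until the receiver recycles a box that
                // this thread originally allocated.
                recycle_rx.recv().unwrap()
            };
            result.path = format!("entry-{i}");
            result_tx.send(result).unwrap();
        }
        // result_tx is dropped here, ending the receiver loop below.
    });

    let mut seen = 0;
    for result in result_rx {
        seen += 1;
        // Hand the box back; ignore the error if the worker already exited.
        let _ = recycle_tx.send(result);
    }
    worker.join().unwrap();
    seen
}

fn main() {
    println!("received {} results", run(10));
}
```

Because the box always returns to the thread that allocated it, malloc and free for a given result happen on the same thread, which is what avoids the allocator contention mentioned above.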
That might be where `crossbeam_channel` initializes the memory for the channel.
`fd`'s startup time is quite slow. On my 12-core system, it takes ~20 ms for "searching" an empty folder. This is fast enough not to be noticeable by humans, but it looks bad in benchmarks when comparing `fd` with other tools on small folders[^1]. And it's also an actual problem for use cases where `fd` is called repeatedly from a script.

Some of that overhead is caused by the spawning of threads, and that problem is already tracked in #1203. But I think there is more that can be done. Instead of using my usual go-to performance tool (`perf`), let's look at the magic-trace output of an `fd` call in an empty folder[^2]. If someone is interested, I've attached the trace to this post. Go to https://magic-trace.org/ to load it in their viewer.

The full trace looks like this:
The first 2.2 ms are typical process startup things (before `main`). I don't think there is any room for optimization here (?).

The next ~2 ms are more interesting:
Some notable steps (even if insignificant in time):

- `isatty` check (5 µs)
- `LsColors::from_env` (579 µs)
(579 µs)Some things were surprising to me. I didn't expect the
get_num_cpus
call to take this long. There might be some room for improvement here by doing things in parallel (e.g.LsColors::from_env
)? But only if the thread overhead is not too high.Then we start the actual scan, which takes the majority of the time:
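As an aside, the "do startup things in parallel" idea above could be sketched as follows. This is a hedged sketch with only hypothetical names: `parse_ls_colors` stands in for `LsColors::from_env`, and `other_startup_work` for the rest of the serial init; whether it pays off depends on the thread-spawn overhead being smaller than the work it hides.

```rust
use std::thread;

// Hypothetical stand-in for `LsColors::from_env` (~579 µs in the trace):
// count the rules in the LS_COLORS environment variable.
fn parse_ls_colors() -> usize {
    std::env::var("LS_COLORS")
        .map(|v| v.split(':').filter(|s| !s.is_empty()).count())
        .unwrap_or(0)
}

// Hypothetical stand-in for the rest of the serial startup work.
fn other_startup_work() -> u32 {
    42
}

// Overlap the two pieces of init work: only a win if the thread spawn
// cost (typically tens of µs) is below the work being hidden behind it.
fn startup() -> (usize, u32) {
    let colors = thread::spawn(parse_ls_colors);
    let other = other_startup_work();
    (colors.join().unwrap(), other)
}

fn main() {
    let (rules, other) = startup();
    println!("LS_COLORS rules: {rules}, other work: {other}");
}
```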
Here, I'm not so sure how to interpret the trace, as things are actually happening on multiple threads. But we can (presumably) see some of the thread spawning/joining time here (~ 5 ms):
and some gitignore matcher logic going on here (370 µs total):
Most of the time is actually unaccounted for in the trace, because I can only see:
We can see a bit more when switching off LTO:
Apparently, 11 ms are spent in `crossbeam_channel::channel::bounded`'s `from_iter` method? (probably the receive call?) — even though we don't have any work to do. On a `-j1` run, this part only takes 1 ms.

Footnotes
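This initialization cost can be seen directly by timing just the creation of a bounded channel. The sketch below uses `std::sync::mpsc::sync_channel` as a stand-in for `crossbeam_channel::bounded` (since Rust 1.67, `std::sync::mpsc` is a port of crossbeam-channel, so a bounded channel likewise pre-allocates a buffer with one slot per message of capacity); the capacities chosen are arbitrary.

```rust
use std::sync::mpsc;
use std::time::{Duration, Instant};

// Measure how long it takes just to *create* a bounded channel of the
// given capacity. The buffer (one 64-byte slot per message here) is
// allocated and its slot metadata initialized up front, so creation
// time grows with capacity.
fn channel_init_time(cap: usize) -> Duration {
    let start = Instant::now();
    let ch = mpsc::sync_channel::<[u8; 64]>(cap);
    let elapsed = start.elapsed();
    drop(ch);
    elapsed
}

fn main() {
    for cap in [1usize, 1 << 10, 1 << 20] {
        println!("capacity {:>8}: created in {:?}", cap, channel_init_time(cap));
    }
}
```

This is consistent with the trace: the `from_iter` frames are plausibly the channel building its slot buffer at construction time, which is exactly the overhead the recycling-pool approach above avoids.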
[^1]: Those "small" folders can be pretty large, actually. It takes hundreds of thousands of files before we can make up for the startup "penalty".
[^2]: I recently discovered magic-trace and used it successfully to benchmark (and then optimize) the startup time of other programs.