-
-
Notifications
You must be signed in to change notification settings - Fork 835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
walk: Use unbounded channels #1414
Conversation
An extra benchmark to justify closing #1408: $ hyperfine -w2 fd{,-{master,after}}" -u . /tmp/empty"
Benchmark 1: fd -u . /tmp/empty
Time (mean ± σ): 148.9 ms ± 12.1 ms [User: 7.5 ms, System: 142.8 ms]
Range (min … max): 129.4 ms … 165.2 ms 18 runs
Benchmark 2: fd-master -u . /tmp/empty
Time (mean ± σ): 73.8 ms ± 10.0 ms [User: 4.8 ms, System: 72.4 ms]
Range (min … max): 57.3 ms … 83.9 ms 34 runs
Benchmark 3: fd-after -u . /tmp/empty
Time (mean ± σ): 5.2 ms ± 0.7 ms [User: 1.6 ms, System: 7.8 ms]
Range (min … max): 1.8 ms … 7.2 ms 268 runs
Summary
fd-after -u . /tmp/empty ran
14.22 ± 2.71 times faster than fd-master -u . /tmp/empty
28.68 ± 4.50 times faster than fd -u . /tmp/empty |
Wow 🤩 I would like to understand this first before merging. The situation is the following: we have a MPSC scenario where the consumer is sometimes too slow to handle the incoming results (even though its only job is to print the paths to a console). In the past, we used an unbounded channel which would lead to high memory usage because the channel was buffering all those results. Then we switched to a bounded channel, but that came at a high initialization cost because the bounded channels pre-allocate a fixed size buffer for the messages. The messages are In this changeset, you switch back to an unbounded channel, but Don't get me wrong. I love that this works. I'm just a bit puzzled that this isn't a strategy that could be used to speed up bounded channels in general (or maybe it is?). Would another allocator help? I'm not really knowledgeable here but it seems to me like this whole recycling-memory part should/could be the job of a (special-purpose) allocator? |
:)
Exactly. It was actually fairly easy to cause this by stalling the receiver thread with something like
Yeah that times the number of threads: Line 60 in 15329f9
I'm not sure exactly why
Yeah in my benchmarks it's universally faster. But I don't know the separate impacts of
Actually pub struct WorkerMsg<'a> {
inner: Option<ResultBox>,
tx: &'a Sender<ResultBox>,
} but I'd have to plumb the lifetime through more stuff.
Maybe it's possible? Keep in mind this is somewhat different semantics because each thread is strictly limited to its pool of 0x4000
It's possible that https://github.com/microsoft/snmalloc would have some of the same benefits. But here's the thing: this approach needs the receiver thread to tell the sender thread that it has handled a But actually, an SPSC channel scales better than a semaphore! That's because the head/tail index can be on separate cache lines. For a semaphore, both threads are always contending on the same counter. So for that reason I just went with the channel implementation (and because we already had it available; I'd have to go find/write a semaphore otherwise). |
Update: I just tried a hacky version of this and it wasn't appreciably faster |
I suspect this is a pretty big component of it. Unless the worker generates results faster than the receiver can process them, and we get to the maximum amount of allocation. It probably also helps that allocations for the channels now happens on the individual threads instead of in the spawning thread. Looking at the code for crossbeam, it isn't just that we have to allocate the memory for the channel up front, we also have to initialize it, and it can't be done with
I think it could be used in general. Basically implement a bounded channel as two unbounded channels and an atomic counter for the number of items allocated. And have the receiver pass the slot back after reading the value out of it. I'm not sure if that would universally improve performance, but it does have a couple of advantages: memory is allocated lazily, and it would be possible to dynamically change the size of the bound. |
76e8437
to
0fbbaae
Compare
Well it could be done with /// A slot in a channel.
struct Slot<T> {
/// The current stamp.
stamp: AtomicUsize,
/// The message in this slot.
msg: UnsafeCell<MaybeUninit<T>>,
}
impl<T> Slot<T> {
fn new() -> Self {
stamp: AtomicUsize::new(),
msg: UnsafeCell::new(MaybeUninit::zeroed()),
}
}
...
let slot = unsafe { self.buffer.get_unchecked(index) };
let stamp = slot.stamp.load(Ordering::Acquire) + index; And if they did that they could allocate the whole buffer with |
We originally switched to bounded channels for backpressure to fix sharkdp#918. However, bounded channels have a significant initialization overhead as they pre-allocate a fixed-size buffer for the messages. This implementation uses a different backpressure strategy: each thread gets a limited-size pool of WorkerResults. When the size limit is hit, the sender thread has to wait for the receiver thread to handle a result from that pool and recycle it. Inspired by [snmalloc], results are recycled by sending the boxed result over a channel back to the thread that allocated it. By allocating and freeing each WorkerResult from the same thread, allocator contention is reduced dramatically. And since we now pass results by pointer instead of by value, message passing overhead is reduced as well. Fixes sharkdp#1408. [snmalloc]: https://github.com/microsoft/snmalloc
I did some benchmarks comparing current master (8bbbd76) with this branch (d588971). I can confirm the huge increase in startup speed (
Unfortunately, there seems to be a large regression for longer searches with many results (
Edit: a quick |
Indeed... most of my benchmarks are with the pattern tavianator@tachyon $ hyperfine -w2 fd{,-{master,unbounded}}" -u --search-path ~/code/bfs/bench/corpus/chromium"
Benchmark 1: fd -u --search-path ~/code/bfs/bench/corpus/chromium
Time (mean ± σ): 2.570 s ± 0.013 s [User: 10.525 s, System: 99.826 s]
Range (min … max): 2.550 s … 2.590 s 10 runs
Benchmark 2: fd-master -u --search-path ~/code/bfs/bench/corpus/chromium
Time (mean ± σ): 796.4 ms ± 15.2 ms [User: 11590.2 ms, System: 3956.6 ms]
Range (min … max): 773.4 ms … 820.2 ms 10 runs
Benchmark 3: fd-unbounded -u --search-path ~/code/bfs/bench/corpus/chromium
Time (mean ± σ): 1.065 s ± 0.065 s [User: 7.692 s, System: 5.038 s]
Range (min … max): 0.986 s … 1.162 s 10 runs
Summary
fd-master -u --search-path ~/code/bfs/bench/corpus/chromium ran
1.34 ± 0.09 times faster than fd-unbounded -u --search-path ~/code/bfs/bench/corpus/chromium
3.23 ± 0.06 times faster than fd -u --search-path ~/code/bfs/bench/corpus/chromium Let me try a non-hacky version of #1414 (comment) ... |
I just pushed a couple more commits with different strategies, including passing senders by reference and my own little semaphore implementation. Unfortunately, it seems like most of the overhead is actually from the unbounded channel implementation itself, i.e. initialization time is faster but sending is slower. Let me try a different approach. |
I wonder if it would be worthwhile to try to reduce the initialization overhead in crossbeam channel. |
A couple of other ideas: We could reduce the size of the channel, which would speed up initialization, but might hurt performance if the bound is the bottleneck. Instead of having a single channel, we could have each sender thread create its own channel, that it passes back to the main thread via another channel. Then in the receiver we either select from all the channels, or if we spawn multiple receiver threads, each receiver could process a single input channel. |
I've tried this before, usually it's a perf loss
This was the "different approach" I mentioned above. It's much slower. But I just tried another idea: instead of individual
|
Includes #1413. The relevant commit is 5200718:
I benchmarked this with
bfs
's benchmark suite. Bothfd
s were built against a version ofignore
that included BurntSushi/ripgrep@d938e95, so we won't actually see performance quite this good until a newignore
release happens. Still, the results are good enough IMO that this fixes #1408 and #1362.Benchmark results (updated)
Complete traversal
linux v6.5 (86,380 files)
bfs bench/corpus/linux -false
find bench/corpus/linux -false
fd -u '^$' bench/corpus/linux
fd-master -u '^$' bench/corpus/linux
fd-unbounded -u '^$' bench/corpus/linux
rust 1.72.1 (192,714 files)
bfs bench/corpus/rust -false
find bench/corpus/rust -false
fd -u '^$' bench/corpus/rust
fd-master -u '^$' bench/corpus/rust
fd-unbounded -u '^$' bench/corpus/rust
chromium 119.0.6036.2 (2,119,292 files)
bfs bench/corpus/chromium -false
find bench/corpus/chromium -false
fd -u '^$' bench/corpus/chromium
fd-master -u '^$' bench/corpus/chromium
fd-unbounded -u '^$' bench/corpus/chromium
Printing paths
Without colors
linux v6.5
bfs bench/corpus/linux
find bench/corpus/linux
fd -u --search-path bench/corpus/linux
fd-master -u --search-path bench/corpus/linux
fd-unbounded -u --search-path bench/corpus/linux
With colors
linux v6.5
bfs bench/corpus/linux -color
fd -u --search-path bench/corpus/linux --color=always
fd-master -u --search-path bench/corpus/linux --color=always
fd-unbounded -u --search-path bench/corpus/linux --color=always
Parallelism
rust 1.72.1
-j1
bfs -j1 bench/corpus/rust -false
fd -j1 -u '^$' bench/corpus/rust
fd-master -j1 -u '^$' bench/corpus/rust
fd-unbounded -j1 -u '^$' bench/corpus/rust
-j2
bfs -j2 bench/corpus/rust -false
fd -j2 -u '^$' bench/corpus/rust
fd-master -j2 -u '^$' bench/corpus/rust
fd-unbounded -j2 -u '^$' bench/corpus/rust
-j3
bfs -j3 bench/corpus/rust -false
fd -j3 -u '^$' bench/corpus/rust
fd-master -j3 -u '^$' bench/corpus/rust
fd-unbounded -j3 -u '^$' bench/corpus/rust
-j4
bfs -j4 bench/corpus/rust -false
fd -j4 -u '^$' bench/corpus/rust
fd-master -j4 -u '^$' bench/corpus/rust
fd-unbounded -j4 -u '^$' bench/corpus/rust
-j6
bfs -j6 bench/corpus/rust -false
fd -j6 -u '^$' bench/corpus/rust
fd-master -j6 -u '^$' bench/corpus/rust
fd-unbounded -j6 -u '^$' bench/corpus/rust
-j8
bfs -j8 bench/corpus/rust -false
fd -j8 -u '^$' bench/corpus/rust
fd-master -j8 -u '^$' bench/corpus/rust
fd-unbounded -j8 -u '^$' bench/corpus/rust
-j12
bfs -j12 bench/corpus/rust -false
fd -j12 -u '^$' bench/corpus/rust
fd-master -j12 -u '^$' bench/corpus/rust
fd-unbounded -j12 -u '^$' bench/corpus/rust
-j16
bfs -j16 bench/corpus/rust -false
fd -j16 -u '^$' bench/corpus/rust
fd-master -j16 -u '^$' bench/corpus/rust
fd-unbounded -j16 -u '^$' bench/corpus/rust
Details
Versions