fd is much slower when run with multiple threads #1131
Did you build `fd` from source?

Installed from source with cargo.

Did you build it with the release profile?
I got similar results. I've downloaded fd from the Void repositories, which I believe is built under the release profile.

```console
>>> hyperfine -w 5 --prepare 'echo 3 | sudo tee /proc/sys/vm/drop_caches' "fd" "fd -j1" -N
Benchmark 1: fd
  Time (mean ± σ):       4.9 ms ±   0.7 ms    [User: 2.0 ms, System: 3.7 ms]
  Range (min … max):     4.1 ms …   7.0 ms    416 runs

Benchmark 2: fd -j1
  Time (mean ± σ):       3.7 ms ±   0.2 ms    [User: 1.7 ms, System: 2.2 ms]
  Range (min … max):     3.3 ms …   5.0 ms    629 runs

Summary
  'fd -j1' ran
    1.33 ± 0.20 times faster than 'fd'
```
Yes. Full Log

No, you haven't. Yours is only 1.3x slower, whereas for some reason mine is over 100x slower.
I'm guessing this has something to do with WSL. Maybe try:

```console
$ strace -cf fd >/dev/null
$ strace -cf fd -j1 >/dev/null
```
Also, what filesystem is this running on?
Filesystem is ext4.
@aDotInTheVoid Please use perf for performance recordings, as strace has significant overhead.

WSL2 has severe and known performance issues. For example, this one is specific to the filesystem: microsoft/WSL#4197. Threading alone can carry up to a 5x performance penalty: dotnet/runtime#42994. Moreover, WSL2 is a full VM, so you will never get performance close to a native Linux kernel: https://learn.microsoft.com/en-us/windows/wsl/compare-versions
That's why I asked what filesystem was being used. Accessing Windows files over 9p is slow in WSL2, but the OP is accessing Linux files in an ext4 filesystem. Since this is just the regular Linux ext4 implementation, it should be just about as fast as native Linux (except for the actual I/O).
That looks potentially relevant. Thread-local storage performs poorly on WSL2 for some reason.
WSL2 uses the "Virtual Machine Platform", a subset of Hyper-V which is "something like KVM". It should be close to native performance for things that don't need to cross the hypervisor boundary often.
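If thread-local storage is the suspect, a minimal micro-benchmark along these lines (a sketch, not taken from this thread) could compare TLS access cost between WSL2 and native Linux:

```rust
use std::cell::Cell;
use std::time::Instant;

thread_local! {
    // Per-thread counter; every access goes through the TLS machinery.
    static COUNTER: Cell<u64> = Cell::new(0);
}

fn main() {
    let start = Instant::now();
    for _ in 0..1_000_000 {
        COUNTER.with(|c| c.set(c.get() + 1));
    }
    let total = COUNTER.with(|c| c.get());
    // Compare this wall time between native Linux and WSL2.
    println!("{} TLS accesses in {:?}", total, start.elapsed());
}
```

If TLS is the problem, the per-access cost here should differ noticeably between the two environments.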
Indeed. I asked for `strace` above. If you want to try `perf`:

```console
$ perf trace record fd >/dev/null
$ perf trace -i perf.data -s
```

(and the same with `-j1`.)

Actually, now that I think about it, timer resolution might be the issue. Perhaps short sleeps are becoming very long due to imprecision.
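The short-sleep hypothesis can be checked directly with a small sketch (not from this thread): request sub-millisecond sleeps and measure the overshoot, which would be large if the timer resolution is coarse.

```rust
use std::time::{Duration, Instant};

// Sleep for `requested` and return how long the sleep actually took.
fn measure_sleep(requested: Duration) -> Duration {
    let start = Instant::now();
    std::thread::sleep(requested);
    start.elapsed()
}

fn main() {
    for micros in [50, 500, 5_000] {
        let requested = Duration::from_micros(micros);
        let actual = measure_sleep(requested);
        // On a kernel with coarse timers, `actual` can be many
        // multiples of `requested` for the shortest sleeps.
        println!("requested {:?}, actual {:?}", requested, actual);
    }
}
```

`sleep` guarantees at least the requested duration, so only the size of the overshoot is informative.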
I tried to find a simple way to profile sleep times. One way is offcputime from bcc:

```console
# /usr/share/bcc/tools/offcputime -f | grep '^fd' >fd.log &
# fd >/dev/null
# pkill -INT offcputime
```

(and repeat for `fd -j1`.)
I'm hitting this issue as well on FreeBSD 13.1.
I believe the bug I was about to report is actually this bug. When I add the `-j1` flag, the problem goes away. I'm on NixOS unstable running on ZFS root.

@matu3ba & @tavianator, this is NOT a filesystem problem (IMHO). This looks like a heuristics bug regarding non-specification of the number of concurrent jobs to run (per the man page, not specifying it "uses heuristics").

At least I now know a workaround that works!
@pmarreck I'm pretty sure that's not the same bug. I believe this one is specific to WSL. However, I do think limiting the number of threads by default makes sense. I have the 24 core/48 thread Threadripper.
Right, but isn't there something wrong with the "heuristics" mentioned in the man page for not specifying the number of jobs, if simply not specifying how many threads you want results in worse or even worst-case performance on a given CPU architecture? I don't even know if it's a configuration option, because I'd hate to have to specify it for literally every search request. (In my case it's not terribly inconvenient, since I already use a wrapper function around it, but still.)

Whatever "heuristics" are used should be torn out and replaced with a default like "half the number of cores, with a max of 6 or 8", because after that you're probably bottlenecked on either disk I/O or the setup/teardown of the threads anyway.

The reason I think it's related is simply that the given solution (specifying `-j1`) works in my case too.
A flamegraph would be the perfect tool to analyze this: https://www.brendangregg.com/offcpuanalysis.html

In the meantime: since fd provides no shell completions anyway, creating an alias should work around the problem for the time being.
The current "heuristics" is just to use the number of CPU cores, as returned by num_cpus::get.

I think it would make sense to have a maximum on that for the default number of threads, although I'm not sure what the best value would be.
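A capped default along the lines being discussed could look like this sketch (the cap of 8 is an arbitrary illustrative value, not one anyone in this thread settled on, and `available_parallelism` from the standard library stands in for the `num_cpus` crate):

```rust
use std::thread;

/// Hypothetical default: use all available parallelism, but cap it to
/// avoid oversubscription on very wide CPUs. The cap is illustrative.
fn default_thread_count(cap: usize) -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    cores.clamp(1, cap)
}

fn main() {
    println!("default threads: {}", default_thread_count(8));
}
```

On an 8-core laptop this behaves identically to today's default; on a 48-thread Threadripper it would cap the walker at 8 threads unless `-j` overrides it.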
Yes. This article is relevant for that: https://www.codeguru.com/cplusplus/why-too-many-threads-hurts-performance-and-what-to-do-about-it/

For example, mold uses a statically linked Intel TBB for that; here is an overall functionality overview. Does anything like this exist for Rust? A: No; see also this Reddit thread: https://www.reddit.com/r/rust/comments/p0a3mf/async_scheduler_optimised_for_highcompute/. I doubt that the complexity of async is worth it.
@matu3ba This script works over here. The output for me shows about a 43x slowdown, at least with this number of detected CPUs. (I believe it's actually, technically, 64 CPUs and 128 threads, but anyway.)
I would appreciate if we could calm down a bit 😄. The current default was not chosen without reason. It's based on benchmarks on my machine (8 cores; see the disclaimer concerning benchmarks in the README: one particular benchmark on one particular machine). You can see some past benchmark results here or here. Or I can run one right now, on a different machine (12 cores):

```console
hyperfine \
    --parameter-scan threads 1 16 \
    --warmup 3 \
    --export-json results.json \
    "fd -j {threads}"
```

As you can tell, using more threads is clearly beneficial for these searches. But I admit: startup time is a different story. In an empty directory, it looks like this:

But if I have to choose, I would definitely lean towards making long(er) searches faster, instead of optimizing startup time... which is completely negligible unless you're running hundreds of searches inside tiny directories. But then you're probably using a script (where you can easily tune the number of threads).

Now all that being said: if the current strategy shows unfavorable benchmark results on machines with N_cores ≫ 8, I'd be happy to implement something like a maximum on the default thread count.

Also, we digress. As @tavianator pointed out, this ticket is about WSL. So maybe let's get back to that topic and open a new ticket to discuss a better default thread count.
Agreed on all, and sorry for peppering this ticket with what probably deserves its own ticket!

Does an increasing thread count increase the startup time simply due to the cost of starting up and tearing down the threads? Also, wouldn't one hit an I/O bottleneck pretty quickly past N threads (where N is some low number that is certainly significantly below 32)? I know my CPU is probably unusual, but the "lower" Threadrippers are probably not that uncommon (perhaps most notably, Linus Torvalds').
There has been some work recently that should have improved this. |
True, I think this can be closed for now. Please report back if this should still be an issue. |
I'm not sure this is fixed. This particular report is WSL-specific. Someone should at least check fd 9.0 on WSL2 before we close it. (I only have a Windows VM, so WSL2 would be nested virtualization, and probably not a fair test.)
@tavianator @sharkdp I was tracking down why mold was running slower than lld and the default linker in WSL2 and found this. Nonetheless:

```console
~
❯ fd --version
fdfind 9.0.0

~
❯ hyperfine -w 50 "fd" "fd -j1" -N
Benchmark 1: fd
  Time (mean ± σ):       5.8 ms ±   0.8 ms    [User: 12.0 ms, System: 3.3 ms]
  Range (min … max):     4.1 ms …   9.2 ms    480 runs

Benchmark 2: fd -j1
  Time (mean ± σ):       8.0 ms ±   1.6 ms    [User: 6.7 ms, System: 2.7 ms]
  Range (min … max):     4.8 ms …  16.0 ms    526 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  fd ran
    1.38 ± 0.33 times faster than fd -j1
```

It seems improved. It's a hassle to optimize for with the weird threading behavior, but it's very much appreciated; thank you guys for all the hard work.

And just for reference, on v8.7.1:

```console
~
❯ hyperfine -w 50 "./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd" "./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd -j1" -N
Benchmark 1: ./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd
  Time (mean ± σ):      24.4 ms ±   3.2 ms    [User: 6.7 ms, System: 29.4 ms]
  Range (min … max):    17.3 ms …  49.7 ms    114 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd -j1
  Time (mean ± σ):       8.1 ms ±   1.5 ms    [User: 4.3 ms, System: 2.8 ms]
  Range (min … max):     5.3 ms …  13.0 ms    487 runs

Summary
  ./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd -j1 ran
    3.03 ± 0.68 times faster than ./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd
```
@WanderLanz Thanks for re-testing this! Looks like it's fixed. |
When not using `-j1`, `fd` takes thousands of times longer.