diskus slower than du #38

Open
tesuji opened this issue Oct 29, 2019 · 11 comments
Labels: bug, enhancement, good first issue, help wanted

Comments

@tesuji

tesuji commented Oct 29, 2019

Maybe I am doing something wrong.
The directory being measured is my clippy build directory.

% /usr/bin/du -sch
4.5G    .
4.5G    total
% diskus
4.73 GB (4,727,521,280 bytes)
% hyperfine diskus '/usr/bin/du -sch'
Benchmark #1: diskus
  Time (mean ± σ):     115.8 ms ±  28.6 ms    [User: 2.601 s, System: 0.592 s]
  Range (min … max):    69.1 ms … 156.9 ms    19 runs

Benchmark #2: /usr/bin/du -sch
  Time (mean ± σ):      22.8 ms ±   2.8 ms    [User: 5.5 ms, System: 17.4 ms]
  Range (min … max):    14.2 ms …  26.9 ms    163 runs

Summary
  '/usr/bin/du -sch' ran
    5.07 ± 1.40 times faster than 'diskus'

Meta

  • diskus: b2e4cf9 but with cargo update
@sharkdp
Owner

sharkdp commented Oct 29, 2019

Interesting, thank you for reporting this.

First off, please always use the --warmup option of hyperfine (or perform a cold-cache benchmark), see https://github.com/sharkdp/diskus#warm-disk-cache

But even with that out of the way, diskus seems to be much slower here. I am assuming you did a normal release build (with optimizations) via cargo install --path .?

What kind of disk is your folder on? Or is it mounted via network/etc.?

It could be related to the optimal number of threads. Could you please run this parametrized benchmark?

hyperfine -w5 -P threads 1 16 "diskus -j {threads}" --export-markdown /tmp/results.md

and post the content of /tmp/results.md here?

By the way: if you want to see both tools report the exact same size, use du -sc -B1 and diskus or, alternatively, use du -scb and diskus -b.
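The gap between the 4.5G and 4.73 GB figures in the first comment is largely a matter of units. Here is a minimal sketch of the arithmetic, assuming that du -h reports binary units (powers of 1024) rounded up to one decimal place and that diskus reports decimal gigabytes. The function names are made up for illustration and are not part of either tool:

```python
import math


def decimal_gb(num_bytes):
    """Size in decimal gigabytes (10**9 bytes), the unit diskus appears to print."""
    return num_bytes / 10**9


def binary_gib(num_bytes):
    """Size in binary gibibytes (2**30 bytes), the unit du -h is assumed to use."""
    return num_bytes / 2**30


def ceil_one_decimal(value):
    """Round up to one decimal place (assumed du -h rounding behaviour)."""
    return math.ceil(value * 10) / 10


if __name__ == "__main__":
    # Byte counts reported by diskus on the two machines in this thread.
    for num_bytes in (4_727_521_280, 4_518_088_704):
        print(
            f"{num_bytes} bytes: "
            f"{decimal_gb(num_bytes):.2f} GB, "
            f"{ceil_one_decimal(binary_gib(num_bytes)):.1f}G"
        )
```

Under these assumptions, the same byte count yields both of the human-readable values shown above, so the two tools do not necessarily disagree about the total.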

@tesuji
Author

tesuji commented Oct 29, 2019

The following results were run on a different computer:

% /usr/bin/du -sh
4.3G    .
% diskus
4.52 GB (4,518,088,704 bytes)

use the --warmup option of hyperfine

I got a kind of similar result:

% hyperfine --warmup 5 'diskus' 'du -sh'
Benchmark #1: diskus
  Time (mean ± σ):     102.5 ms ±  27.5 ms    [User: 1.899 s, System: 0.551 s]
  Range (min … max):    57.2 ms … 156.5 ms    21 runs

Benchmark #2: du -sh
  Time (mean ± σ):      33.0 ms ±   2.9 ms    [User: 12.1 ms, System: 20.9 ms]
  Range (min … max):    25.3 ms …  36.9 ms    97 runs

Summary
  'du -sh' ran
    3.11 ± 0.88 times faster than 'diskus'

you did a normal release build

Yes, I did cargo build --release.

What kind of disk is your folder on? Or is it mounted via network/etc.?

Honestly, I don't know. It is a shared server. I can provide more information
if you tell me what to run.

% df /home
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/lvm-home  2.7T  430G  2.2T  17% /home
% lsblk
NAME         MOUNTPOINT LABEL  SIZE UUID
sda                            2.7T
├─sda1                           1M
├─sda2       /boot             488M d3ad2fe7-0903-4329-a944-bd694a619fea
└─sda3                         2.7T 5SIphy-NV0U-3NIU-VG4i-dVPD-AzaF-q61CjS
  ├─lvm-swap [SWAP]            1.9G b9df322f-b43c-4bd1-b214-d70f52abbd66
  ├─lvm-root /                19.5G 7a8bb72e-96e8-4280-945e-25d9a0931443
  └─lvm-home /home             2.7T 314bd59b-e6e9-4ad5-a2db-ab581190c891

run this parametrized benchmark

Yes, here is the result:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| diskus -j 1 | 36.9 ± 0.8 | 35.6 | 39.5 | 4.71 ± 0.47 |
| diskus -j 2 | 19.3 ± 1.2 | 17.9 | 23.2 | 2.46 ± 0.28 |
| diskus -j 3 | 13.9 ± 1.4 | 12.5 | 20.4 | 1.78 ± 0.25 |
| diskus -j 4 | 11.3 ± 1.1 | 9.8 | 15.9 | 1.45 ± 0.20 |
| diskus -j 5 | 9.9 ± 1.0 | 8.3 | 13.9 | 1.26 ± 0.18 |
| diskus -j 6 | 8.9 ± 0.9 | 7.3 | 12.6 | 1.14 ± 0.16 |
| diskus -j 7 | 8.4 ± 0.9 | 6.8 | 11.2 | 1.08 ± 0.16 |
| diskus -j 8 | 8.1 ± 0.9 | 6.3 | 10.8 | 1.04 ± 0.15 |
| diskus -j 9 | 7.9 ± 0.8 | 6.4 | 10.6 | 1.01 ± 0.14 |
| diskus -j 10 | 7.9 ± 0.7 | 6.4 | 10.3 | 1.01 ± 0.14 |
| diskus -j 11 | 7.9 ± 0.7 | 6.3 | 10.2 | 1.00 ± 0.13 |
| diskus -j 12 | 7.8 ± 0.8 | 6.4 | 10.9 | 1.00 |
| diskus -j 13 | 7.9 ± 0.6 | 6.5 | 9.9 | 1.01 ± 0.13 |
| diskus -j 14 | 8.2 ± 1.3 | 6.5 | 14.5 | 1.05 ± 0.19 |
| diskus -j 15 | 8.1 ± 0.8 | 6.5 | 14.4 | 1.03 ± 0.14 |
| diskus -j 16 | 8.7 ± 0.9 | 7.1 | 14.8 | 1.11 ± 0.16 |
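
For anyone reading the table, a rough way to summarize it is the speedup over the single-threaded run and the resulting parallel efficiency. The sketch below uses a handful of the rounded mean times from the table; the helper names are hypothetical, and the numbers it prints will differ slightly from the Relative column, which hyperfine presumably derives from unrounded means:

```python
# Mean runtimes in milliseconds, copied from selected rows of the table above.
MEAN_MS = {1: 36.9, 2: 19.3, 4: 11.3, 8: 8.1, 12: 7.8, 16: 8.7}


def speedup(threads, means=MEAN_MS):
    """Speedup of a run with `threads` threads over the single-threaded run."""
    return means[1] / means[threads]


def efficiency(threads, means=MEAN_MS):
    """Speedup per thread; 1.0 would mean perfect linear scaling."""
    return speedup(threads, means) / threads


def fastest_thread_count(means=MEAN_MS):
    """Thread count with the smallest mean runtime."""
    return min(means, key=means.get)


if __name__ == "__main__":
    for n in sorted(MEAN_MS):
        print(f"-j {n:2d}: speedup {speedup(n):.2f}, efficiency {efficiency(n):.2f}")
    print("fastest:", fastest_thread_count())
```

On this data the runtime bottoms out at around 12 threads, well below the number of cores on the machine.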

@sharkdp
Owner

sharkdp commented Oct 29, 2019

Wait, so diskus is much faster with a correctly set number of threads?

Do you happen to have a massive number of CPU cores? What does

nproc

say?

@tesuji
Author

tesuji commented Oct 29, 2019

% nproc
32

@sharkdp
Owner

sharkdp commented Oct 29, 2019

Oh 😄 That seems to be the cause of this. By default, diskus uses 3 * nproc threads to walk the filesystem (96 in your case). It seems like this heuristic doesn't hold for a large number of cores.

We should probably cap it at some value (32?). If you have the time, it would be great if you could run the full benchmark up to 96 threads and post the JSON results here:

hyperfine -w5 -P threads 1 96 "diskus -j {threads}" --export-json /tmp/results.json

I would assume that the time slowly increases to 100 ms when the number of threads gets higher.

(a cold cache benchmark for comparison would also be great, but I don't want to bother you).
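
To make the proposal concrete, here is a minimal sketch of the 3 * nproc heuristic with an optional upper bound. It is written in Python purely for illustration and is not the diskus implementation; the function name is hypothetical, and the cap of 32 is only the tentative value floated above:

```python
import os


def default_threads(num_cores, factor=3, cap=None):
    """Return factor * num_cores, optionally capped, and never less than 1.

    `cap` is a hypothetical parameter: None reproduces the uncapped
    heuristic described above, and 32 is the value suggested as a
    possible limit in this thread.
    """
    threads = factor * num_cores
    if cap is not None:
        threads = min(threads, cap)
    return max(threads, 1)


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to a single core.
    cores = os.cpu_count() or 1
    print("uncapped:", default_threads(cores))
    print("capped at 32:", default_threads(cores, cap=32))
```

On the 32-core machine from this report, the uncapped heuristic gives 96 threads, while a cap of 32 would limit it to 32. On an 8-core laptop both variants give 24.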

sharkdp added the bug, enhancement, good first issue, and help wanted labels on Oct 29, 2019
@tesuji
Author

tesuji commented Oct 30, 2019

a cold cache benchmark

Sorry, I couldn't do that because it is a shared server and I don't have sudo privileges.

run the full benchmark up to 96 threads

Here is the result: https://gist.github.com/lzutao/7b86122495608f9096ac692553e2a038
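
A JSON export like this can be inspected with a few lines of Python. The sketch below assumes the usual hyperfine layout, a top-level "results" list whose entries carry a "command" string and a "mean" runtime in seconds. The sample data is not the gist content; it reuses a few means from the 1 to 16 thread table earlier in this thread, converted to seconds:

```python
import json

# Stand-in for the contents of /tmp/results.json (illustrative sample only).
SAMPLE = """
{"results": [
  {"command": "diskus -j 1",  "mean": 0.0369},
  {"command": "diskus -j 2",  "mean": 0.0193},
  {"command": "diskus -j 12", "mean": 0.0078},
  {"command": "diskus -j 16", "mean": 0.0087}
]}
"""


def fastest(report):
    """Return the result entry with the smallest mean runtime."""
    return min(report["results"], key=lambda entry: entry["mean"])


def slowdown_vs_fastest(report):
    """Map each command to its mean runtime relative to the fastest one."""
    best = fastest(report)["mean"]
    return {entry["command"]: entry["mean"] / best for entry in report["results"]}


if __name__ == "__main__":
    report = json.loads(SAMPLE)
    best = fastest(report)
    print(f'{best["command"]}: {best["mean"] * 1000:.1f} ms')
    for command, ratio in slowdown_vs_fastest(report).items():
        print(f"{command}: {ratio:.2f}x")
```

To run it against a real export, replace `json.loads(SAMPLE)` with `json.load(open("/tmp/results.json"))`.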

@sharkdp
Owner

sharkdp commented Oct 30, 2019

Ok, so the warm-cache runtime looks like this:

[Image: scaling]

Choosing 96 threads by default is obviously not optimal here.

@matu3ba

matu3ba commented Apr 13, 2020

Ok, so the warm-cache runtime looks like this:

[Image: scaling]

Choosing 96 threads by default is obviously not optimal here.

This looks to me like the cores are stalling because they can't get memory access. Is this OS restricted?
Memory restrictions usually look like an exponential approach to a limit (like a charging capacitor), but this seems to be linear from the minimum to the limit.

@sharkdp
Owner

sharkdp commented Apr 13, 2020

Is this OS restricted ??

That might be one reason. But I would rather guess that we are simply limited by the sequential nature of the disk (and cache) itself. There is a certain benefit in bombarding the IO scheduler with lots of requests (that is why we use multiple threads in the first place), but at some point the synchronization/context-switching overhead is probably just too high.

There is no really solid basis for the 3 * nproc heuristic that diskus uses. It's just something that seemed to work fine for all the machines I tested on. Things are complicated by the fact that the optimal number of threads is different for warm-cache and cold-cache runs. The 3 * nproc value was a tradeoff between the two:

[Image: benchmark plot]
(results from a 10GB folder on my 8-core laptop, warm-cache and cold-cache results normalized independently)

@matu3ba

matu3ba commented Apr 13, 2020

Is this OS restricted ??

That might be one reason. But I would rather guess that we are simply limited by the sequential nature of the disk (and cache) itself. There is a certain benefit in bombarding the IO scheduler with lots of requests (that is why we use multiple threads in the first place), but at some point the synchronization/context-switching overhead is probably just too high.

There is sysinfo, which reports the disk type.
At least you could adapt to HDD and SSD speeds, but I am not sure if it is worth it.
Sadly, it provides no method to obtain read/write speeds for the disk and caches; a speedup of around 5% from using the exact block size would be expected.

To my knowledge, there is no simple CPU model for estimating the cost of context switches, synchronization (on cache invalidations), etc., which is a shame but expected given Spectre and similar issues. If you know otherwise, please tell me.

There is no really solid basis for the 3 * nproc heuristic that diskus uses. It's just something that seemed to work fine for all the machines I tested on. Things are complicated by the fact that the optimal number of threads is different for warm-cache and cold-cache runs. The 3 * nproc value was a tradeoff between the two:

Thanks.

@tesuji
Author

tesuji commented Jul 28, 2020

Is this OS restricted ??

Yeah, likely: max open files is soft-limited to 1024.
Forget about what I said above; I ran the command on the wrong machine.
Here is the new result:

% cat /proc/$$/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             579972               579972               processes 
Max open files            1024                 1048576              files     
Max locked memory         67108864             67108864             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       579972               579972               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
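
The same open-files limit can also be read from inside a program. Here is a small sketch using Python's standard `resource` module (Unix only). The function name is made up for illustration, and whether this limit actually affects diskus is not established in this thread:

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value the way /proc/<pid>/limits does for unlimited entries."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"Max open files: soft={describe(soft)} hard={describe(hard)}")
```

On the machine above this should report a soft limit of 1024 and a hard limit of 1048576, matching the "Max open files" row.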
