special casing norm=2 for speedup #96

Closed
wants to merge 1 commit into from

jyrkialakuijala
Collaborator

@jyrkialakuijala jyrkialakuijala commented Jun 13, 2024

Before:

|Score type |MSE               |Min score         |Max score         |Mean score        |
|-----------|------------------|------------------|------------------|------------------|
|Zimtohrli  |0.101641872045228 |0.563959418943879 |0.762245848804069 |0.686398636984063 |
|ViSQOL     |0.115330916105424 |0.520833375452983 |0.801480831107469 |0.675101633981268 |
|2f         |0.129541391104905 |0.484687555319526 |0.797475783883375 |0.661870345773127 |
|PESQ       |0.147425552045669 |0.342342966279351 |0.841271127756762 |0.647128996775172 |
|CDPAM      |0.153471222942756 |0.441558428344727 |0.728779141125759 |0.620699318941738 |
|PARLAQ     |0.185057687192323 |0.445261140223642 |0.784370761057963 |0.587162756572532 |
|AQUA       |0.223207996944378 |0.331645933512413 |0.739286336419790 |0.547804951221731 |
|PEAQB      |0.225217321572038 |0.278744167467764 |0.851011116004117 |0.553935720513487 |
|DPAM       |0.315810440183130 |0.186717781679534 |0.690564701717118 |0.460415212267967 |
|WARP-Q     |0.339686211572685 |0.067600137543649 |0.777119464646524 |0.475793617709890 |
|GVPMOS     |0.412937133868407 |0.006851162794410 |0.783946603687895 |0.412912222208318 |

real 44m52.908s # not indicative: the computation stalled for ~30 min while the terminal was blocked by another window (progress bar filling the buffer?)
user 748m34.686s
sys 219m0.831s

After:

|Score type |MSE               |Min score         |Max score         |Mean score        |
|-----------|------------------|------------------|------------------|------------------|
|Zimtohrli  |0.101642192840456 |0.563959418943879 |0.762245848804069 |0.686398039616168 |
|ViSQOL     |0.115330916105424 |0.520833375452983 |0.801480831107469 |0.675101633981268 |
|2f         |0.129541391104905 |0.484687555319526 |0.797475783883375 |0.661870345773127 |
|PESQ       |0.147425552045669 |0.342342966279351 |0.841271127756762 |0.647128996775172 |
|CDPAM      |0.153471222942756 |0.441558428344727 |0.728779141125759 |0.620699318941738 |
|PARLAQ     |0.185057687192323 |0.445261140223642 |0.784370761057963 |0.587162756572532 |
|AQUA       |0.223207996944378 |0.331645933512413 |0.739286336419790 |0.547804951221731 |
|PEAQB      |0.225217321572038 |0.278744167467764 |0.851011116004117 |0.553935720513487 |
|DPAM       |0.315810440183130 |0.186717781679534 |0.690564701717118 |0.460415212267967 |
|WARP-Q     |0.339686211572685 |0.067600137543649 |0.777119464646524 |0.475793617709890 |
|GVPMOS     |0.412937133868407 |0.006851162794410 |0.783946603687895 |0.412912222208318 |

real 13m52.362s
user 625m20.070s
sys 279m59.803s

Roughly a 16.5 % reduction in user time (748m34.686s before, 625m20.070s after).

Some additional speedup comes from the gammatone filter side.
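
As a quick check of the timing figures quoted above, here is a minimal sketch that converts `time`-style durations to seconds and computes the relative reduction. The helper names `ToSeconds` and `ReductionPercent` are hypothetical and are not part of Zimtohrli.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>

// Hypothetical helper: converts a `time`-style duration such as
// "748m34.686s" to seconds. Assumes the "<minutes>m<seconds>s" layout.
double ToSeconds(const std::string& t) {
  const std::size_t m = t.find('m');
  const double minutes = std::stod(t.substr(0, m));
  // std::stod parses the leading number and stops at the trailing 's'.
  const double seconds = std::stod(t.substr(m + 1));
  return minutes * 60.0 + seconds;
}

// Hypothetical helper: percentage reduction going from `before` to `after`.
double ReductionPercent(double before, double after) {
  return 100.0 * (before - after) / before;
}
```

With the numbers above, `ReductionPercent(ToSeconds("748m34.686s"), ToSeconds("625m20.070s"))` comes out near 16.5 for user time. Because sys time rose between the two runs (219m0.831s to 279m59.803s), the same calculation on user + sys gives a smaller reduction, around 6.4 %. The real times are not comparable, since the first run was stalled.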

@zond
Collaborator

zond commented Jun 13, 2024

Before:

zond@nyarlathotep:~/projects/zimtohrli/build{main}$ git pull
Already up to date.
zond@nyarlathotep:~/projects/zimtohrli/build{main}$ ninja
ninja: no work to do.
zond@nyarlathotep:~/projects/zimtohrli/build{main}$ ./zimtohrli_benchmark "--benchmark_filter=.*DTW.*"
2024-06-13T14:48:54+00:00
Running ./zimtohrli_benchmark
Run on (128 X 2450 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x64)
  L1 Instruction 32 KiB (x64)
  L2 Unified 512 KiB (x64)
  L3 Unified 32768 KiB (x8)
Load Average: 5.70, 4.18, 3.95
----------------------------------------------------------------------------
Benchmark                  Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------
BM_DTW/1000          7813602 ns      7813625 ns           91 items_per_second=127.982k/s
BM_DTW/4096        130020184 ns    130017113 ns            5 items_per_second=31.5035k/s
BM_DTW/10000       768624511 ns    768597801 ns            1 items_per_second=13.0107k/s
BM_ChainDTW/1000     8229079 ns      8228690 ns           82 items_per_second=121.526k/s
BM_ChainDTW/4096    74764404 ns     74764579 ns            9 items_per_second=54.7853k/s
BM_ChainDTW/32768  987287343 ns    987277458 ns            1 items_per_second=33.1903k/s
BM_ChainDTW/50000 1546883011 ns   1546769024 ns            1 items_per_second=32.3254k/s

After:

zond@nyarlathotep:~/projects/zimtohrli/build{main}$ git checkout jyrki_pr
Switched to branch 'jyrki_pr'
zond@nyarlathotep:~/projects/zimtohrli/build{jyrki_pr}$ grep "float HwyDeltaNorm" -A 20 ../cpp/zimt/dtw.cc 
float HwyDeltaNorm(hwy::Span<const float> span_a, hwy::Span<const float> span_b,
                   float order, float max) {
  if (max == 0) {
    return 0;
  }

  CHECK_EQ(span_a.size(), span_b.size());

  const Vec order_vec = Set(d, order);
  const Vec max_reciprocal = Div(Set(d, 1), Set(d, max));
  double sum = 0;
  if (order == 2.0) {
    // Faster special case without Exp/Log for order == 2.0, the usual case.
    for (size_t index = 0; index < span_a.size(); index += Lanes(d)) {
      const Vec delta =
          Sub(Load(d, span_a.data() + index), Load(d, span_b.data() + index));
      const Vec pows = Mul(delta, delta);
      sum += static_cast<double>(ReduceSum(d, pows));
    }
    sum /= max * max;
  } else {
zond@nyarlathotep:~/projects/zimtohrli/build{jyrki_pr}$ ninja
[7/7] Generating /usr/local/google/home/zond/projects/zimtohrli/go/goohrli/goohrli.a
zond@nyarlathotep:~/projects/zimtohrli/build{jyrki_pr}$ ./zimtohrli_benchmark "--benchmark_filter=.*DTW.*"
2024-06-13T14:49:35+00:00
Running ./zimtohrli_benchmark
Run on (128 X 2450 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x64)
  L1 Instruction 32 KiB (x64)
  L2 Unified 512 KiB (x64)
  L3 Unified 32768 KiB (x8)
Load Average: 4.73, 4.10, 3.94
----------------------------------------------------------------------------
Benchmark                  Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------
BM_DTW/1000         14396009 ns     14395034 ns           44 items_per_second=69.4684k/s
BM_DTW/4096        261142888 ns    261139265 ns            2 items_per_second=15.6851k/s
BM_DTW/10000       881928227 ns    881898940 ns            1 items_per_second=11.3392k/s
BM_ChainDTW/1000    15358534 ns     15358366 ns           59 items_per_second=65.1111k/s
BM_ChainDTW/4096   142775649 ns    142762950 ns            4 items_per_second=28.6909k/s
BM_ChainDTW/32768 2594315531 ns   2594240967 ns            1 items_per_second=12.6311k/s
BM_ChainDTW/50000 3767449623 ns   3767133133 ns            1 items_per_second=13.2727k/s
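
For readers without Highway at hand, the idea behind the `order == 2.0` branch in the grep output above can be sketched in scalar C++. This is an illustration under stated assumptions, not the project's code: `DeltaNormSketch` is a hypothetical name, and because the `else` branch is cut off in the output, the general case here is assumed to sum `(|a[i] - b[i]| / max)^order` with no final root.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar stand-in for the vectorized routine (hypothetical, for illustration).
// Assumption: the general branch sums (|a[i] - b[i]| / max)^order.
double DeltaNormSketch(const std::vector<float>& a, const std::vector<float>& b,
                       float order, float max) {
  if (max == 0) {
    return 0;
  }
  assert(a.size() == b.size());
  double sum = 0;
  if (order == 2.0f) {
    // Special case: squaring needs no pow/exp/log, and the per-element
    // division by max is replaced by one division by max * max at the end.
    for (std::size_t i = 0; i < a.size(); ++i) {
      const double delta = static_cast<double>(a[i]) - b[i];
      sum += delta * delta;
    }
    sum /= static_cast<double>(max) * max;
  } else {
    // Assumed general case: one pow call per element.
    for (std::size_t i = 0; i < a.size(); ++i) {
      const double scaled = std::abs(static_cast<double>(a[i]) - b[i]) / max;
      sum += std::pow(scaled, static_cast<double>(order));
    }
  }
  return sum;
}
```

Both branches agree mathematically for `order == 2`; the special case only avoids the transcendental call and hoists the scaling out of the loop. Whether that shows up in a given benchmark depends on how the benchmark is set up.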

@zond
Collaborator

zond commented Jun 13, 2024

The benchmark was buggy.

I'm submitting #99 instead of this one, since it goes even further in removing unnecessary garbage.

@zond zond closed this Jun 13, 2024