@jenskeiner
Contributor

@jenskeiner jenskeiner commented Sep 28, 2025

This PR adds threaded versions of the existing NDFT benchmarks.

I have removed the non-OpenMP build jobs wherever an OpenMP variant is available, since building with OpenMP is purely additive. There is no benefit in keeping the non-OpenMP jobs: where possible, the OpenMP jobs now run the multi-threaded benchmarks in addition to the regular ones.

@codspeed-hq
codspeed-hq bot commented Sep 28, 2025

CodSpeed Performance Report

Merging #173 will not alter performance

Comparing feature/threaded_benchmarks (51a110a) with develop (ceacfa2)

Summary

✅ 66 untouched
🆕 176 new
⏩ 22 skipped [1]
🗄️ 110 archived benchmarks run [2]

Benchmarks breakdown

| | Benchmark | BASE | HEAD | Change |
|---|---|---|---|---|
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_adjoint_direct_1d_omp[128/400] | N/A | 4.6 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_adjoint_direct_1d_omp[256/800] | N/A | 16.2 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_adjoint_direct_1d_omp[32/100] | N/A | 1.7 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_adjoint_direct_1d_omp[512/1600] | N/A | 58 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_adjoint_direct_1d_omp[64/200] | N/A | 2.1 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_adjoint_direct_2d_omp[16/16/500] | N/A | 10.6 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_adjoint_direct_2d_omp[32/32/1000] | N/A | 75.8 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_adjoint_direct_2d_omp[64/64/2000] | N/A | 607.4 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_adjoint_direct_3d_omp[16/16/16/1000] | N/A | 310.7 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_adjoint_direct_3d_omp[4/4/4/250] | N/A | 2.5 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_adjoint_direct_3d_omp[8/8/8/500] | N/A | 19.6 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_forward_direct_1d_omp[128/400] | N/A | 3.7 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_forward_direct_1d_omp[256/800] | N/A | 13.7 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_forward_direct_1d_omp[32/100] | N/A | 1.1 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_forward_direct_1d_omp[512/1600] | N/A | 50.8 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_forward_direct_1d_omp[64/200] | N/A | 1.6 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_forward_direct_2d_omp[16/16/500] | N/A | 11 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_forward_direct_2d_omp[32/32/1000] | N/A | 78.3 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_forward_direct_2d_omp[64/64/2000] | N/A | 626.8 ms | N/A |
| 🆕 | ubuntu-latest_clang_kaiserbessel_double_openmp/nfft_forward_direct_3d_omp[16/16/16/1000] | N/A | 312.5 ms | N/A |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Footnotes

  1. 22 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. 110 benchmarks were run, but are now archived. If they were deleted in another branch, consider rebasing to remove them from the report. Instead if they were added back, click here to restore them.

@jenskeiner jenskeiner marked this pull request as ready for review September 29, 2025 08:03
@jenskeiner jenskeiner force-pushed the feature/threaded_benchmarks branch from f75b4cd to 51a110a September 29, 2025 08:20
@michaelquellmalz
Member

Do you know why the benchmarks give so many warnings about system calls?

@jenskeiner jenskeiner changed the title from "Add threaded benchmarks." to "Add multi-threaded benchmarks." Sep 29, 2025
@jenskeiner jenskeiner added the test label (Test-related changes with no production code impact) Sep 29, 2025
@jenskeiner
Contributor Author

jenskeiner commented Oct 6, 2025

> Do you know why the benchmarks give so many warnings about system calls?

Not sure. I reckon it has to do with the overhead OpenMP incurs to coordinate the threads. The benchmark mode we use runs the code on a virtual CPU, and it looks like syscalls are a relevant factor there. It's also possible to use wall-clock time for the benchmarks, but at this stage I am not sure whether that would benefit us.

If you enable the syscalls in the flame graphs, you can see, for example, that sched_yield is responsible for much of the measured time. This is definitely related to thread coordination. Maybe using multiple threads for small sizes is too inefficient, or, as the hints in the UI indicate, syscalls are just not measured accurately in the mode we are using, so the results can be misleading.
