
GGML threadpool blocked when thread contention #2725

Open
sandrohanea opened this issue Jan 12, 2025 · 10 comments

Comments

@sandrohanea
Contributor

Hello @ggerganov, @slaren,

I initially identified this issue when I tried to upgrade Whisper.net to the latest whisper.cpp and ggml in sandrohanea/whisper.net#319 and observed that support for MSVC was removed.

I started to investigate this issue but I'm a bit blocked, as I'm not that proficient in C++.

I added a simple repro example here: sandrohanea@07edc83

The same code, compiled with MSVC, works:

    cmake -S . -B build/msvc -A x64 -DGGML_NATIVE=OFF
    cmake --build build/msvc --config Release

But when built with Clang, it's not working:

    cmake -S . -B build/winx64 -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE="../cmake/x64-windows-llvm.cmake" -DGGML_NATIVE=OFF
    cmake --build build/winx64 --config Release

where ../cmake/x64-windows-llvm.cmake is: https://github.com/ggerganov/llama.cpp/blob/master/cmake/x64-windows-llvm.cmake

[animation: the Clang build hangs]

The same setup works with the MSVC build:

[animation: the MSVC build completes]

What's more interesting is that the deadlock in the Clang-built process blocks that entire CPU and causes all other whisper processes using the same CPU to fail (however, when the Clang-built process is killed with Ctrl + C, the other one resumes):

[animation: the second process resumes after the first is killed]

I'm not sure whether the same behavior occurs with Clang on Arm64 or only on x64, but Arm64 can now be built only with Clang, so it would be great if someone could test this.

Thank you!

@ggerganov
Owner

Can you reproduce this with any of the whisper.cpp examples, so that we can make sure the problem is in whisper.cpp and not in whisper.net?

@sandrohanea
Contributor Author

Can you reproduce this with any of the whisper.cpp examples, so that we can make sure the problem is in whisper.cpp and not in whisper.net?

The repro that I added here is just a simple C++ example (no whisper.net involved):

sandrohanea@07edc83#diff-8012be6a28736dd93582661245965820609c958d8d97ae03750257a2e11f35e1R34

It's basically just starting multiple inference calls in parallel:


#include "common.h"
#include "whisper.h"
#include "grammar-parser.h"
#include <string>
#include <thread>
#include <vector>
#include <iostream>

using namespace std;

void RunInference(whisper_context* context, int index)
{
    // Generate random samples in (-1, 1) as input (2 seconds at 16 kHz)
    auto input = new float[16000 * 2];
    for (int i = 0; i < 16000 * 2; i++) {
        input[i] = (float)rand() / RAND_MAX * 2 - 1;
    }
    cout << "Starting inference number " << index << endl;
    whisper_state* state = whisper_init_state(context);
    auto params = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
    whisper_full_with_state(context, state, params, input, 16000 * 2);
    cout << "Finished inference number " << index << endl;
    whisper_free_state(state);
    delete[] input;
}

int main(int argc, char** argv) {
    cout << "Waiting, so that the affinity can be set on the process to make sure only one CPU is used" << endl;
    cin.get();
    whisper_context* ctx = whisper_init_from_file("C:\\Projects\\sandrohanea\\whisper.net\\whisper.cpp\\models\\ggml-tiny.bin");
    // Optional inference call to warm up the runtime (the issue reproduces with or without this)
    RunInference(ctx, 0);
    // Start several inference calls in parallel
    vector<thread> threads;
    for (int i = 0; i < 4; i++) {
        threads.push_back(thread(RunInference, ctx, i));
    }
    for (auto& t : threads) {
        t.join();
    }
    whisper_free(ctx);
}

@ggerganov
Owner

The thread sanitizer reports a couple of data races, though I'm not sure if these could cause the problem that you observe:

==================
WARNING: ThreadSanitizer: data race (pid=28458)
  Write of size 4 at 0x0001070f0118 by thread T528:
    #0 ggml_threadpool_new_impl ggml-cpu.c:13767 (libggml-cpu.dylib:arm64+0x1cf7c)
    #1 ggml_graph_compute ggml-cpu.c:13837 (libggml-cpu.dylib:arm64+0x1ec20)
    #2 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:158 (libggml-cpu.dylib:arm64+0x193410)
    #3 ggml_backend_graph_compute_async ggml-backend.cpp:332 (libggml-base.dylib:arm64+0xc8430)
    #4 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1397 (libggml-base.dylib:arm64+0xe94a8)
    #5 ggml_backend_sched_graph_compute_async ggml-backend.cpp:1588 (libggml-base.dylib:arm64+0xe6aa0)
    #6 ggml_backend_sched_graph_compute ggml-backend.cpp:1572 (libggml-base.dylib:arm64+0xe6748)
    #7 ggml_graph_compute_helper(ggml_backend_sched*, ggml_cgraph*, int) whisper.cpp:183 (libwhisper.1.7.4.dylib:arm64+0x38f60)

  Previous read of size 4 at 0x0001070f0118 by thread T527:
    #0 ggml_threadpool_free ggml-cpu.c:13319 (libggml-cpu.dylib:arm64+0x12874)
    #1 ggml_graph_compute ggml-cpu.c:13884 (libggml-cpu.dylib:arm64+0x1f3ec)
    #2 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:158 (libggml-cpu.dylib:arm64+0x193410)
    #3 ggml_backend_graph_compute_async ggml-backend.cpp:332 (libggml-base.dylib:arm64+0xc8430)
    #4 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1397 (libggml-base.dylib:arm64+0xe94a8)
    #5 ggml_backend_sched_graph_compute_async ggml-backend.cpp:1588 (libggml-base.dylib:arm64+0xe6aa0)
    #6 ggml_backend_sched_graph_compute ggml-backend.cpp:1572 (libggml-base.dylib:arm64+0xe6748)
    #7 ggml_graph_compute_helper(ggml_backend_sched*, ggml_cgraph*, int) whisper.cpp:183 (libwhisper.1.7.4.dylib:arm64+0x38f60)

  Thread T528 (tid=16157693, running) created by main thread at:
    #0 pthread_create <null>:78651984 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x309d8)
    #1 std::__1::__libcpp_thread_create[abi:ne180100](_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:317 (whisper-repro:arm64+0x1000090c4)

  Thread T527 (tid=16157692, running) created by main thread at:
    #0 pthread_create <null>:78651984 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x309d8)
    #1 std::__1::__libcpp_thread_create[abi:ne180100](_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:317 (whisper-repro:arm64+0x1000090c4)

SUMMARY: ThreadSanitizer: data race ggml-cpu.c:13767 in ggml_threadpool_new_impl
==================
==================
WARNING: ThreadSanitizer: data race (pid=28458)
  Write of size 4 at 0x0001070f0124 by thread T528:
    #0 ggml_threadpool_new_impl ggml-cpu.c:13769 (libggml-cpu.dylib:arm64+0x1d218)
    #1 ggml_graph_compute ggml-cpu.c:13837 (libggml-cpu.dylib:arm64+0x1ec20)
    #2 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:158 (libggml-cpu.dylib:arm64+0x193410)
    #3 ggml_backend_graph_compute_async ggml-backend.cpp:332 (libggml-base.dylib:arm64+0xc8430)
    #4 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1397 (libggml-base.dylib:arm64+0xe94a8)
    #5 ggml_backend_sched_graph_compute_async ggml-backend.cpp:1588 (libggml-base.dylib:arm64+0xe6aa0)
    #6 ggml_backend_sched_graph_compute ggml-backend.cpp:1572 (libggml-base.dylib:arm64+0xe6748)
    #7 ggml_graph_compute_helper(ggml_backend_sched*, ggml_cgraph*, int) whisper.cpp:183 (libwhisper.1.7.4.dylib:arm64+0x38f60)

  Previous read of size 4 at 0x0001070f0124 by thread T2171:
    #0 ggml_graph_compute_poll_for_work ggml-cpu.c:13648 (libggml-cpu.dylib:arm64+0x26b0c)
    #1 ggml_graph_compute_check_for_work ggml-cpu.c:13661 (libggml-cpu.dylib:arm64+0x26550)
    #2 ggml_graph_compute_secondary_thread ggml-cpu.c:13704 (libggml-cpu.dylib:arm64+0x25ec8)

  Thread T528 (tid=16157693, running) created by main thread at:
    #0 pthread_create <null>:78650976 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x309d8)
    #1 std::__1::__libcpp_thread_create[abi:ne180100](_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:317 (whisper-repro:arm64+0x1000090c4)

  Thread T2171 (tid=16159873, finished) created by thread T527 at:
    #0 pthread_create <null>:78650976 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x309d8)
    #1 ggml_threadpool_new_impl ggml-cpu.c:13798 (libggml-cpu.dylib:arm64+0x1e034)
    #2 ggml_graph_compute ggml-cpu.c:13837 (libggml-cpu.dylib:arm64+0x1ec20)
    #3 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:158 (libggml-cpu.dylib:arm64+0x193410)
    #4 ggml_backend_graph_compute_async ggml-backend.cpp:332 (libggml-base.dylib:arm64+0xc8430)
    #5 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1397 (libggml-base.dylib:arm64+0xe94a8)
    #6 ggml_backend_sched_graph_compute_async ggml-backend.cpp:1588 (libggml-base.dylib:arm64+0xe6aa0)
    #7 ggml_backend_sched_graph_compute ggml-backend.cpp:1572 (libggml-base.dylib:arm64+0xe6748)
    #8 ggml_graph_compute_helper(ggml_backend_sched*, ggml_cgraph*, int) whisper.cpp:183 (libwhisper.1.7.4.dylib:arm64+0x38f60)

SUMMARY: ThreadSanitizer: data race ggml-cpu.c:13769 in ggml_threadpool_new_impl
==================
Finished inference number 3
Finished inference number 1
Finished inference number 0

Does it work with GGML_OPENMP=OFF?

@slaren
Collaborator

slaren commented Jan 12, 2025

however, when the process for the clang build is killed with Ctrl + C, the other one resumes

That doesn't look like a deadlock; it seems more likely that performance is just very bad when using more threads than are available. This is not completely unexpected: many parts need the cooperation of all the threads to proceed, and if there are not enough threads available, each thread will spin for a while, lose its time slice, then the next thread will be scheduled and waste more time spinning, and so on. I am not sure this is something that needs to be fixed; just do not use more threads than are available.

@sandrohanea
Contributor Author

It seems that GGML_OPENMP=OFF makes the MSVC build fail as well.

So, most likely the problem is with GGML_OPENMP=OFF, and Clang just couldn't find OpenMP while MSVC can.
I'm now trying to confirm whether the Clang build works with GGML_OPENMP=ON.

@sandrohanea
Contributor Author

however, when the process for the clang build is killed with Ctrl + C, the other one resumes

That doesn't look like a deadlock; it seems more likely that performance is just very bad when using more threads than are available. This is not completely unexpected: many parts need the cooperation of all the threads to proceed, and if there are not enough threads available, each thread will spin for a while, lose its time slice, then the next thread will be scheduled and waste more time spinning, and so on. I am not sure this is something that needs to be fixed; just do not use more threads than are available.

It stayed blocked for more than 5 hours on GitHub Actions (here: https://github.com/sandrohanea/whisper.net/actions/runs/12726221171/job/35474406671) and not one test succeeded (they usually finish in a couple of seconds).

It's not impossible that this is a performance problem, but such a big difference seems too much to me.

@slaren
Collaborator

slaren commented Jan 12, 2025

Much of the synchronization without OpenMP uses spin locks, which is great for performance when all threads are running at the same time, but can give very bad results when there is thread contention. A thread waits for the other threads to finish their jobs by spinning, but no other threads are actually running, and the only way for them to finish their jobs is for the current thread to give up its time slice. It may actually be a deadlock in this case; it's not completely out of the question. But it is something that should be easy to avoid by not using more threads than are available.

@slaren
Collaborator

slaren commented Jan 12, 2025

@max-krasnyansky tagging you in case this is something you are interested in improving. The problem seems to be that when the ggml threadpool is used under thread contention, the threads fail to make any progress.

@sandrohanea sandrohanea changed the title Clang build for Windows x64 is causing deadlock / spinlock when affinity is set to a single CPU GGML threadpool blocked when thread contention Jan 12, 2025
@sandrohanea
Contributor Author

Much of the synchronization without OpenMP uses spin locks, which is great for performance when all threads are running at the same time, but can give very bad results when there is thread contention. A thread waits for the other threads to finish their jobs by spinning, but no other threads are actually running, and the only way for them to finish their jobs is for the current thread to give up its time slice. It may actually be a deadlock in this case; it's not completely out of the question. But it is something that should be easy to avoid by not using more threads than are available.

In general, the number of running threads is smaller than the number of physical threads. The default parameters are already set to limit the number of threads to at most the number of physical threads, as seen here:

/*.n_threads =*/ std::min(4, (int32_t) std::thread::hardware_concurrency()),

However, in some environments it might not be possible to determine the number of available threads (e.g., within certain containers, such as a GitHub Actions environment), where the thread affinity is managed by the host.

I agree it is not ideal, and we can mitigate this by hardcoding the max number of threads to 1 when we don't control the environment these will run in. However, it would be great if ggml could handle thread contention more gracefully, allowing the processing to resume even under such conditions.

(renamed the issue as it is not related to clang)

Thanks Georgi and Diego for the quick answers and investigation!

@max-krasnyansky
Contributor

@max-krasnyansky tagging you in case this is something you are interested in improving. The problem seems to be that when the ggml threadpool is used under thread contention, the threads fail to make any progress.

Yep. Sorry for the delayed reply.
When we control the affinity explicitly (the --cpu-mask option), we print a warning when the user provides fewer cores (bits) than n_threads.
We could also add a warning for n_online_cpus < n_threads, but if the user/system externally forces affinity to some limited number of cores, it will be expensive to detect that. I'll dig some more into what OpenMP does in that case. Perhaps they have some simple trick we can use.
