
GGML threadpool blocked when thread contention #2725

Open
sandrohanea opened this issue Jan 12, 2025 · 10 comments

Comments

@sandrohanea
Contributor

Hello @ggerganov, @slaren,

I initially identified this issue when I tried to upgrade Whisper.net to the latest whisper.cpp and ggml in sandrohanea/whisper.net#319 and observed that support for MSVC was removed.

I started to investigate this issue but I'm a bit blocked, as I'm not that proficient in C++.

I added a simple repro example here: sandrohanea@07edc83

The same code, compiled with MSVC, works:

    cmake -S . -B build/msvc -A x64 -DGGML_NATIVE=OFF
    cmake --build build/msvc --config Release

But when built with Clang, it's not working:

    cmake -S . -B build/winx64 -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE="../cmake/x64-windows-llvm.cmake" -DGGML_NATIVE=OFF
    cmake --build build/winx64 --config Release

where ../cmake/x64-windows-llvm.cmake is: https://github.com/ggerganov/llama.cpp/blob/master/cmake/x64-windows-llvm.cmake

[animation: the Clang build hangs]

The same setup works with the MSVC build:

[animation: the MSVC build completes]

What's more interesting is that the deadlock in the Clang-built process blocks that entire CPU and causes all other whisper processes using the same CPU to fail (however, when the Clang-built process is killed with Ctrl + C, the other one resumes):

[animation: the second process resumes after the first is killed]

I'm not sure whether the same behavior occurs with Clang on Arm64 or only on x64, but Arm64 can now be built only with Clang, so it would be great if someone could test this.

Thank you!

@ggerganov
Owner

Can you reproduce this with any of the whisper.cpp examples, so that we can make sure the problem is in whisper.cpp and not in whisper.net?

@sandrohanea
Contributor Author

Can you reproduce this with any of the whisper.cpp examples, so that we can make sure the problem is in whisper.cpp and not in whisper.net?

The repro that I added here is just a simple C++ example (no whisper.net involved):

sandrohanea@07edc83#diff-8012be6a28736dd93582661245965820609c958d8d97ae03750257a2e11f35e1R34

It's basically just starting multiple inference calls in parallel:


#include "common.h"
#include "whisper.h"
#include "grammar-parser.h"
#include <string>
#include <thread>
#include <vector>
#include <iostream>

using namespace std;

void RunInference(whisper_context* context, int index)
{
    // Generate random samples in (-1, 1) as input (2 seconds at 16 kHz)
    auto input = new float[16000 * 2];
    for (int i = 0; i < 16000 * 2; i++) {
        input[i] = (float)rand() / RAND_MAX * 2 - 1;
    }
    cout << "Starting inference number " << index << endl;
    whisper_state* state = whisper_init_state(context);
    auto params = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
    whisper_full_with_state(context, state, params, input, 16000 * 2);
    cout << "Finished inference number " << index << endl;
    whisper_free_state(state);
    delete[] input;
}

int main(int argc, char** argv) {
    cout << "Waiting, so that the affinity can be set on the process to make sure only one CPU is used" << endl;
    cin.get();
    whisper_context* ctx = whisper_init_from_file("C:\\Projects\\sandrohanea\\whisper.net\\whisper.cpp\\models\\ggml-tiny.bin");
    // Optional inference call to warm up the runtime (the issue reproduces with or without this)
    RunInference(ctx, 0);
    // Start several inference calls in parallel
    vector<thread> threads;
    for (int i = 0; i < 4; i++) {
        threads.push_back(thread(RunInference, ctx, i));
    }
    for (auto& t : threads) {
        t.join();
    }
    whisper_free(ctx);
}

@ggerganov
Owner

The thread sanitizer reports a couple of data races, though I'm not sure if these could cause the problem that you observe:

==================
WARNING: ThreadSanitizer: data race (pid=28458)
  Write of size 4 at 0x0001070f0118 by thread T528:
    #0 ggml_threadpool_new_impl ggml-cpu.c:13767 (libggml-cpu.dylib:arm64+0x1cf7c)
    #1 ggml_graph_compute ggml-cpu.c:13837 (libggml-cpu.dylib:arm64+0x1ec20)
    #2 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:158 (libggml-cpu.dylib:arm64+0x193410)
    #3 ggml_backend_graph_compute_async ggml-backend.cpp:332 (libggml-base.dylib:arm64+0xc8430)
    #4 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1397 (libggml-base.dylib:arm64+0xe94a8)
    #5 ggml_backend_sched_graph_compute_async ggml-backend.cpp:1588 (libggml-base.dylib:arm64+0xe6aa0)
    #6 ggml_backend_sched_graph_compute ggml-backend.cpp:1572 (libggml-base.dylib:arm64+0xe6748)
    #7 ggml_graph_compute_helper(ggml_backend_sched*, ggml_cgraph*, int) whisper.cpp:183 (libwhisper.1.7.4.dylib:arm64+0x38f60)

  Previous read of size 4 at 0x0001070f0118 by thread T527:
    #0 ggml_threadpool_free ggml-cpu.c:13319 (libggml-cpu.dylib:arm64+0x12874)
    #1 ggml_graph_compute ggml-cpu.c:13884 (libggml-cpu.dylib:arm64+0x1f3ec)
    #2 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:158 (libggml-cpu.dylib:arm64+0x193410)
    #3 ggml_backend_graph_compute_async ggml-backend.cpp:332 (libggml-base.dylib:arm64+0xc8430)
    #4 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1397 (libggml-base.dylib:arm64+0xe94a8)
    #5 ggml_backend_sched_graph_compute_async ggml-backend.cpp:1588 (libggml-base.dylib:arm64+0xe6aa0)
    #6 ggml_backend_sched_graph_compute ggml-backend.cpp:1572 (libggml-base.dylib:arm64+0xe6748)
    #7 ggml_graph_compute_helper(ggml_backend_sched*, ggml_cgraph*, int) whisper.cpp:183 (libwhisper.1.7.4.dylib:arm64+0x38f60)

  Thread T528 (tid=16157693, running) created by main thread at:
    #0 pthread_create <null>:78651984 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x309d8)
    #1 std::__1::__libcpp_thread_create[abi:ne180100](_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:317 (whisper-repro:arm64+0x1000090c4)

  Thread T527 (tid=16157692, running) created by main thread at:
    #0 pthread_create <null>:78651984 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x309d8)
    #1 std::__1::__libcpp_thread_create[abi:ne180100](_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:317 (whisper-repro:arm64+0x1000090c4)

SUMMARY: ThreadSanitizer: data race ggml-cpu.c:13767 in ggml_threadpool_new_impl
==================
==================
WARNING: ThreadSanitizer: data race (pid=28458)
  Write of size 4 at 0x0001070f0124 by thread T528:
    #0 ggml_threadpool_new_impl ggml-cpu.c:13769 (libggml-cpu.dylib:arm64+0x1d218)
    #1 ggml_graph_compute ggml-cpu.c:13837 (libggml-cpu.dylib:arm64+0x1ec20)
    #2 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:158 (libggml-cpu.dylib:arm64+0x193410)
    #3 ggml_backend_graph_compute_async ggml-backend.cpp:332 (libggml-base.dylib:arm64+0xc8430)
    #4 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1397 (libggml-base.dylib:arm64+0xe94a8)
    #5 ggml_backend_sched_graph_compute_async ggml-backend.cpp:1588 (libggml-base.dylib:arm64+0xe6aa0)
    #6 ggml_backend_sched_graph_compute ggml-backend.cpp:1572 (libggml-base.dylib:arm64+0xe6748)
    #7 ggml_graph_compute_helper(ggml_backend_sched*, ggml_cgraph*, int) whisper.cpp:183 (libwhisper.1.7.4.dylib:arm64+0x38f60)

  Previous read of size 4 at 0x0001070f0124 by thread T2171:
    #0 ggml_graph_compute_poll_for_work ggml-cpu.c:13648 (libggml-cpu.dylib:arm64+0x26b0c)
    #1 ggml_graph_compute_check_for_work ggml-cpu.c:13661 (libggml-cpu.dylib:arm64+0x26550)
    #2 ggml_graph_compute_secondary_thread ggml-cpu.c:13704 (libggml-cpu.dylib:arm64+0x25ec8)

  Thread T528 (tid=16157693, running) created by main thread at:
    #0 pthread_create <null>:78650976 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x309d8)
    #1 std::__1::__libcpp_thread_create[abi:ne180100](_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:317 (whisper-repro:arm64+0x1000090c4)

  Thread T2171 (tid=16159873, finished) created by thread T527 at:
    #0 pthread_create <null>:78650976 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x309d8)
    #1 ggml_threadpool_new_impl ggml-cpu.c:13798 (libggml-cpu.dylib:arm64+0x1e034)
    #2 ggml_graph_compute ggml-cpu.c:13837 (libggml-cpu.dylib:arm64+0x1ec20)
    #3 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:158 (libggml-cpu.dylib:arm64+0x193410)
    #4 ggml_backend_graph_compute_async ggml-backend.cpp:332 (libggml-base.dylib:arm64+0xc8430)
    #5 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1397 (libggml-base.dylib:arm64+0xe94a8)
    #6 ggml_backend_sched_graph_compute_async ggml-backend.cpp:1588 (libggml-base.dylib:arm64+0xe6aa0)
    #7 ggml_backend_sched_graph_compute ggml-backend.cpp:1572 (libggml-base.dylib:arm64+0xe6748)
    #8 ggml_graph_compute_helper(ggml_backend_sched*, ggml_cgraph*, int) whisper.cpp:183 (libwhisper.1.7.4.dylib:arm64+0x38f60)

SUMMARY: ThreadSanitizer: data race ggml-cpu.c:13769 in ggml_threadpool_new_impl
==================
Finished inference number 3
Finished inference number 1
Finished inference number 0

Does it work with GGML_OPENMP=OFF?

@slaren
Collaborator

slaren commented Jan 12, 2025

however, when the process for the clang build is killed with Ctrl + C, the other one resumes

That doesn't look like a deadlock; it seems more likely that performance is just very bad when using more threads than are available. This is not completely unexpected: many parts need the cooperation of all the threads to proceed, and if there are not enough threads available, each thread will spin for a while, lose its time slice, then the next thread will be scheduled and waste more time spinning, and so on. I am not sure this is something that needs to be fixed; just do not use more threads than are available.

@sandrohanea
Contributor Author

It seems that GGML_OPENMP=OFF makes the MSVC build fail as well.

So, most likely the problem is with GGML_OPENMP=OFF, and Clang just couldn't find OpenMP while MSVC can.
I'm now trying to confirm whether the Clang build works with GGML_OPENMP=ON.

@sandrohanea
Contributor Author

however, when the process for the clang build is killed with Ctrl + C, the other one resumes

That doesn't look like a deadlock; it seems more likely that performance is just very bad when using more threads than are available. This is not completely unexpected: many parts need the cooperation of all the threads to proceed, and if there are not enough threads available, each thread will spin for a while, lose its time slice, then the next thread will be scheduled and waste more time spinning, and so on. I am not sure this is something that needs to be fixed; just do not use more threads than are available.

It stayed blocked for more than 5 hours on GitHub Actions (here: https://github.com/sandrohanea/whisper.net/actions/runs/12726221171/job/35474406671) and not one test succeeded (they usually finish in a couple of seconds).

It's not impossible that this is a performance problem, but such a big difference seems too much to me.

@slaren
Collaborator

slaren commented Jan 12, 2025

Much of the synchronization without OpenMP uses spin locks, which is great for performance when all threads are running at the same time, but can give very bad results when there is thread contention. A thread waits for the other threads to finish their jobs by spinning, but no other threads are actually running, and the only way for them to finish their jobs is for the current thread to give up its time slice. It may actually be a deadlock in this case; it's not completely out of the question. But it is something that should be easy to avoid by not using more threads than are available.

@slaren
Collaborator

slaren commented Jan 12, 2025

@max-krasnyansky tagging you in case this is something you are interested in improving. The problem seems to be that when the ggml threadpool is used under thread contention, the threads fail to make any progress.

@sandrohanea sandrohanea changed the title Clang build for Windows x64 is causing deadlock / spinlock when affinity is set to a single CPU GGML threadpool blocked when thread contention Jan 12, 2025
@sandrohanea
Contributor Author

Much of the synchronization without OpenMP uses spin locks, which is great for performance when all threads are running at the same time, but can give very bad results when there is thread contention. A thread waits for the other threads to finish their jobs by spinning, but no other threads are actually running, and the only way for them to finish their jobs is for the current thread to give up its time slice. It may actually be a deadlock in this case; it's not completely out of the question. But it is something that should be easy to avoid by not using more threads than are available.

In general, the number of running threads is smaller than the number of physical threads. The default parameters are already set to limit the number of threads to at most the number of physical threads, as seen here:

/*.n_threads =*/ std::min(4, (int32_t) std::thread::hardware_concurrency()),

However, in some environments it might not be possible to determine the number of available threads (e.g., within certain containers, such as a GitHub Actions environment), where the thread affinity is managed by the host.

I agree it is not ideal, and we can mitigate this by hardcoding the max number of threads to 1 when we don't control the environment these will run in. However, it would be great if ggml could handle thread contention more gracefully, allowing the processing to resume even under such conditions.

(renamed the issue as it is not related to clang)

Thanks Georgi and Diego for the quick answers and investigation!

@max-krasnyansky
Contributor

@max-krasnyansky tagging you in case this is something you are interested in improving. The problem seems to be that when the ggml threadpool is used under thread contention, the threads fail to make any progress.

Yep. Sorry for the delayed reply.
When we control the affinity explicitly (the --cpu-mask option), we print a warning when the user provides fewer cores (bits) than n_threads.
We could also add a warning for n_online_cpus < n_threads, but if the user/system externally forces affinity to some limited number of cores, it will be expensive to detect that. I'll dig some more into what OpenMP does in that case. Perhaps they have some simple trick we can use.
