"Double" the performance #659
base: master
Conversation
…bench.exe and whisper.dll for convenience, similarly to linux artifacts
Ease windows development
… using a thread pool in the future.
Original version, 24 threads (95% utilization)
New version, 24 threads (50% utilization)
Original version, 120 threads (99% utilization)
New version, 120 threads (90% utilization)
I also added a script that automatically runs all benchmarks on Windows. It is simply the existing shell script converted to PowerShell.
And here's the …
Original version, 6 threads (75% utilization)
Original version, 10 threads (90% utilization)
New version, 6 threads (45% utilization)
New version, 10 threads (50% utilization)
I didn't expect performance on macOS to be so good. Anyway, it appears that for both versions, running with 6 threads is a performance sweet spot. The cool thing is that the new version uses about 40% less CPU while being about 9% faster.
This isn't ready yet, as, for some reason, the …
I wasn't able to measure the energy impact because the Activity Monitor is useless in that regard.
I just saw that …
@janekb04 I haven't tested and looked at the proposed changes, but the reported results look promising. There is no existing benchmark for the Decoder, but you can simply run the transcription for some of the sample audio files and look at the reported time/per at the end. I will take a more detailed look in the following days.
The … I will fix this soon.
Update from upstream
Mutexes and the events that signal them are the best choice in most cases. They usually aren't very fast or precise, but when the exact wake-up time isn't important, they offer the best performance. However, mutex/event and spinlock aren't the only options; there is also sleep/yield. While not directly related, here's some research I did and an example implementation of using sleep-yield in place of spinlocking in a case where latency and accuracy were the utmost priority (I needed stable frametimes). At least on Windows, it was still accurate to the point where I couldn't even measure below it (~10 μs / ~0.01 ms; even the call to QueryPerformanceCounter takes ~1 μs), so it was far faster and more accurate than that use case actually needed, while using basically zero power, i.e. an unknown number of orders of magnitude more efficient (so less power-hungry) than the spinlock alternative. I didn't really look into what the problem is here or whether this is applicable, but I'm dropping this info in case someone needs it, as this "third option" isn't as easy to find with a simple Google search.
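To illustrate the "third option", here is a minimal sketch of a sleep-then-yield timed wait: sleep in coarse chunks while far from the deadline, then yield (not spin) for the final stretch. The 2 ms margin and 1 ms sleep quantum are my assumptions for the sketch, not constants from the comment above.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper: wait until `deadline` without busy-spinning.
// Phase 1 sleeps in 1 ms chunks while the deadline is more than 2 ms away,
// so the thread burns almost no CPU; phase 2 yields until the deadline
// passes, which keeps the wake-up accurate without a hot spin loop.
inline void sleep_yield_until(std::chrono::steady_clock::time_point deadline) {
    using namespace std::chrono;
    while (deadline - steady_clock::now() > 2ms) {
        std::this_thread::sleep_for(1ms); // coarse, power-friendly waiting
    }
    while (steady_clock::now() < deadline) {
        std::this_thread::yield();        // fine-grained, still not spinning hot
    }
}
```

The achievable accuracy depends on the OS scheduler's sleep granularity, but the wait never returns before the deadline.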
I am currently developing a realtime … As far as I understand the code, the current work scheduling is less than ideal. The main thread launches some …

Main thread:

```cpp
compute_graph G; // topologically sorted
multithreaded_queue<task> Q;
for (node& n : G) {
    // The number of incoming edges,
    // i.e. the number of dependencies.
    if (n.dependency_count.nonatomic_load() > 0)
        break;
    Q.batch_enqueue(n.tasks);
}
Q.start_working();
execute_work();
// cleanup
return [the result];
```

Worker threads execute:

```cpp
Q.wait_for_start_working_blocking();
while (!Q.done()) {
    task to_do = Q.pop_blocking();
    execute(to_do);
    // If this was the last task for this node, the node has completed,
    if (to_do.node.task_count.atomic_fetch_sub(1) == 1) {
        // so all of the node's dependents have one dependency fewer.
        for (node& n : to_do.node.dependents) {
            // If the current node was the last dependency of this node,
            // we can enqueue this node's tasks for execution.
            if (n.dependency_count.atomic_fetch_sub(1) == 1) {
                Q.batch_enqueue(n.tasks);
            }
        }
    }
}
```

This design should eliminate all the blocking and waiting and maximize the amount of time the threads spend executing useful work.
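The `multithreaded_queue<task>` assumed by the pseudocode above can be sketched with a mutex and condition variable so that `pop_blocking()` sleeps instead of spinning. This is a hypothetical minimal version with simplified names, not code from this PR:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

// Minimal blocking work queue: producers batch-enqueue tasks, workers sleep
// on the condition variable until work arrives or the queue is marked done.
template <typename T>
class multithreaded_queue {
public:
    void batch_enqueue(const std::vector<T>& tasks) {
        {
            std::lock_guard<std::mutex> lock(m_);
            for (const T& t : tasks) q_.push_back(t);
        }
        cv_.notify_all(); // wake any sleeping workers
    }

    // Blocks until an item is available or the queue is done and drained.
    // Returns false only in the latter case.
    bool pop_blocking(T& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return done_ || !q_.empty(); });
        if (q_.empty()) return false; // done, nothing left to do
        out = q_.front();
        q_.pop_front();
        return true;
    }

    void mark_done() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all(); // release all waiting workers
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<T> q_;
    bool done_ = false;
};
```

Workers that call `pop_blocking()` are descheduled by the kernel while the queue is empty, which is exactly the power-usage benefit this PR is after.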
There are also a few minor things here and there. One I found is

```c
struct ggml_compute_state * workers = n_threads > 1 ? alloca(sizeof(struct ggml_compute_state)*(n_threads - 1)) : NULL;
```

using …
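For illustration, assuming the concern with that line is a stack allocation whose size depends on a runtime thread count, one portable alternative is a heap-backed buffer sized once. This is a simplified sketch with hypothetical names, not the PR's actual fix:

```cpp
#include <vector>

// Stand-in for ggml_compute_state; the real struct is larger.
struct compute_state {
    int thread_id;
};

// alloca() with a runtime-dependent size can overflow the stack for large
// n_threads, and its result is invalidated when the calling frame returns.
// A std::vector sized once avoids both problems at a negligible cost here,
// since the allocation happens once per graph computation, not per task.
std::vector<compute_state> make_workers(int n_threads) {
    // One state per extra worker; the main thread is not part of the pool.
    std::vector<compute_state> workers(n_threads > 1 ? n_threads - 1 : 0);
    for (int i = 0; i < (int)workers.size(); ++i) {
        workers[i].thread_id = i + 1; // id 0 is reserved for the main thread
    }
    return workers;
}
```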
Hi there -- Do you think your original changes will still work with llama.cpp backported updates? It would be pretty cool to have two strong performance improvements in a row! |
@JKeddo95 I took my time to read through the changes and pulled them in and as far as I can tell, this PR is still valid. |
To me this looks to be a very clean and well thought-out PR. I fully agree with the implementation provided and think this is absolutely the proper way to go. There are multiple ways to implement locking, but the lightweight mutexes used here are the best option in most cases. Spinlocks are rarely the right option, namely only when at least one of these conditions applies: …
When even the CPU manufacturers themselves have advised against using spinlocks unless necessary since at least 2011, I'd take their word for it. But the beauty here is that we don't even have to take their word for it, since your performance tests confirm it to be true. Pros: …
Cons: …
There's some more in-depth discussion about thread locking over at llama.cpp in this (now abandoned) PR: [llama.cpp] ggml: refactor compute thread: merge three spin variables into one #816. It's a long thread and not everything necessarily applies; for example, the proposition I made there about adding an #ifdef option for the different lock/sleep conditions is one I no longer agree with on second thought, as it would add unnecessary code complexity with no real advantage. I also made the point of being mindful about the cost of context switching, but that is clearly a non-issue here. To be clear, there's nothing from that discussion that needs to be added to this PR, but there is some good context and information for those interested in digging deeper. All in all, presentation-wise this is one of the best PRs I've ever seen anywhere in terms of reasoning and testing; the well-laid-out performance tests across multiple architectures and operating systems are just perfect, more than can reasonably be expected from anyone. In fact, I am bookmarking this PR as an example on the subject of "How to make a perfect PR".
A few clarifications I would like to ask about, though:
You have … now, so the PR description could be edited to reflect this?
Can you elaborate on this? Looking through the code, it looks like you have the most optimal solution. Regular mutexes are kernel objects, which require the thread to switch usermode->kernelmode->usermode whenever they are used, unlike critical sections, which can stay in usermode; afaik this is the reason they are so much faster. The CriticalSection/ConditionVariable paradigm is the fastest way to implement locking on Windows outside of spinlocks (which I wouldn't really call a locking mechanism anyway). The paradigm is widely used in low-level OS and kernel code, and my line of thinking usually goes: if something is good enough for low-level OS/kernel code, it's probably good, full stop.
As that is a different beast altogether, I'd say better to go ahead and merge this and not keep it unnecessarily blocked waiting on further developments, as on its own this PR looks ready to go? In my opinion, more & smaller PRs are a better option than fewer & larger ones anyway, as they make maintenance easier and it's easier to revert small pieces when something goes wrong.
@anzz1 Thanks for the thoughtful evaluation. I updated the PR description. Regarding the naïvety: I wrote that because I literally coded this off the top of my head, in one sitting. I just opened …
@ggerganov @janekb04 what would this require to get pulled into master? happy to take @janekb04 's work and clean it up / test it if that is all that is required. |
I tried this and it happens to be way slower in my setup:
In my tests, it goes from around 75s to process a 60-second audio file to 103–110s (with 4 threads). When running 6 threads, utilization goes between 250% and 560%, and it takes 114s. The commit id in my logs shows … Am I doing something wrong, or did this branch get outdated?
Well, it uses 50% less power - that's "double" the performance. Basically, instead of using spinlocks, I made whisper.cpp use condition variables with mutexes. Whisper is not really a low-latency system, which means that busy locks aren't the best choice for synchronisation. All the more so because whisper.cpp is also supposed to run on the web and on mobile devices, where users usually care about power usage. In this PR, I made whisper.cpp use the classical condition variable + mutex locking scheme instead. On a 12900KS without overclocking, this reduces the CPU usage (and hence the power consumption) by half. On the other hand, if we go for full 100% utilization, the computation time is reduced by about 25%. Performance tables below.

This is a draft because I haven't implemented the lock using pthreads yet, and the current Windows implementation is rather naive and suboptimal. I am also yet to optimize the computations themselves.