"Double" the performance #659
base: master
Conversation
…bench.exe and whisper.dll for convenience, similarly to linux artifacts
Ease windows development
… using a thread pool in the future.
Original version, 24 threads (95% utilization)
New version, 24 threads (50% utilization)
Original version, 120 threads (99% utilization)
New version, 120 threads (90% utilization)
I also added a script that automatically runs all benchmarks on Windows. It is simply the existing shell script converted to PowerShell.
And here's the …
Original version, 6 threads (75% utilization)
Original version, 10 threads (90% utilization)
New version, 6 threads (45% utilization)
New version, 10 threads (50% utilization)
I didn't expect performance on macOS to be so good. Anyway, it appears that for both versions, running with 6 threads is a performance sweet spot. The cool thing is that the new version uses about 40% less CPU while being about 9% faster.
This isn't ready yet, as, for some reason, the …
I wasn't able to measure the energy impact because the Activity Monitor is useless in that regard.
I just saw that …
@janekb04 I haven't tested and looked at the proposed changes, but the reported results look promising. There is no existing benchmark for the Decoder, but you can simply run the transcription for some of the sample audio files and look at the reported time/per at the end. I will take a more detailed look in the following days.
The … I will fix this soon.
Update from upstream
Mutexes and the events that signal them are the best choice in most cases. They usually aren't very fast or precise, but when the exact wake-up time isn't important, they offer the best performance. However, mutex/event and spinlock aren't the only options; there is also sleep/yield. While not directly related, here's some research I did and an example implementation of using sleep-yield in place of spinlocking in a case where latency and accuracy were the utmost priority (I needed stable frametimes). At least on Windows, it was still accurate to the point where I couldn't even measure below it (~10 μs / ~0.01 ms; even the call to QueryPerformanceCounter takes ~1 μs), so it was far faster and more accurate than that use case actually needed, while using basically zero power, i.e. an unknown number of orders of magnitude more efficient (so less power-hungry) than the spinlock alternative. I didn't really look into what the problem is here or whether this is applicable, but I'm dropping this info in case someone needs it, as this "third option" isn't as easy to find with a simple Google search.
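To illustrate the "third option", here is a minimal sketch of a sleep-then-yield timed wait: sleep in coarse chunks while far from the deadline, then yield (not spin) for the final stretch. The 2 ms margin and 1 ms sleep quantum are my assumptions for the sketch, not constants from the comment above.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper: wait until `deadline` without busy-spinning.
// Phase 1 sleeps in 1 ms chunks while the deadline is more than 2 ms away,
// so the thread burns almost no CPU; phase 2 yields until the deadline
// passes, which keeps the wake-up accurate without a hot spin loop.
inline void sleep_yield_until(std::chrono::steady_clock::time_point deadline) {
    using namespace std::chrono;
    while (deadline - steady_clock::now() > 2ms) {
        std::this_thread::sleep_for(1ms); // coarse, power-friendly waiting
    }
    while (steady_clock::now() < deadline) {
        std::this_thread::yield();        // fine-grained, still not spinning hot
    }
}
```

The achievable accuracy depends on the OS scheduler's sleep granularity, but the wait never returns before the deadline.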
I am currently developing a realtime … As far as I understand the code, the current work scheduling is less than ideal. The main thread launches some …

Main thread:

```cpp
compute_graph G; // topologically sorted
multithreaded_queue<task> Q;
for (node& n : G) {
    // The number of incoming edges,
    // i.e. the number of dependencies.
    if (n.dependency_count.nonatomic_load() > 0)
        break;
    Q.batch_enqueue(n.tasks);
}
Q.start_working();
execute_work();
// cleanup
return [the result];
```

Worker threads execute:

```cpp
Q.wait_for_start_working_blocking();
while (!Q.done()) {
    task to_do = Q.pop_blocking();
    execute(to_do);
    // If this was the last task for this node, the node has completed,
    if (to_do.node.task_count.atomic_fetch_sub(1) == 1) {
        // so all of the node's dependents have one dependency fewer.
        for (node& n : to_do.node.dependents) {
            // If the current node was the last dependency of this node,
            // we can enqueue this node's tasks for execution.
            if (n.dependency_count.atomic_fetch_sub(1) == 1) {
                Q.batch_enqueue(n.tasks);
            }
        }
    }
}
```

This design should eliminate all the blocking and waiting and maximize the amount of time the threads spend executing useful work.
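The `multithreaded_queue<task>` assumed by the pseudocode above can be sketched with a mutex and condition variable so that `pop_blocking()` sleeps instead of spinning. This is a hypothetical minimal version with simplified names, not code from this PR:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

// Minimal blocking work queue: producers batch-enqueue tasks, workers sleep
// on the condition variable until work arrives or the queue is marked done.
template <typename T>
class multithreaded_queue {
public:
    void batch_enqueue(const std::vector<T>& tasks) {
        {
            std::lock_guard<std::mutex> lock(m_);
            for (const T& t : tasks) q_.push_back(t);
        }
        cv_.notify_all(); // wake any sleeping workers
    }

    // Blocks until an item is available or the queue is done and drained.
    // Returns false only in the latter case.
    bool pop_blocking(T& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return done_ || !q_.empty(); });
        if (q_.empty()) return false; // done, nothing left to do
        out = q_.front();
        q_.pop_front();
        return true;
    }

    void mark_done() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all(); // release all waiting workers
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<T> q_;
    bool done_ = false;
};
```

Workers that call `pop_blocking()` are descheduled by the kernel while the queue is empty, which is exactly the power-usage benefit this PR is after.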
There are also a few minor things here and there. One I found is

```c
struct ggml_compute_state * workers = n_threads > 1 ? alloca(sizeof(struct ggml_compute_state)*(n_threads - 1)) : NULL;
```

using …
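For illustration, assuming the concern with that line is a stack allocation whose size depends on a runtime thread count, one portable alternative is a heap-backed buffer sized once. This is a simplified sketch with hypothetical names, not the PR's actual fix:

```cpp
#include <vector>

// Stand-in for ggml_compute_state; the real struct is larger.
struct compute_state {
    int thread_id;
};

// alloca() with a runtime-dependent size can overflow the stack for large
// n_threads, and its result is invalidated when the calling frame returns.
// A std::vector sized once avoids both problems at a negligible cost here,
// since the allocation happens once per graph computation, not per task.
std::vector<compute_state> make_workers(int n_threads) {
    // One state per extra worker; the main thread is not part of the pool.
    std::vector<compute_state> workers(n_threads > 1 ? n_threads - 1 : 0);
    for (int i = 0; i < (int)workers.size(); ++i) {
        workers[i].thread_id = i + 1; // id 0 is reserved for the main thread
    }
    return workers;
}
```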
Hi there -- Do you think your original changes will still work with llama.cpp backported updates? It would be pretty cool to have two strong performance improvements in a row! |
@JKeddo95 I took my time to read through the changes and pulled them in and as far as I can tell, this PR is still valid. |
To me this looks to be a very clean and well thought-out PR. I fully agree with the implementation provided and think this is absolutely the proper way to go. There are multiple ways to implement locking, but the lightweight mutexes used here are the best option in most cases. Spinlocks are rarely the right option, namely only when at least one of these conditions applies: …
When even the CPU manufacturers themselves have advised against using spinlocks unless necessary since at least 2011, I'd take their word for it. But the beauty here is that we don't even have to take their word for it, since your performance tests confirm it to be true. Pros: …
Cons: …
There's some more in-depth discussion about thread locking over at llama.cpp in this (now abandoned) PR: [llama.cpp] ggml: refactor compute thread: merge three spin variables into one #816. It's a long thread and not everything necessarily applies; for example, the proposition I made there about adding an #ifdef option for the different lock/sleep conditions is one I no longer agree with on second thought, as it would add unnecessary code complexity with no real advantage. I also made the point of being mindful about the cost of context switching, but that is clearly a non-issue here. To be clear, there's nothing from that discussion that needs to be added to this PR, but there is some good context and information for those interested in digging deeper. All in all, presentation-wise this is one of the best PRs I've ever seen anywhere in terms of reasoning and testing; the well-laid-out performance tests across multiple architectures and operating systems are just perfect, more than can reasonably be expected from anyone. In fact, I am bookmarking this PR as an example on the subject of "How to make a perfect PR".
A few clarifications I would like to ask about, though:
You have … now, so the PR description could be edited to reflect this?
Can you elaborate on this? Looking through the code, it looks like you have the most optimal solution. Regular mutexes are kernel objects, which require the thread to switch usermode->kernelmode->usermode whenever they are used, unlike critical sections, which can stay in usermode; afaik this is the reason they are so much faster. The CriticalSection/ConditionVariable paradigm is the fastest way to implement locking on Windows outside of spinlocks (which I wouldn't really call a locking mechanism anyway). The paradigm is widely used in low-level OS and kernel code, and my line of thinking usually goes: if something is good enough for low-level OS/kernel code, it's probably good, full stop.
As that is a different beast altogether, I'd say better to go ahead and merge this and not keep it unnecessarily blocked waiting on further developments, as on its own this PR looks ready to go? In my opinion, more & smaller PRs are a better option than fewer & larger ones anyway, as they make maintenance easier and it's easier to revert small pieces when something goes wrong.
@anzz1 Thanks for the thoughtful evaluation. I updated the PR description. Regarding the naïvety: I wrote that because I literally coded this off the top of my head, in one sitting. I just opened …
@ggerganov @janekb04 what would this require to get pulled into master? happy to take @janekb04 's work and clean it up / test it if that is all that is required. |
I tried this and it happens to be way slower in my setup:
In my tests, it goes from around 75s to process a 60-second audio file to 103–110s (with 4 threads). When running 6 threads, utilization goes between 250% and 560%, and it takes 114s. The commit id in my logs shows … Am I doing something wrong, or did this branch get outdated?
Well, it uses 50% less power - that's "double" the performance. Basically, instead of using spinlocks, I made whisper.cpp use condition variables with mutexes. Whisper is not really a low-latency system, which means that busy locks aren't the best choice for synchronisation. All the more so because whisper.cpp is also supposed to run on the web and on mobile devices, where users usually care about power usage. In this PR, I made whisper.cpp use the classical condition variable + mutex locking scheme instead. On a 12900KS without overclocking, this reduces the CPU usage (and hence the power consumption) by half. On the other hand, if we go for full 100% utilization, the computation time is reduced by about 25%. Performance tables below.

This is a draft because I haven't implemented the lock using pthreads yet, and the current Windows implementation is rather naive and suboptimal. I am also yet to optimize the computations themselves.