ONNX improvements (-62% in full-precision model size, 2.7x faster load and execution, quantizations) #73

Closed
xenova wants to merge 6 commits

Conversation

@xenova (Contributor) commented Dec 14, 2024

This PR improves the moonshine-onnx package in the following ways:

  • Significant reductions in model size / downloads for both tiny and base models, achieved by ensuring tied weights are not duplicated and by merging the decoders (with and without past key values) into a single model, without any loss in precision (a minimal deduplication sketch follows this list).

    • tiny: -61.8% (285MB → 109MB)
    • base: -57.6% (583MB → 247MB)
  • New quantizations (including 4-bit and 8-bit), further reducing model size with minimal differences in output. Note that the q4 quantizations only target MatMul ops, which is why they are larger than the q8 quantizations.

    • tiny: 55.1MB at 4-bit quantization, 28MB at 8-bit quantization. Sample outputs:
      fp32: ['Ever tried ever failed, no matter try again, fail again, fail better.']
      q4: ['Ever tried, ever failed, no matter, try again, fail again, fail better.']
      q8: ['Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.']
      
    • base: 98MB at 4-bit quantization, 63MB at 8-bit quantization
      fp32: ['Ever tried ever failed, no matter try again fail again fail better.']
      q4: ['Ever tried ever failed, no matter try again, fail again, fail better.']
      q8 decoder, fp32 encoder: ['Ever tried ever failed, no matter try again fail again fail better.']
      
      (q8 encoder in last case produces poor results)
  • Improved loading and execution times, as benchmarked with the following code. Note that these benchmarks exclude download time and measure only load time (i.e., the models were already downloaded).

    import moonshine_onnx as moonshine
    import time
    
    for i in range(10):
        start_time = time.time()
        output = moonshine.transcribe(moonshine.ASSETS_DIR / 'beckett.wav', 'moonshine/tiny')
        end_time = time.time()
    
        print(f"Execution time: {end_time - start_time} seconds")
    • tiny: (screenshot of the per-run execution times)

    • base: (screenshot of the per-run execution times)
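
For reference, here is a minimal sketch of how tied-weight deduplication can be applied to an exported ONNX graph using only the onnx package. This is not the exact export pipeline used for this PR; the file paths are placeholders, and subgraphs inside the merged decoder are not handled.

import onnx
from onnx import numpy_helper
from collections import defaultdict

def deduplicate_initializers(path_in, path_out):
    """Keep a single copy of byte-identical initializers (e.g. tied embeddings)."""
    model = onnx.load(path_in)
    graph = model.graph

    # Group initializer names by dtype, shape, and raw contents.
    groups = defaultdict(list)
    for init in graph.initializer:
        arr = numpy_helper.to_array(init)
        groups[(str(arr.dtype), arr.shape, arr.tobytes())].append(init.name)

    # Map every duplicate name onto the first name in its group.
    rename = {dup: names[0] for names in groups.values() for dup in names[1:]}

    # Rewrite node inputs to point at the kept tensors, then drop the duplicates.
    for node in graph.node:
        for i, name in enumerate(node.input):
            if name in rename:
                node.input[i] = rename[name]
    kept = [init for init in graph.initializer if init.name not in rename]
    del graph.initializer[:]
    graph.initializer.extend(kept)

    onnx.save(model, path_out)

deduplicate_initializers("decoder_model_merged.onnx", "decoder_model_merged_deduped.onnx")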

@xenova (Contributor, Author) commented Dec 14, 2024

The differences become more apparent if we separate the loading and execution:

| Model | Old Load Time (s) | New Load Time (s) | Load Time Reduction (%) | Old Run Time (s) | New Run Time (s) | Run Time Reduction (%) |
|-------|-------------------|-------------------|-------------------------|------------------|------------------|------------------------|
| Tiny  | 1.594 | 1.090 | 31.6 | 0.346 | 0.252 | 27.1 |
| Base  | 2.442 | 1.333 | 45.4 | 0.605 | 0.465 | 23.1 |

Benchmarking code

import time
import moonshine_onnx as moonshine
from moonshine_onnx.model import MoonshineOnnxModel
from moonshine_onnx.transcribe import load_audio

audio = load_audio(moonshine.ASSETS_DIR / 'beckett.wav')

load_start_time = time.time()
model = MoonshineOnnxModel(model_name='moonshine/base')
load_end_time = time.time()
print(f"Model load time: {load_end_time - load_start_time} seconds")

for i in range(10):
    start_time = time.time()
    tokens = model.generate(audio)
    end_time = time.time()

    print(f"Run #{i+1}: {end_time - start_time} seconds")

Raw data

Tiny

Old

Model load time: 1.5940916538238525 seconds
Run #1: 0.3504812717437744 seconds
Run #2: 0.3556952476501465 seconds
Run #3: 0.46249866485595703 seconds
Run #4: 0.3608577251434326 seconds
Run #5: 0.29972147941589355 seconds
Run #6: 0.3081827163696289 seconds
Run #7: 0.33364224433898926 seconds
Run #8: 0.3344881534576416 seconds
Run #9: 0.3328516483306885 seconds
Run #10: 0.31997060775756836 seconds

New

Model load time: 1.0903031826019287 seconds
Run #1: 0.22372031211853027 seconds
Run #2: 0.2659788131713867 seconds
Run #3: 0.2293243408203125 seconds
Run #4: 0.2531099319458008 seconds
Run #5: 0.23910117149353027 seconds
Run #6: 0.2526216506958008 seconds
Run #7: 0.230133056640625 seconds
Run #8: 0.28861451148986816 seconds
Run #9: 0.23595857620239258 seconds
Run #10: 0.3022029399871826 seconds

Base

Old

Model load time: 2.4421446323394775 seconds
Run #1: 0.6086058616638184 seconds
Run #2: 0.5442285537719727 seconds
Run #3: 0.609248161315918 seconds
Run #4: 0.6299099922180176 seconds
Run #5: 0.6160895824432373 seconds
Run #6: 0.57456374168396 seconds
Run #7: 0.6728155612945557 seconds
Run #8: 0.5604102611541748 seconds
Run #9: 0.6053454875946045 seconds
Run #10: 0.625809907913208 seconds

New

Model load time: 1.333482027053833 seconds
Run #1: 0.43103766441345215 seconds
Run #2: 0.4757063388824463 seconds
Run #3: 0.4413950443267822 seconds
Run #4: 0.44200587272644043 seconds
Run #5: 0.4499983787536621 seconds
Run #6: 0.5306398868560791 seconds
Run #7: 0.47008252143859863 seconds
Run #8: 0.481827974319458 seconds
Run #9: 0.47646117210388184 seconds
Run #10: 0.45035600662231445 seconds

@keveman (Contributor) commented Dec 15, 2024

Hi @xenova , this is so amazing! Thanks for the PR.
Any chance you can share the scripts used for generating the ONNX files?

@xenova (Contributor, Author) commented Dec 15, 2024

Absolutely! It's using a custom dev build of Optimum, which I'll publish soon. It's very similar to the whisper conversion config.

Will do later today 🔥

@keveman (Contributor) commented Dec 16, 2024

@xenova Ok to merge this, but will be really grateful for the code to generate the onnx files.

keveman previously approved these changes Dec 16, 2024
@xenova (Contributor, Author) commented Dec 16, 2024

Sure! Just a reminder that these are all still on dev branches and will be ready for use once huggingface/transformers#34784 is merged.

Here are the steps to convert:

  1. Install the dev branch of Optimum:
pip install --upgrade git+https://github.com/huggingface/optimum.git@add-moonshine-onnx
  2. Install the moonshine dev branch of transformers:
pip install --upgrade git+https://github.com/eustlb/transformers.git@add-moonshine
  3. Convert the model to ONNX:
optimum-cli export onnx -m Xenova/moonshine-tiny-hf ./output/

Note: I've uploaded transformers-compatible versions of the models to my HF account, but I'm happy to move these to your organization, if you'd like. (I can join the org, move, then leave, or you can simply clone the model yourself).

@keveman (Contributor) commented Dec 16, 2024

Note: I've uploaded transformers-compatible versions of the models to my HF account, but I'm happy to move these to your organization, if you'd like. (I can join the org, move, then leave, or you can simply clone the model yourself).

Sent you an invite to join the usefulsensors org on HF, please move it there.

@xenova (Contributor, Author) commented Dec 16, 2024

Sent you an invite to join the usefulsensors org on HF, please move it there.

Requested to join 👍 (I didn't see an invite, yet. Username is Xenova)

@petewarden commented:
Thanks so much for this @xenova, this is extremely useful!

I'm actually working on quantization of these models too. So far I've found that running the default ONNX Runtime quantize_dynamic() process causes a big hit to accuracy, so I'm going to dig a bit deeper when I get time. I'm using LibriSpeech English clean as my test set, and while I'm hoping to get the script properly added to the Moonshine repo soon, here's a gist of it in case it's useful for your work: https://gist.github.com/petewarden/09a17d2ded03d24e445c7e7681517ee9

You run it like:

py .\librispeech_wer.py --models_dir "C:\Users\pete\projects\models\xenova\tiny\quantized" --model_name "moonshine/tiny"

If you tell me which versions of the files you recommend I should be using (beyond the original float32 versions, which I've confirmed suffer no accuracy loss, as expected), I'll generate some accuracy numbers for those on my end. So far I've got 30.9% WER for the tiny _quantized variant; I'll keep working through the others.

Thanks again for this work, I know it will be helpful to a lot of people.

@xenova (Contributor, Author) commented Dec 18, 2024

I'm particularly interested in the _q4 variants, as these are very fast on WebGPU, so doing some evals on that would be great! Also, using the _quantized (a.k.a., q8) variant for the encoder can cause some issues, so maybe some hybrid testing (i.e., fp32 for encoder, q8 or q4 for decoder)?

The fp16 models are currently broken (a weird subgraph issue I'm trying to figure out) and we're looking into fixing that 🫡
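
For anyone trying the hybrid setup, here is a minimal sketch of what it amounts to at the ONNX Runtime level: a float32 encoder session alongside a quantized decoder session. The file names are hypothetical, and the generation loop (tokenizer, past-key-value handling) is omitted.

import onnxruntime as ort

# fp32 encoder next to a quantized (q4 or q8) merged decoder -- hypothetical file names.
encoder = ort.InferenceSession("encoder_model.onnx")
decoder = ort.InferenceSession("decoder_model_merged_q4.onnx")

# Inspect the expected input names before wiring up the generation loop.
print([i.name for i in encoder.get_inputs()])
print([i.name for i in decoder.get_inputs()])

# The encoder output feeds the merged decoder exactly as in the all-fp32 pipeline;
# only the decoder weights change, which is where most of the model size lives.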

@petewarden commented:
Great, thanks @xenova! The base _quantized WER I get is 16.64%, I'll try your suggestion of float encoders and quantized decoders.

Since you're targeting the web, and presumably file size is a big factor for you too, you might be interested in some experiments I've been doing with pure weight quantization, and reducing compressed file sizes for Moonshine: https://github.com/usefulsensors/onnx_shrink_ray

@xenova (Contributor, Author) commented Dec 18, 2024

I'll check out that repo! Regarding file size, remember to deduplicate the tied weights (duplication significantly increases size).

For example, at fp32, the tiny model is 109.1MB (30.9+78.2 MB):

(screenshots of the encoder and merged decoder file sizes)

and the fp32 base model is 246.8MB (166+80.8MB)

(screenshots of the encoder and merged decoder file sizes)

@petewarden commented:
remember to deduplicate the tied weights (duplication significantly increases size).

Definitely. I'll be trying the weight-only quantization on your float merged decoder models; it should help a lot.

I see 4.55% WER for tiny using the float encoder and q8 decoder, so you're right that the accuracy issues seem to be on the encoder side. I'm trying a float encoder and q4 decoder now and will let you know what I find.

Hopefully if I do some layer-by-layer comparisons between the float encoder and quantized version I can identify the problematic ops and exclude them from quantization, but I might not get to that for a few days.

@petewarden commented:
Tiny float encoder and q4 decoder gives a 4.84% WER, so the accuracy holds up well.

I did try my quantization approach to shrink the merged files, but ran into a bug in my code so they actually came out larger! I'll get back to that when I get a chance, but for now I'll prioritize figuring out why the encoder doesn't work well with activations quantized.

@petewarden commented:
I've created a script to compare the dynamic quantization activation layers with the float originals. Here are the results of one tiny run as a spreadsheet and I've inlined the start of it below.

(screenshot of the start of the per-layer comparison spreadsheet)

I was hoping there might be a single smoking gun, but it looks like more gradual degradation. The Erf nodes definitely aren't helping, but they're also downstream of the quantization, so it's more that they're amplifying earlier errors from the first Conv than they're introducing their own.

My next steps will be removing all quantization and then reintroducing it to the MatMul nodes, and then others, to see where things break down.
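
A simpler, output-level variant of that comparison (not the actual script used above) can already show how far a quantized encoder drifts from the float one; the input name and shape below are assumptions about the exported encoder.

import numpy as np
import onnxruntime as ort

# One second of stand-in 16 kHz audio; a real evaluation would use LibriSpeech clips.
audio = np.random.randn(1, 16000).astype(np.float32)

float_enc = ort.InferenceSession("encoder_model.onnx")
quant_enc = ort.InferenceSession("encoder_model_quantized.onnx")

ref = float_enc.run(None, {"input_values": audio})[0]
test = quant_enc.run(None, {"input_values": audio})[0]

print("max abs diff :", np.abs(ref - test).max())
print("mean abs diff:", np.abs(ref - test).mean())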

@petewarden commented:
I've improved the accuracy of the encoder a lot by using the quantize_dynamic() function but excluding Conv nodes. The resulting files are available at:

These were created using Shrink Ray commands like:

py .\shrink.py --output_suffix ".onnx" --output_dir xenova\base\int_activations_matmul_mul --method "integer_activations" --op_types_to_quantize "MatMul,Mul"

In this case, the call to shrink.py is a very thin wrapper around the standard ORT quantize_dynamic() function.
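
For readers without Shrink Ray at hand, a roughly equivalent call using onnxruntime's public quantization API would look like the following; the paths are placeholders, and the actual wrapper may set additional options.

from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="encoder_model.onnx",
    model_output="encoder_model_int8.onnx",
    op_types_to_quantize=["MatMul", "Mul"],  # leave Conv (and everything else) in float
    weight_type=QuantType.QInt8,
)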

I now see a WER of 4.57% for tiny and 3.32% for base, so almost exactly the same as the float versions. Oddly enough the file sizes aren't much reduced, though I've confirmed with Netron that the DynamicQuantizeLinear ops are present in the expected places. I'll have to investigate what's happening with those weight constants.

@petewarden commented:
And thinking about this, it makes sense that the initial conv can't be eight bit, since it's taking in sixteen bit audio samples to create the features, and so we're losing a lot of information there. I think @keveman tried to explain this to me earlier but I failed to get it. :)

@petewarden commented:
I've fixed Shrink Ray to handle subgraphs (which was the issue with the merged decoder) and I have some new files up at:

https://drive.google.com/file/d/1Nrtu8oO-j2VvBIhXjl3m2wEbfJug6_fC/view?usp=sharing
https://drive.google.com/file/d/1xax4dz-IdoxpjQkbbDvwQcA4O6gXmD_6/view?usp=sharing

I see WERs of 4.71% for tiny, and 3.32% for base. The uncompressed file sizes are 35MB and 83MB, gzip gives 27MB and 63MB, brotli gets to 20MB and 61MB.

There's probably still some work to do replacing the DequantizeLinear ops (which are very slow) with a Cast and the appropriate Mul and Add, but I'll look at that once I have some latency numbers to see what difference that might make.
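
For context, DequantizeLinear computes (x - zero_point) * scale, which can be rewritten as a Cast followed by a Mul and an Add whose constants are folded offline. Here is a minimal sketch of building those replacement nodes with onnx.helper; the node and tensor names are hypothetical, and wiring them into an existing graph is omitted.

import numpy as np
from onnx import TensorProto, helper, numpy_helper

scale, zero_point = 0.02, 3  # example per-tensor quantization parameters

# (x_int8 - zero_point) * scale  ==  cast(x_int8) * scale + (-zero_point * scale)
cast = helper.make_node("Cast", ["x_int8"], ["x_float"], to=TensorProto.FLOAT)
mul = helper.make_node("Mul", ["x_float", "deq_scale"], ["x_scaled"])
add = helper.make_node("Add", ["x_scaled", "deq_offset"], ["x_dequantized"])

constants = [
    numpy_helper.from_array(np.array(scale, dtype=np.float32), name="deq_scale"),
    numpy_helper.from_array(np.array(-zero_point * scale, dtype=np.float32), name="deq_offset"),
]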

For posterity, the commands to create these from the original float models provided are:

python .\shrink.py --output_suffix ".onnx" --output_dir xenova\tiny\int_activations_matmul_mul --method "integer_activations" --op_types_to_quantize "MatMul,Mul" xenova\tiny\original
python .\shrink.py --method "integer_weights" --output_suffix ".onnx" --output_dir xenova\tiny\int_activations_matmul_mul_shrunk\ xenova\tiny\int_activations_matmul_mul

@petewarden commented:
I've finally got back to this and have fixed some issues with the weight quantization in Shrink Ray (replaced the slow DequantizeLinear op with equivalent casts, muls, and adds; removed duplicate initializers), and I now have working tiny and base models available.

All of the initial feature generation convs are left as float32, but the rest of the models are quantized dynamically. The accuracy holds up (4.75% WER on LibriSpeech clean for tiny, 3.30% WER for base) and the resulting models are small. Tiny is 26MB uncompressed, and 20MB compressed with either gzip or brotli. Base is 59MB uncompressed, 45MB for gzip, and 43MB for brotli.

For posterity, here are the commands I used to generate these:

python shrink.py --output_suffix ".onnx" --output_dir int_activations_matmul_mul --method "integer_activations" --nodes_to_exclude "/conv1/Conv,/conv2/Conv,/conv3/Conv" original
python shrink.py --method "integer_weights" --output_suffix ".onnx" --output_dir int_activations_matmul_mul_shrunk int_activations_matmul_mul

Now that we have accurate quantized models that are as small as possible, I'll work with @keveman on getting this change landed. I'd be interested in seeing the q4 results too, if you can share the commands you used to generate those, @xenova?

@xenova (Contributor, Author) commented Jan 17, 2025

Wow this is phenomenal work 🤯 Great stuff!

Here's how to make the q4 quantizations:

import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer
from optimum.onnx.graph_transformations import check_and_save_model

model_path = './path/to/model.onnx'
save_path = './path/to/model_q4.onnx'

model = onnx.load(model_path)
quantizer = MatMul4BitsQuantizer(
    model=model,
    # Optional:
    # block_size=block_size,
    # is_symmetric=is_symmetric,
    # accuracy_level=accuracy_level,
)
quantizer.process()  # quantizes the MatMul weights to 4 bits in place
check_and_save_model(quantizer.model.model, save_path)

Would you like to upload those weights to a repo on the HF hub? Ideally, structured like https://huggingface.co/onnx-community/moonshine-base-ONNX. I can help if you'd like.

@xenova (Contributor, Author) commented Jan 17, 2025

@petewarden On a similar note, I think a model like https://huggingface.co/onnx-community/Kokoro-82M-ONNX could greatly benefit from onnx shrink ray. 👀

@petewarden commented:
It's taken me a while to get back to this, mostly because I've been trying to hack static_quantization() to understand subgraphs. Unfortunately this turned out to be a pretty big task, so I'm going to abandon that and focus on cleaning up the dynamically quantized models, try out 4-bit, and prepare to get them uploaded. Thanks for the hint on the 4-bit quantization, I'd missed that it was officially supported!

@petewarden commented:
Here's the HuggingFace PR for adding the files: https://huggingface.co/UsefulSensors/moonshine/discussions/6

Once that's in, I'll try to shepherd this PR in as well.

@petewarden commented:
I think this is ready to be committed now; I've asked @keveman for a final check.

keveman requested a review from evmaki on February 4, 2025 at 00:46
keveman previously approved these changes Feb 4, 2025
@keveman (Contributor) left a comment:

Code looks good, but it looks like it might fail the pre-submit checks, as I see a missing newline. @evmaki, can you please take a look, modify if needed for style checks, and merge it?

@evmaki (Contributor) commented Feb 4, 2025

Applied formatting and manually rebased and merged with main.

@evmaki closed this Feb 4, 2025