ONNX improvements (-62% in full-precision model size, 2.7x faster load and execution, quantizations) #73

Closed
xenova wants to merge 6 commits

Conversation

@xenova (Contributor) commented Dec 14, 2024

This PR improves the moonshine-onnx package in the following ways:

  • Significant reductions in model size / downloads for both tiny and base models, achieved by ensuring tied weights are not duplicated and by merging the decoders (with and without past key values) into a single model, without any loss in precision (a minimal deduplication sketch follows this list).

    • tiny: -61.8% (285MB → 109MB)
    • base: -57.6% (583MB → 247MB)
  • New quantizations (including 4-bit and 8-bit), further reducing model size with minimal differences in output. Note that the q4 quantizations only target MatMul ops, which is why they are larger than the q8 quantizations.

    • tiny: 55.1MB at 4-bit quantization, 28MB at 8-bit quantization. Sample outputs:
      fp32: ['Ever tried ever failed, no matter try again, fail again, fail better.']
      q4: ['Ever tried, ever failed, no matter, try again, fail again, fail better.']
      q8: ['Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.']
      
    • base: 98MB at 4-bit quantization, 63MB at 8-bit quantization
      fp32: ['Ever tried ever failed, no matter try again fail again fail better.']
      q4: ['Ever tried ever failed, no matter try again, fail again, fail better.']
      q8 decoder, fp32 encoder: ['Ever tried ever failed, no matter try again fail again fail better.']
      
      (q8 encoder in last case produces poor results)
  • Improved loading and execution times, as benchmarked with the following code. Note that these benchmarks exclude download time and measure only load time (i.e., the models were already downloaded).

    import moonshine_onnx as moonshine
    import time
    
    for i in range(10):
        start_time = time.time()
        output = moonshine.transcribe(moonshine.ASSETS_DIR / 'beckett.wav', 'moonshine/tiny')
        end_time = time.time()
    
        print(f"Execution time: {end_time - start_time} seconds")
    • tiny: (screenshot of the per-run execution times)

    • base: (screenshot of the per-run execution times)
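
For reference, here is a minimal sketch of how tied-weight deduplication can be applied to an exported ONNX graph using only the onnx package. This is not the exact export pipeline used for this PR; the file paths are placeholders, and subgraphs inside the merged decoder are not handled.

import onnx
from onnx import numpy_helper
from collections import defaultdict

def deduplicate_initializers(path_in, path_out):
    """Keep a single copy of byte-identical initializers (e.g. tied embeddings)."""
    model = onnx.load(path_in)
    graph = model.graph

    # Group initializer names by dtype, shape, and raw contents.
    groups = defaultdict(list)
    for init in graph.initializer:
        arr = numpy_helper.to_array(init)
        groups[(str(arr.dtype), arr.shape, arr.tobytes())].append(init.name)

    # Map every duplicate name onto the first name in its group.
    rename = {dup: names[0] for names in groups.values() for dup in names[1:]}

    # Rewrite node inputs to point at the kept tensors, then drop the duplicates.
    for node in graph.node:
        for i, name in enumerate(node.input):
            if name in rename:
                node.input[i] = rename[name]
    kept = [init for init in graph.initializer if init.name not in rename]
    del graph.initializer[:]
    graph.initializer.extend(kept)

    onnx.save(model, path_out)

deduplicate_initializers("decoder_model_merged.onnx", "decoder_model_merged_deduped.onnx")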

@xenova (Contributor, Author) commented Dec 14, 2024

The differences become more apparent if we separate the loading and execution:

| Model | Old Load Time (s) | New Load Time (s) | Load Time Reduction (%) | Old Run Time (s) | New Run Time (s) | Run Time Reduction (%) |
|-------|-------------------|-------------------|-------------------------|------------------|------------------|------------------------|
| Tiny  | 1.594 | 1.090 | 31.6 | 0.346 | 0.252 | 27.1 |
| Base  | 2.442 | 1.333 | 45.4 | 0.605 | 0.465 | 23.1 |

Benchmarking code

import time
import moonshine_onnx as moonshine
from moonshine_onnx.model import MoonshineOnnxModel
from moonshine_onnx.transcribe import load_audio

audio = load_audio(moonshine.ASSETS_DIR / 'beckett.wav')

load_start_time = time.time()
model = MoonshineOnnxModel(model_name='moonshine/base')
load_end_time = time.time()
print(f"Model load time: {load_end_time - load_start_time} seconds")

for i in range(10):
    start_time = time.time()
    tokens = model.generate(audio)
    end_time = time.time()

    print(f"Run #{i+1}: {end_time - start_time} seconds")

Raw data

Tiny

Old

Model load time: 1.5940916538238525 seconds
Run #1: 0.3504812717437744 seconds
Run #2: 0.3556952476501465 seconds
Run #3: 0.46249866485595703 seconds
Run #4: 0.3608577251434326 seconds
Run #5: 0.29972147941589355 seconds
Run #6: 0.3081827163696289 seconds
Run #7: 0.33364224433898926 seconds
Run #8: 0.3344881534576416 seconds
Run #9: 0.3328516483306885 seconds
Run #10: 0.31997060775756836 seconds

New

Model load time: 1.0903031826019287 seconds
Run #1: 0.22372031211853027 seconds
Run #2: 0.2659788131713867 seconds
Run #3: 0.2293243408203125 seconds
Run #4: 0.2531099319458008 seconds
Run #5: 0.23910117149353027 seconds
Run #6: 0.2526216506958008 seconds
Run #7: 0.230133056640625 seconds
Run #8: 0.28861451148986816 seconds
Run #9: 0.23595857620239258 seconds
Run #10: 0.3022029399871826 seconds

Base

Old

Model load time: 2.4421446323394775 seconds
Run #1: 0.6086058616638184 seconds
Run #2: 0.5442285537719727 seconds
Run #3: 0.609248161315918 seconds
Run #4: 0.6299099922180176 seconds
Run #5: 0.6160895824432373 seconds
Run #6: 0.57456374168396 seconds
Run #7: 0.6728155612945557 seconds
Run #8: 0.5604102611541748 seconds
Run #9: 0.6053454875946045 seconds
Run #10: 0.625809907913208 seconds

New

Model load time: 1.333482027053833 seconds
Run #1: 0.43103766441345215 seconds
Run #2: 0.4757063388824463 seconds
Run #3: 0.4413950443267822 seconds
Run #4: 0.44200587272644043 seconds
Run #5: 0.4499983787536621 seconds
Run #6: 0.5306398868560791 seconds
Run #7: 0.47008252143859863 seconds
Run #8: 0.481827974319458 seconds
Run #9: 0.47646117210388184 seconds
Run #10: 0.45035600662231445 seconds

@keveman (Contributor) commented Dec 15, 2024

Hi @xenova , this is so amazing! Thanks for the PR.
Any chance you can share the scripts used for generating the ONNX files?

@xenova (Contributor, Author) commented Dec 15, 2024

Absolutely! It's using a custom dev build of Optimum, which I'll publish soon. It's very similar to the whisper conversion config.

Will do later today 🔥

@keveman (Contributor) commented Dec 16, 2024

@xenova Ok to merge this, but will be really grateful for the code to generate the onnx files.

keveman previously approved these changes Dec 16, 2024
@xenova (Contributor, Author) commented Dec 16, 2024

Sure! Just a reminder that these are all still on dev branches and will be ready for use once huggingface/transformers#34784 is merged.

Here are the steps to convert:

  1. Install the dev branch of Optimum:
pip install --upgrade git+https://github.com/huggingface/optimum.git@add-moonshine-onnx
  2. Install the moonshine dev branch of transformers:
pip install --upgrade git+https://github.com/eustlb/transformers.git@add-moonshine
  3. Convert the model to ONNX:
optimum-cli export onnx -m Xenova/moonshine-tiny-hf ./output/

Note: I've uploaded transformers-compatible versions of the models to my HF account, but I'm happy to move these to your organization, if you'd like. (I can join the org, move, then leave, or you can simply clone the model yourself).

@keveman (Contributor) commented Dec 16, 2024

Note: I've uploaded transformers-compatible versions of the models to my HF account, but I'm happy to move these to your organization, if you'd like. (I can join the org, move, then leave, or you can simply clone the model yourself).

Sent you an invite to join the usefulsensors org on HF, please move it there.

@xenova (Contributor, Author) commented Dec 16, 2024

Sent you an invite to join the usefulsensors org on HF, please move it there.

Requested to join 👍 (I didn't see an invite, yet. Username is Xenova)

@petewarden commented:
Thanks so much for this @xenova, this is extremely useful!

I'm actually working on quantization of these models too. So far I've found that running the default ONNX Runtime quantize_dynamic() process causes a big hit to accuracy, so I'm going to dig a bit deeper when I get time. I'm using LibriSpeech English clean as my test set, and while I'm hoping to get the script properly added to the Moonshine repo soon, here's a gist of it in case it's useful for your work: https://gist.github.com/petewarden/09a17d2ded03d24e445c7e7681517ee9

You run it like:

py .\librispeech_wer.py --models_dir "C:\Users\pete\projects\models\xenova\tiny\quantized" --model_name "moonshine/tiny"

If you tell me which versions of the files you recommend I should be using (beyond the original float32 versions, which I've confirmed suffer no accuracy loss, as expected), I'll generate some accuracy numbers for those on my end. So far I've got 30.9% WER for the tiny _quantized variant; I'll keep working through the others.

Thanks again for this work, I know it will be helpful to a lot of people.

@xenova (Contributor, Author) commented Dec 18, 2024

I'm particularly interested in the _q4 variants, as these are very fast on WebGPU, so doing some evals on that would be great! Also, using the _quantized (a.k.a., q8) variant for the encoder can cause some issues, so maybe some hybrid testing (i.e., fp32 for encoder, q8 or q4 for decoder)?

The fp16 models are currently broken (a weird subgraph issue I'm trying to figure out) and we're looking into fixing that 🫡
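
For anyone trying the hybrid setup, here is a minimal sketch of what it amounts to at the ONNX Runtime level: a float32 encoder session alongside a quantized decoder session. The file names are hypothetical, and the generation loop (tokenizer, past-key-value handling) is omitted.

import onnxruntime as ort

# fp32 encoder next to a quantized (q4 or q8) merged decoder -- hypothetical file names.
encoder = ort.InferenceSession("encoder_model.onnx")
decoder = ort.InferenceSession("decoder_model_merged_q4.onnx")

# Inspect the expected input names before wiring up the generation loop.
print([i.name for i in encoder.get_inputs()])
print([i.name for i in decoder.get_inputs()])

# The encoder output feeds the merged decoder exactly as in the all-fp32 pipeline;
# only the decoder weights change, which is where most of the model size lives.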

@petewarden commented:
Great, thanks @xenova! The base _quantized WER I get is 16.64%, I'll try your suggestion of float encoders and quantized decoders.

Since you're targeting the web, and presumably file size is a big factor for you too, you might be interested in some experiments I've been doing with pure weight quantization, and reducing compressed file sizes for Moonshine: https://github.com/usefulsensors/onnx_shrink_ray

@xenova (Contributor, Author) commented Dec 18, 2024

I'll check out that repo! Regarding file size, remember to deduplicate the tied weights (duplication significantly increases size).

For example, at fp32, the tiny model is 109.1MB (30.9+78.2 MB):

(screenshots of the encoder and merged decoder file sizes)

and the fp32 base model is 246.8MB (166+80.8MB)

(screenshots of the encoder and merged decoder file sizes)

@petewarden commented:
remember to deduplicate the tied weights (duplication significantly increases size).

Definitely. I'll be trying the weight-only quantization on your float merged decoder models; it should help a lot.

I see 4.55% WER for tiny using the float encoder and q8 decoder, so you're right that the accuracy issues seem to be on the encoder side. I'm trying a float encoder and q4 decoder now and will let you know what I find.

Hopefully if I do some layer-by-layer comparisons between the float encoder and quantized version I can identify the problematic ops and exclude them from quantization, but I might not get to that for a few days.

@petewarden commented:
Tiny float encoder and q4 decoder gives a 4.84% WER, so the accuracy holds up well.

I did try my quantization approach to shrink the merged files, but ran into a bug in my code so they actually came out larger! I'll get back to that when I get a chance, but for now I'll prioritize figuring out why the encoder doesn't work well with activations quantized.

@petewarden commented:
I've created a script to compare the dynamic quantization activation layers with the float originals. Here are the results of one tiny run as a spreadsheet and I've inlined the start of it below.

(screenshot of the start of the per-layer comparison spreadsheet)

I was hoping there might be a single smoking gun, but it looks like more gradual degradation. The Erf nodes definitely aren't helping, but they're also downstream of the quantization, so it's more that they're amplifying earlier errors from the first Conv than they're introducing their own.

My next steps will be removing all quantization and then reintroducing it to the MatMul nodes, and then others, to see where things break down.
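
A simpler, output-level variant of that comparison (not the actual script used above) can already show how far a quantized encoder drifts from the float one; the input name and shape below are assumptions about the exported encoder.

import numpy as np
import onnxruntime as ort

# One second of stand-in 16 kHz audio; a real evaluation would use LibriSpeech clips.
audio = np.random.randn(1, 16000).astype(np.float32)

float_enc = ort.InferenceSession("encoder_model.onnx")
quant_enc = ort.InferenceSession("encoder_model_quantized.onnx")

ref = float_enc.run(None, {"input_values": audio})[0]
test = quant_enc.run(None, {"input_values": audio})[0]

print("max abs diff :", np.abs(ref - test).max())
print("mean abs diff:", np.abs(ref - test).mean())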

@petewarden commented:
I've improved the accuracy of the encoder a lot by using the quantize_dynamic() function but excluding Conv nodes. The resulting files are available at:

These were created using Shrink Ray commands like:

py .\shrink.py --output_suffix ".onnx" --output_dir xenova\base\int_activations_matmul_mul --method "integer_activations" --op_types_to_quantize "MatMul,Mul"

In this case, the call to shrink.py is a very thin wrapper around the standard ORT quantize_dynamic() function.
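
For readers without Shrink Ray at hand, a roughly equivalent call using onnxruntime's public quantization API would look like the following; the paths are placeholders, and the actual wrapper may set additional options.

from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="encoder_model.onnx",
    model_output="encoder_model_int8.onnx",
    op_types_to_quantize=["MatMul", "Mul"],  # leave Conv (and everything else) in float
    weight_type=QuantType.QInt8,
)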

I now see a WER of 4.57% for tiny and 3.32% for base, so almost exactly the same as the float versions. Oddly enough the file sizes aren't much reduced, though I've confirmed with Netron that the DynamicQuantizeLinear ops are present in the expected places. I'll have to investigate what's happening with those weight constants.

@petewarden commented:
And thinking about this, it makes sense that the initial conv can't be eight bit, since it's taking in sixteen bit audio samples to create the features, and so we're losing a lot of information there. I think @keveman tried to explain this to me earlier but I failed to get it. :)

@petewarden commented:
I've fixed Shrink Ray to handle subgraphs (which was the issue with the merged decoder) and I have some new files up at:

https://drive.google.com/file/d/1Nrtu8oO-j2VvBIhXjl3m2wEbfJug6_fC/view?usp=sharing
https://drive.google.com/file/d/1xax4dz-IdoxpjQkbbDvwQcA4O6gXmD_6/view?usp=sharing

I see WERs of 4.71% for tiny, and 3.32% for base. The uncompressed file sizes are 35MB and 83MB, gzip gives 27MB and 63MB, brotli gets to 20MB and 61MB.

There's probably still some work to do replacing the DequantizeLinear ops (which are very slow) with a Cast and the appropriate Mul and Add, but I'll look at that once I have some latency numbers to see what difference that might make.
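
For context, DequantizeLinear computes (x - zero_point) * scale, which can be rewritten as a Cast followed by a Mul and an Add whose constants are folded offline. Here is a minimal sketch of building those replacement nodes with onnx.helper; the node and tensor names are hypothetical, and wiring them into an existing graph is omitted.

import numpy as np
from onnx import TensorProto, helper, numpy_helper

scale, zero_point = 0.02, 3  # example per-tensor quantization parameters

# (x_int8 - zero_point) * scale  ==  cast(x_int8) * scale + (-zero_point * scale)
cast = helper.make_node("Cast", ["x_int8"], ["x_float"], to=TensorProto.FLOAT)
mul = helper.make_node("Mul", ["x_float", "deq_scale"], ["x_scaled"])
add = helper.make_node("Add", ["x_scaled", "deq_offset"], ["x_dequantized"])

constants = [
    numpy_helper.from_array(np.array(scale, dtype=np.float32), name="deq_scale"),
    numpy_helper.from_array(np.array(-zero_point * scale, dtype=np.float32), name="deq_offset"),
]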

For posterity, the commands to create these from the original float models provided are:

python .\shrink.py --output_suffix ".onnx" --output_dir xenova\tiny\int_activations_matmul_mul --method "integer_activations" --op_types_to_quantize "MatMul,Mul" xenova\tiny\original
python .\shrink.py --method "integer_weights" --output_suffix ".onnx" --output_dir xenova\tiny\int_activations_matmul_mul_shrunk\ xenova\tiny\int_activations_matmul_mul

@petewarden commented:
I've finally got back to this and have fixed some issues with the weight quantization in Shrink Ray (replaced the slow DequantizeLinear op with equivalent casts, muls, and adds; removed duplicate initializers), and I now have working tiny and base models available.

All of the initial feature generation convs are left as float32, but the rest of the models are quantized dynamically. The accuracy holds up (4.75% WER on LibriSpeech clean for tiny, 3.30% WER for base) and the resulting models are small. Tiny is 26MB uncompressed, and 20MB compressed with either gzip or brotli. Base is 59MB uncompressed, 45MB for gzip, and 43MB for brotli.

For posterity, here are the commands I used to generate these:

python shrink.py --output_suffix ".onnx" --output_dir int_activations_matmul_mul --method "integer_activations" --nodes_to_exclude "/conv1/Conv,/conv2/Conv,/conv3/Conv" original
python shrink.py --method "integer_weights" --output_suffix ".onnx" --output_dir int_activations_matmul_mul_shrunk int_activations_matmul_mul

Now that we have accurate quantized models that are as small as possible, I'll work with @keveman on getting this change landed. I'd be interested in seeing the q4 results too, if you can share the commands you used to generate those, @xenova?

@xenova (Contributor, Author) commented Jan 17, 2025

Wow this is phenomenal work 🤯 Great stuff!

Here's how to make the q4 quantizations:

import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer
from optimum.onnx.graph_transformations import check_and_save_model

model_path = './path/to/model.onnx'
save_path = './path/to/model_q4.onnx'

model = onnx.load(model_path)
quantizer = MatMul4BitsQuantizer(
    model=model,
    # Optional:
    # block_size=block_size,
    # is_symmetric=is_symmetric,
    # accuracy_level=accuracy_level,
)
quantizer.process()  # quantizes the MatMul weights to 4 bits in place
check_and_save_model(quantizer.model.model, save_path)

Would you like to upload those weights to a repo on the HF hub? Ideally, structured like https://huggingface.co/onnx-community/moonshine-base-ONNX. I can help if you'd like.

@xenova (Contributor, Author) commented Jan 17, 2025

@petewarden On a similar note, I think a model like https://huggingface.co/onnx-community/Kokoro-82M-ONNX could greatly benefit from onnx shrink ray. 👀

@petewarden commented:
It's taken me a while to get back to this, mostly because I've been trying to hack static_quantization() to understand subgraphs. Unfortunately this turned out to be a pretty big task, so I'm going to abandon that and focus on cleaning up the dynamically quantized models, try out 4-bit, and prepare to get them uploaded. Thanks for the hint on the 4-bit quantization, I'd missed that it was officially supported!

@petewarden commented:
Here's the HuggingFace PR for adding the files: https://huggingface.co/UsefulSensors/moonshine/discussions/6

Once that's in, I'll try to shepherd this PR in as well.

@petewarden commented:
I think this is ready to be committed now; I've asked @keveman for a final check.

keveman requested a review from evmaki on February 4, 2025 at 00:46
keveman previously approved these changes Feb 4, 2025
@keveman (Contributor) left a comment:

Code looks good, but it looks like it might fail the pre-submit checks, as I see a missing newline. @evmaki, can you please take a look, modify if needed for style checks, and merge it?

@evmaki (Contributor) commented Feb 4, 2025

Applied formatting and manually rebased and merged with main.

@evmaki closed this Feb 4, 2025