ONNX improvements (-62% in full-precision model size, 2.7x faster load and execution, quantizations) #73
Conversation
The differences become more apparent if we separate the loading and execution:
Benchmarking code:

```python
import time

import moonshine_onnx as moonshine
from moonshine_onnx.model import MoonshineOnnxModel
from moonshine_onnx.transcribe import load_audio

audio = load_audio(moonshine.ASSETS_DIR / 'beckett.wav')

load_start_time = time.time()
model = MoonshineOnnxModel(model_name='moonshine/base')
load_end_time = time.time()
print(f"Model load time: {load_end_time - load_start_time} seconds")

for i in range(10):
    start_time = time.time()
    tokens = model.generate(audio)
    end_time = time.time()
    print(f"Run #{i+1}: {end_time - start_time} seconds")
```

Raw data: per-run timings for tiny (old vs. new) and base (old vs. new).
Hi @xenova, this is so amazing! Thanks for the PR.
Absolutely! It's using a custom dev build of Optimum, which I'll publish soon. It's very similar to the whisper conversion config. Will do later today 🔥
@xenova Ok to merge this, but I'd be really grateful for the code to generate the ONNX files.
Sure! Just a reminder that these are all still on dev branches, and will be ready for use when huggingface/transformers#34784 is merged. Here are the steps to convert:
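(As a rough, assumed sketch only: the generic Optimum ONNX export entry point looks like the following. This is not the exact dev-branch procedure; the model id, task name, and output directory are placeholders.)

```python
# Assumed sketch of a generic Optimum ONNX export; not the exact dev-branch steps.
from optimum.exporters.onnx import main_export

main_export(
    model_name_or_path="UsefulSensors/moonshine-tiny",  # placeholder model id
    output="moonshine-tiny-onnx",                       # placeholder output directory
    task="automatic-speech-recognition-with-past",      # assumed task name
)
```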
Note: I've uploaded transformers-compatible versions of the models to my HF account, but I'm happy to move these to your organization if you'd like. (I can join the org, move, then leave, or you can simply clone the model yourself.)
Sent you an invite to join the usefulsensors org on HF; please move it there.
Requested to join 👍 (I didn't see an invite yet. Username is Xenova)
Thanks so much for this @xenova, this is extremely useful! I'm actually working on quantization of these models too. So far I've found that running the default ONNX Runtime dynamic quantization over the whole model costs a lot of accuracy, so I've been using a script (librispeech_wer.py) to measure word error rate (WER) on LibriSpeech. You run it like:

```
py .\librispeech_wer.py --models_dir "C:\Users\pete\projects\models\xenova\tiny\quantized" --model_name "moonshine/tiny"
```

If you tell me which versions of the files you recommend I should be using (beyond the original float32 versions, which I've confirmed suffer no accuracy loss, as expected) I'll generate some accuracy numbers for those on my end. So far I've got 30.9% WER for the tiny quantized model.

Thanks again for this work, I know it will be helpful to a lot of people.
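For reference, a minimal sketch of the stock ONNX Runtime dynamic quantization entry point referred to above (the file paths are placeholders, not the actual files used here):

```python
# Minimal sketch of stock ONNX Runtime dynamic quantization: weights are quantized
# offline to int8, activations are quantized on the fly at runtime.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="encoder_model.onnx",      # placeholder path
    model_output="encoder_model_q8.onnx",  # placeholder path
    weight_type=QuantType.QInt8,
)
```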
I'm particularly interested in the results with a float32 encoder paired with a quantized decoder; I suspect the accuracy loss comes mostly from quantizing the encoder.

The fp16 models are currently broken (a weird subgraph issue I'm trying to figure out) and we're looking into fixing that 🫡
Great, thanks @xenova! The base _quantized WER I get is 16.64%; I'll try your suggestion of float encoders and quantized decoders.

Since you're targeting the web, and presumably file size is a big factor for you too, you might be interested in some experiments I've been doing with pure weight quantization, and reducing compressed file sizes for Moonshine: https://github.com/usefulsensors/onnx_shrink_ray
Definitely, I'll be trying the weight-only quantization on your float merged decoder models; it should help a lot.

I see 4.55% WER for tiny using the float encoder and q8 decoder, so you're right that the accuracy issues seem to be on the encoder side. I'm trying a float encoder and q4 decoder now and will let you know what I find. Hopefully, if I do some layer-by-layer comparisons between the float encoder and the quantized version, I can identify the problematic ops and exclude them from quantization, but I might not get to that for a few days.
The tiny float encoder and q4 decoder give a 4.84% WER, so the accuracy holds up well. I did try my quantization approach to shrink the merged files, but I ran into a bug in my code, so they actually came out larger! I'll get back to that when I get a chance, but for now I'll prioritize figuring out why the encoder doesn't work well with activations quantized.
I've created a script to compare the dynamically quantized activation layers with the float originals. Here are the results of one tiny run as a spreadsheet, and I've inlined the start of it below. I was hoping there might be a single smoking gun, but it looks more like gradual degradation. The Erf nodes definitely aren't helping, but they're also downstream of the quantization, so it's more that they're amplifying earlier errors from the first Conv than introducing their own. My next steps will be removing all quantization and then reintroducing it to the MatMul nodes, and then others, to see where things break down.
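A hedged sketch of one way to do such a layer-by-layer comparison with ONNX Runtime (this is not the actual script; the model paths, input name, and input shape are placeholders):

```python
# Hedged sketch: diff intermediate activations between a float encoder and its
# dynamically quantized counterpart. Paths, input name, and shape are placeholders.
import numpy as np
import onnx
import onnxruntime as ort

def session_with_intermediates(path):
    # Run shape inference, then promote every inferred tensor to a graph output
    # so onnxruntime returns the intermediate activations as well.
    model = onnx.shape_inference.infer_shapes(onnx.load(path))
    existing = {o.name for o in model.graph.output}
    for vi in model.graph.value_info:
        if vi.name not in existing:
            model.graph.output.extend([vi])
    return ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])

float_sess = session_with_intermediates("encoder_float32.onnx")    # placeholder path
quant_sess = session_with_intermediates("encoder_quantized.onnx")  # placeholder path

feed = {"input_values": np.random.randn(1, 16000).astype(np.float32)}  # placeholder input
float_outs = dict(zip([o.name for o in float_sess.get_outputs()], float_sess.run(None, feed)))
quant_outs = dict(zip([o.name for o in quant_sess.get_outputs()], quant_sess.run(None, feed)))

# Tensors that keep the same name in both graphs can be compared directly.
for name, ref in float_outs.items():
    if name in quant_outs and quant_outs[name].shape == ref.shape:
        print(f"{name}: mean abs diff {np.mean(np.abs(ref - quant_outs[name])):.6f}")
```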
I've improved the accuracy of the encoder a lot by using the integer-activations approach while restricting quantization to the MatMul and Mul ops. These were created using Shrink Ray commands like:

```
py .\shrink.py --output_suffix ".onnx" --output_dir xenova\base\int_activations_matmul_mul --method "integer_activations" --op_types_to_quantize "MatMul,Mul"
```

In this case, the call to shrink.py is a very thin wrapper around the standard ORT dynamic quantization API.

I now see a WER of 4.57% for tiny and 3.32% for base, so almost exactly the same as the float versions. Oddly enough, the file sizes aren't much reduced, though I've confirmed with Netron that the DynamicQuantizeLinear ops are present in the expected places. I'll have to investigate what's happening with those weight constants.
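As a rough guide to what that thin wrapper maps to (an assumption about the underlying call, not the actual shrink.py source), it is essentially the same quantize_dynamic entry point shown earlier with the op types restricted:

```python
# Assumed sketch of a thin wrapper over ORT dynamic quantization limited to MatMul/Mul;
# an illustration of the stock API, not the actual shrink.py implementation.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="encoder_model.onnx",            # placeholder path
    model_output="encoder_int_matmul_mul.onnx",  # placeholder path
    op_types_to_quantize=["MatMul", "Mul"],      # only these op types are quantized
    weight_type=QuantType.QInt8,
)
```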
And thinking about this, it makes sense that the initial conv can't be eight-bit, since it's taking in sixteen-bit audio samples to create the features, and so we're losing a lot of information there. I think @keveman tried to explain this to me earlier but I failed to get it. :)
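A quick numeric illustration of that reasoning (an editor's example, not from the thread): dropping the input from 16 bits of precision to 8 bits discards roughly half the dynamic range in dB terms.

```python
# Rough illustration: quantization error on a random "audio" signal at 16 vs. 8 bits.
import numpy as np

rng = np.random.default_rng(0)
audio = rng.uniform(-1.0, 1.0, 16000).astype(np.float32)  # placeholder signal in [-1, 1]

def quantize(x, bits):
    levels = 2 ** (bits - 1) - 1
    return np.round(x * levels) / levels

for bits in (16, 8):
    err = audio - quantize(audio, bits)
    snr_db = 10 * np.log10(np.mean(audio ** 2) / np.mean(err ** 2))
    print(f"{bits}-bit: ~{snr_db:.1f} dB SNR")  # roughly 96 dB vs. 48 dB
```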
I've fixed Shrink Ray to handle subgraphs (which was the issue with the merged decoder) and I have some new files up at: https://drive.google.com/file/d/1Nrtu8oO-j2VvBIhXjl3m2wEbfJug6_fC/view?usp=sharing

I see WERs of 4.71% for tiny and 3.32% for base. The uncompressed file sizes are 35MB and 83MB, gzip gives 27MB and 63MB, and brotli gets to 20MB and 61MB. There's probably still some work to do replacing the DequantizeLinear ops (which are very slow) with a Cast and the appropriate Mul and Add, but I'll look at that once I have some latency numbers to see what difference that might make.

For posterity, the commands to create these from the original float models provided are:

```
python .\shrink.py --output_suffix ".onnx" --output_dir xenova\tiny\int_activations_matmul_mul --method "integer_activations" --op_types_to_quantize "MatMul,Mul" xenova\tiny\original
python .\shrink.py --method "integer_weights" --output_suffix ".onnx" --output_dir xenova\tiny\int_activations_matmul_mul_shrunk\ xenova\tiny\int_activations_matmul_mul
```
I've finally got back to this and have fixed some issues with the weight quantization in Shrink Ray (replaced the slow DequantizeLinear op with equivalent casts, muls, and adds; removed duplicate initializers), and I now have working tiny and base models available. All of the initial feature-generation convs are left as float32, but the rest of the models are quantized dynamically. The accuracy holds up (4.75% WER on LibriSpeech clean for tiny, 3.30% WER for base) and the resulting models are small. Tiny is 26MB uncompressed, and 20MB compressed with either gzip or brotli. Base is 59MB uncompressed, 45MB for gzip, and 43MB for brotli.

For posterity, here are the commands I used to generate these:

```
python shrink.py --output_suffix ".onnx" --output_dir int_activations_matmul_mul --method "integer_activations" --nodes_to_exclude "/conv1/Conv,/conv2/Conv,/conv3/Conv" original
python shrink.py --method "integer_weights" --output_suffix ".onnx" --output_dir int_activations_matmul_mul_shrunk int_activations_matmul_mul
```

Now that we have accurate quantized models that are as small as possible, I'll work with @keveman on getting this change landed. I'd be interested in seeing the q4 results too, if you can share the commands you used to generate those, @xenova?
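As a rough illustration of the DequantizeLinear replacement described above (an editor's numpy sketch, not Shrink Ray's code): DequantizeLinear computes (q - zero_point) * scale, which can be rewritten as a Cast followed by a Mul and an Add with a precomputed offset.

```python
# Numpy sketch of why DequantizeLinear == Cast + Mul + Add (placeholder values).
import numpy as np

q = np.array([-128, -1, 0, 1, 127], dtype=np.int8)  # placeholder quantized weights
scale, zero_point = np.float32(0.02), np.int8(3)    # placeholder quantization params

dequantize_linear = (q.astype(np.float32) - np.float32(zero_point)) * scale
cast_mul_add = q.astype(np.float32) * scale + (np.float32(-zero_point) * scale)

assert np.allclose(dequantize_linear, cast_mul_add)
```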
Wow, this is phenomenal work 🤯 Great stuff! Here's how to make the q4 quantizations:

```python
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer
from optimum.onnx.graph_transformations import check_and_save_model

model = onnx.load('./path/to/model.onnx')
save_path = './path/to/model_q4.onnx'

quantizer = MatMul4BitsQuantizer(
    model=model,
    # Optional:
    # block_size=block_size,
    # is_symmetric=is_symmetric,
    # accuracy_level=accuracy_level,
)
quantizer.process()
check_and_save_model(quantizer.model.model, save_path)
```

Would you like to upload those weights to a repo on the HF hub? Ideally structured like https://huggingface.co/onnx-community/moonshine-base-ONNX. I can help if you'd like.
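If it helps, a minimal sketch of pushing a folder of ONNX files to the Hub with huggingface_hub (the repo id and local folder are placeholders, not an agreed destination):

```python
# Hypothetical sketch: upload a local folder of ONNX files to a Hugging Face repo.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("UsefulSensors/moonshine-tiny-ONNX", repo_type="model", exist_ok=True)  # placeholder repo id
api.upload_folder(
    repo_id="UsefulSensors/moonshine-tiny-ONNX",                  # placeholder repo id
    folder_path="xenova/tiny/int_activations_matmul_mul_shrunk",  # placeholder local path
    repo_type="model",
)
```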
@petewarden On a similar note, I think a model like https://huggingface.co/onnx-community/Kokoro-82M-ONNX could greatly benefit from onnx shrink ray. 👀
It's taken me a while to get back to this, mostly because I've been trying to hack
Here's the Hugging Face PR for adding the files: https://huggingface.co/UsefulSensors/moonshine/discussions/6

Once that's in, I'll try to shepherd this PR in as well.
I think this is ready to be committed now; I've asked @keveman for a final check.
Code looks good, but it looks like it might fail the pre-submit checks, as I see a missing newline. @evmaki can you please take a look, modify if needed for style checks, and merge it?
Applied formatting and manually rebased and merged with main.
This PR improves the moonshine-onnx package in the following ways:
- Significant reductions in model size / downloads for both tiny and base models, by ensuring tied weights are not duplicated and by merging the decoders (with and without past key values) into a single model, without any loss in precision.
- New quantizations (including 4-bit and 8-bit), further reducing the size of the models with minimal differences in output. Note that the q4 quantizations only target MatMul ops, which is why their size is larger than the q8 quantizations.
- Improved loading and execution times, as benchmarked with the benchmarking code above. Note that these benchmarks do not include download time, only loading time (i.e., the models were already downloaded).
tiny:

base:
