LavaSR is a lightweight, high-quality speech enhancement model that turns noisy, low-quality audio into clean, crisp audio, reaching roughly 5000x realtime on GPU and over 60x realtime on CPU.
LavaSR v2 just released: a massive increase in quality and speed, surpassing slow 6 GB diffusion models. Check it out!
lavasr_v2_cropped.mp4
- Extremely fast: reaches speeds over 5000x realtime on GPUs and 50x realtime on CPUs.
- High quality: surpasses diffusion models in quality.
- Efficient: uses just ~500 MB of VRAM, potentially less.
- Universal input: supports any input sampling rate from 8 kHz to 48 kHz.
- Enhancing TTS: LavaSR can considerably enhance the output quality of TTS (text-to-speech) models at almost no computational cost.
- Real-time enhancement: LavaSR allows on-device enhancement of low-quality calls, recordings, etc., while using little memory.
- Restoring datasets: LavaSR can enhance the audio quality of any audio dataset.
Quality comparisons use log-spectral distance (LSD) on the VCTK validation set. Lower is better (more similar to the original 48 kHz file).
| Method | 8→48 kHz | 16→48 kHz | 24→48 kHz |
|---|---|---|---|
| Sinc upsampling | 2.98 | 2.75 | 2.17 |
| AudioSR (diffusion) | 1.13 | 0.98 | 0.82 |
| NU-WAVE2 (diffusion) | 1.10 | 0.94 | 0.87 |
| AP-BWE (previous best) | 0.86 | 0.74 | 0.64 |
| Proposed model | 0.85 | 0.72 | 0.63 |
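As a reference for the metric used above, here is a hedged sketch of one common LSD formulation in NumPy (log power spectra, RMS over frequency bins, mean over frames). The STFT settings (`n_fft`, `hop`) and log base are assumptions, not necessarily those used for the table.

```python
import numpy as np

def log_spectral_distance(ref, est, n_fft=2048, hop=512, eps=1e-8):
    """Frame-wise log-spectral distance between two aligned waveforms.

    Lower is better; identical magnitude spectra give 0.
    """
    def stft_mag(x):
        # Simple framed FFT magnitude (no window, for brevity).
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        return np.abs(np.fft.rfft(frames, axis=-1))

    ref_mag, est_mag = stft_mag(ref), stft_mag(est)
    n = min(len(ref_mag), len(est_mag))
    diff = np.log10(ref_mag[:n] ** 2 + eps) - np.log10(est_mag[:n] ** 2 + eps)
    # RMS over frequency bins, then mean over frames.
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=-1))))
```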
Speed comparisons were done on an A100 GPU. A higher realtime factor means faster processing.
| Model | Speed (realtime factor) | Model size |
|---|---|---|
| LavaSR | 5000x realtime | ~50 MB |
| AP-BWE | 300x realtime | ~70 MB |
| FlowHigh | 80x realtime | ~450 MB |
| FlashSR | 14x realtime | ~1000 MB |
| AudioSR | 0.6x realtime | ~6000 MB |
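For reproducing numbers like these, the realtime factor is simply audio duration divided by wall-clock processing time. A minimal measurement sketch; `enhance_fn` is a placeholder callable, not LavaSR's API, and on GPU you would also synchronize (e.g. `torch.cuda.synchronize()`) before reading the clock:

```python
import time
import numpy as np

def realtime_factor(enhance_fn, audio, sr, warmup=1, runs=3):
    """Realtime factor = audio duration / average processing time.

    A factor of 5000x means one second of audio is processed in 0.2 ms.
    """
    for _ in range(warmup):
        enhance_fn(audio)  # warm caches / lazy initialization before timing
    start = time.perf_counter()
    for _ in range(runs):
        enhance_fn(audio)
    elapsed = (time.perf_counter() - start) / runs
    return (len(audio) / sr) / elapsed
```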
You can try it locally, on Colab, or on Spaces.
```
uv pip install git+https://github.com/ysharma3501/LavaSR.git
```
```python
from LavaSR.model import LavaEnhance2

## change device to your torch device type (cuda, mps, etc.)
device = 'cpu'
lava_model = LavaEnhance2("YatharthS/LavaSR", device)
```

```python
import soundfile as sf
from IPython.display import Audio

## Load audio
input_audio, input_sr = lava_model.load_audio('input.wav')

## Enhance audio
output_audio = lava_model.enhance(input_audio).cpu().numpy().squeeze()

## Save audio (both input and output)
sf.write('input.wav', input_audio.cpu().numpy().squeeze(), 16000)
sf.write('output.wav', output_audio, 48000)
```

```python
import soundfile as sf
from IPython.display import Audio

cutoff = None  ## Default is roughly half your sampling rate; lowering it can raise quality but may sound "metallic".
input_sr = 16000  ## Change to any sampling rate you want (8 kHz-48 kHz).
denoise = False  ## Set to True only if your audio has noise you want to filter out.
batch = False  ## Set to True if the audio is very long.

## Load audio
input_audio, input_sr = lava_model.load_audio('input.wav', input_sr=input_sr)

## Enhance audio
output_audio = lava_model.enhance(input_audio, denoise=denoise, batch=batch).cpu().numpy().squeeze()

## Save audio (both input and output)
sf.write('input.wav', input_audio.cpu().numpy().squeeze(), 16000)
sf.write('output.wav', output_audio, 48000)
```

Q: How is this novel?
A: It adapts a Vocos-based architecture for BWE (bandwidth extension / audio upsampling). We also propose a Linkwitz-Riley-inspired refiner that further increases quality significantly.
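For intuition on the Linkwitz-Riley family: an LR4 crossover is two cascaded 2nd-order Butterworth filters, and its low and high bands sum to an allpass (flat-magnitude) response, which makes it attractive for splicing a synthesized high band onto an original low band without a dip or bump at the crossover frequency. A sketch using SciPy; this illustrates the filter family only, not LavaSR's actual refiner:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def lr4_crossover(x, fc, sr):
    """Split x into low/high bands with a 4th-order Linkwitz-Riley crossover.

    Each band is built by applying a 2nd-order Butterworth filter twice;
    both bands are -6 dB at fc, and their sum has flat magnitude.
    """
    sos_lo = butter(2, fc, btype='low', fs=sr, output='sos')
    low = sosfilt(sos_lo, sosfilt(sos_lo, x))    # cascade twice -> LR4 low-pass
    sos_hi = butter(2, fc, btype='high', fs=sr, output='sos')
    high = sosfilt(sos_hi, sosfilt(sos_hi, x))   # cascade twice -> LR4 high-pass
    return low, high
```

Because the band sum is allpass, summing `low` and `high` preserves the signal's energy and magnitude spectrum, altering only its phase.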
Q: How is it so fast?
A: Because it uses the Vocos architecture, which is isotropic and single-pass, it is much faster than time-domain and diffusion-based models.
Q: What is it trained on?
A: It starts from a Vocos prior and is trained for just 50k steps on the VCTK dataset. The dataset is randomly resampled to 8/16/24 kHz, and noise is randomly added.
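The degradation pipeline described above can be sketched as follows; the SNR range and the down/up-sampling implementation are assumptions, not the exact training recipe:

```python
import numpy as np
from scipy.signal import resample_poly

def degrade(clean, sr=48000, rng=np.random.default_rng()):
    """Simulate a low-quality training input: randomly band-limit to
    8/16/24 kHz, then add white noise at a random SNR (assumed 10-30 dB)."""
    low_sr = int(rng.choice([8000, 16000, 24000]))
    # Downsample, then upsample back to sr, discarding high-band content.
    down = resample_poly(clean, low_sr, sr)
    degraded = resample_poly(down, sr, low_sr)
    # Scale white noise to hit the sampled SNR against the degraded signal.
    snr_db = rng.uniform(10, 30)
    noise = rng.standard_normal(len(degraded))
    scale = np.sqrt(np.mean(degraded ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return degraded + scale * noise
```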
- Release model and code
- Hugging Face Spaces demo
- Release a model without the metallic artifact
- Release training code
- Release a model trained on music and general audio
I am currently writing an Interspeech paper on LavaSR; feedback from the community would be greatly appreciated.
The model and code are licensed under the Apache-2.0 license. See LICENSE for details.
Stars/Likes would be appreciated, thank you.
Email: yatharthsharma3501@gmail.com