Skip to content

ysharma3501/LavaSR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 

Repository files navigation

🌋 LavaSR

Hugging Face Model   Hugging Face Space   Colab Notebook

LavaSR is a lightweight and high quality speech enhancement model that enhances low quality audio with noise into clean crisp audio with speeds reaching roughly 5000x realtime on GPU and over 60x realtime on CPU.

LavaSR v2 just released: Massive increase in quality and speed, surpassing 6gb slow diffusion models. Check it out!

lavasr_v2_cropped.mp4

Main features

  • Extremely fast: Reaches speeds over 5000x realtime on GPUs and 50x realtime on CPUs
  • High quality: Quality surpasses diffusion models.
  • Efficency: Just uses 500mb vram and potentially less.
  • Universal input: Supports any input sampling rate from 8khz to 48khz.

Why is this useful?

  • Enhancing TTS: LavaSR can enhance TTS(text-to-speech) model quality considerably with nearly 0 computational cost.
  • Real-time enhancement: LavaSR allows for on device enhancement of any low quality calls, audio, etc. while using little memory.
  • Restoring datasets: LavaSR can enhance audio quality of any audio dataset.

Comparisons

Quality comparisons using Log-Spectral-distance on VCTK validation. Lower is better(more similar to original 48khz file).

Method 8→48 kHz 16→48 kHz 24→48 kHz
Sinc upsampling 2.98 2.75 2.17
AudioSR (diffusion) 1.13 0.98 0.82
NU-WAVE2(diffusion) 1.10 0.94 0.87
AP-BWE(previous best) 0.86 0.74 0.64
Proposed model 0.85 0.72 0.63

Speed Comparisons were done on A100 gpu. Higher realtime means faster processing speeds.

Model Speed (Real-Time) Model Size
LavaSR 5000x realtime ~50 MB
AP-BWE 300x realtime ~70 MB
FlowHigh 80x realtime ~450 MB
FlashSR 14x realtime ~1000 MB
AudioSR 0.6x realtime ~6000 MB

Usage

You can try it locally, colab, or spaces.

Open In Colab Open in Spaces

Simple 1 line installation:

uv pip install git+https://github.com/ysharma3501/LavaSR.git

Load model:

from LavaSR.model import LavaEnhance2 

## change device to your torch device type(cuda, mps, etc.)
device = 'cpu'
lava_model = LavaEnhance2("YatharthS/LavaSR", device)

Simple inference

import soundfile as sf
from IPython.display import Audio

input_audio, input_sr = lava_model.load_audio('input.wav')

## Enhance Audio
output_audio = lava_model.enhance(input_audio).cpu().numpy().squeeze()

## Save Audio(both input and output)
sf.write('input.wav', input_audio.cpu().numpy().squeeze(), 16000)
sf.write('output.wav', output_audio, 48000)

Advanced inference

import soundfile as sf
from IPython.display import Audio

cutoff = None ## Default is roughly half your sampling rate. You can lower it for higher quality but might sound "metallic".
input_sr = 16000 ## Change to any sr you want(from 8khz-48khz).
denoise = False ## Change this to True only if your audio has noise you want to filter.
batch = False ## Change this to True if audio is very long.

## Load Audio
input_audio, input_sr = lava_model.load_audio('input.wav', input_sr=input_sr)

## Enhance Audio
output_audio = lava_model.enhance(input_audio, denoise=denoise, batch=batch).cpu().numpy().squeeze()

## Save Audio(both input and output)
sf.write('input.wav', input_audio.cpu().numpy().squeeze(), 16000)
sf.write('output.wav', output_audio, 48000)

Info

Q: How is this novel?

A: It adapts Vocos based architecture for BWE(bandwidth extension/audio upsampling). We also propose linkwitz-riley inspired refiner to further significantly increase quality.

Q: How is it so fast?

A: Because it uses the Vocos architecture which is isotropic and single pass, it's much faster then time-domain based and diffusion based models.

Q: What is it trained on?

A: It starts off from a Vocos prior and trained with just 50k steps on VCTK dataset. Dataset is randomly resampled to 8khz/16khz/24khz and randomly noise is added.

Roadmap

  • Release model and code
  • Huggingface spaces demo
  • Release model with no metallic issue.
  • Release training code
  • Release model trained on music and audio

Acknowledgments

  • Vocos for their excellant architecture.
  • UL-UNAS for their great denoiser model.

Final Notes

Currently writing an Interspeech paper for LavaSR, receiving feedback from community would be great.

The model and code are licensed under the Apache-2.0 license. See LICENSE for details.

Stars/Likes would be appreciated, thank you.

Email: yatharthsharma3501@gmail.com

About

🌋LavaSR: Fast Speech restoration and enhancement

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages