VM-ASR: A Lightweight Dual-Stream U-Net Model for Efficient Audio Super-Resolution
The official PyTorch implementation of VM-ASR, a model designed for high-fidelity audio super-resolution. Paper (early access), demo.
VM-ASR is a novel lightweight audio super-resolution (ASR) model that leverages the Visual State Space (VSS) block of VMamba to effectively capture global and local contextual information within audio spectrograms. This enables VM-ASR to model harmonic relationships more accurately, improving audio quality. Our experiments on the VCTK dataset demonstrate that VM-ASR consistently outperforms state-of-the-art methods in spectral reconstruction across various input-output sample rate pairs, achieving significantly lower Log-Spectral Distance (LSD) while maintaining a smaller model size (3.01M parameters) and lower computational complexity (2.98 GFLOPS).
- Python 3.10+
- Conda 23.5.0
- CUDA 12.0 / 11.7 (optional)
- PyTorch 2.3.0 (CUDA 12.0) / 2.0.1 (CUDA 11.7)
- torchaudio 2.3.0 (CUDA 12.0) / 2.0.2 (CUDA 11.7)
- NVIDIA GPU
Note: During development, the model was trained locally on a single NVIDIA GeForce RTX 4060 Ti GPU with 16GB of memory. All released weights and experiment results were trained and evaluated on a single NVIDIA Tesla V100 GPU with 32GB of memory provided by the National Center for High-performance Computing (NCHC) of National Applied Research Laboratories (NARLabs) in Taiwan. Multi-GPU training is not implemented in this repository.
```shell
# 1. Clone the repository
$ git clone https://github.com/ghnmqdtg/VM-ASR
# 2. Navigate to the project directory
$ cd VM-ASR
# 3. Create the conda env
$ conda create --name vm-asr python=3.10
# 4. Activate the env
$ conda activate vm-asr
# 5A. Install PyTorch 2.3.0 (CUDA 12.0)
$ conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia
# 5B. Install PyTorch 2.0.1 (CUDA 11.7)
$ conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
# 6. Install the Python packages
$ pip install -r requirements.txt
# 7. Install the selective scan kernel (source: https://github.com/MzeroMiko/VMamba)
$ cd kernels/selective_scan && pip install . && cd ../..
```
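After installation, you can quickly sanity-check the PyTorch/CUDA setup with an illustrative one-liner (not part of the original scripts):

```shell
# Verify that PyTorch, torchaudio, and the GPU are visible (illustrative check)
$ python -c "import torch, torchaudio; print(torch.__version__, torchaudio.__version__, torch.cuda.is_available())"
```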
The training and evaluation commands are provided in the script files. You can run them using the following commands:

```shell
# Training
# 16kHz target sample rate
$ sh slurm_train_16kHz.sh
# 48kHz target sample rate
$ sh slurm_train_48kHz.sh

# Evaluation
# 16kHz target sample rate
$ sh slurm_test_16kHz.sh
# 48kHz target sample rate
$ sh slurm_test_48kHz.sh
```
Please check the release page for the pretrained models. To use them, download and decompress them into the `./logs/DualStreamInteractiveMambaUNet` directory, for example `logs/DualStreamInteractiveMambaUNet/16k_2k_FullData_MPD`. The pretrained models are trained on the VCTK dataset with the following configurations:
| Target SR (kHz) | Input SR (kHz) | Model ID | Description |
|---|---|---|---|
| **Specialized Models** | | | |
| 16 | 2 | `16k_2k_FullData_MPD` | 2 to 16kHz |
| 16 | 4 | `16k_4k_FullData_MPD` | 4 to 16kHz |
| 16 | 8 | `16k_8k_FullData_MPD` | 8 to 16kHz |
| 16 | 12 | `16k_12k_FullData_MPD` | 12 to 16kHz |
| 48 | 8 | `48k_8k_FullData_MPD` | 8 to 48kHz |
| 48 | 12 | `48k_12k_FullData_MPD` | 12 to 48kHz |
| 48 | 16 | `48k_16k_FullData_MPD` | 16 to 48kHz |
| 48 | 24 | `48k_24k_FullData_MPD` | 24 to 48kHz |
| **Versatile Models** | | | |
| 16 | 2~16 | `16k_FullData_MPD` | 2~16 to 16kHz |
| 48 | 8~48 | `48k_FullData_MPD` | 8~48 to 48kHz |
| **Ablation Study** | | | |
| 48 | 16 | `48k_16k_FullData` | GAN (X) & Post Processing (O) |
| 48 | 16 | `48k_16k_FullData_MPD_woPost` | GAN (O) & Post Processing (X) |
| 48 | 16 | `48k_16k_FullData_woPost` | GAN (X) & Post Processing (X) |
| 48 | 16 | `48k_16k_FullData_MPD_M2P` | Interactions: Magnitude to Phase |
| 48 | 16 | `48k_16k_FullData_MPD_P2M` | Interactions: Phase to Magnitude |
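For reference, here is a minimal sketch of evaluating one of these checkpoints with the `--resume`/`--eval` options listed in the FAQ below. The config and checkpoint filenames are illustrative placeholders; match them to the experiment you downloaded.

```shell
# Illustrative only: evaluate a downloaded checkpoint
# (replace <experiment> and <checkpoint> with the names matching your download)
$ python main.py \
    --cfg config/<experiment>.yaml \
    --resume logs/DualStreamInteractiveMambaUNet/48k_16k_FullData_MPD/<checkpoint>.pth \
    --eval
```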
- **Can I run the training commands directly?**

  Yes, of course. You can run the training commands directly without using the script files (see the standalone example after this FAQ):

  ```
  Usage: python main.py [OPTIONS]

  Options:
    --cfg FILE          Path to config file [required]
    --batch-size INT    Batch size for a single GPU
    --input_sr INT      The input sample rate (if set, the random resample will be disabled)
    --resume PATH       Path to checkpoint for models to resume training or run evaluation
    --disable_amp       Disable PyTorch automatic mixed precision
    --output PATH       Root of output folder; the full path is <output>/<model_name>/<tag>
    --tag TAG           Tag of experiment [default: current timestamp]
    --eval              Perform evaluation only
  ```
- **Why do I get an error message like `..._16kHz.sh: 2: : not found`?**

  The end-of-line sequence in the script files should be LF instead of CRLF. If you encounter this error message, please check the line endings of the script files first (for example, convert them with `dos2unix` or your editor).
- **Which model is trained by default?**

  The script files train and evaluate the versatile models (VM-ASR*) by default. If you want to train the specialized models (VM-ASR), uncomment the corresponding lines in the script files. The commands are described in the script files.
- **Waaait a minute, where is the dataset?**

  After you run the training script, the system checks whether the dataset exists. If it is not found, it downloads, decompresses, and converts it to WAV files automatically; if it is found, these steps are skipped. Please note that the dataset is large (about 35GB including the zip, FLAC, and WAV files), so the process may take some time.
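As referenced in the first FAQ item, here is a minimal sketch of launching training directly. The config filename is an illustrative placeholder; check `./config` and the `*.sh` scripts for the exact arguments used in the experiments.

```shell
# Illustrative only: direct invocation without the wrapper scripts
$ python main.py --cfg config/<experiment>.yaml --batch-size 16 --tag my_run
```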
The default configuration is provided in the `./config.py` file. We use it as the base configuration and override it with a `.yaml` configuration file for each experiment. These YAML files are located in `./config`.

Some frequently used hyperparameters you may want to adjust in the YAML files (see the sketch after this list for an example override):
- `DATA.BATCH_SIZE`

  The batch sizes in the configs are set for a GPU with 32GB of memory. If your GPU has less memory, you may need to reduce the batch size accordingly.

- `DATA.USE_QUANTITY`

  This controls how much of the data is used for training and evaluation. The default value is 1.0, which means using all the data. You can set it to a smaller value such as 0.1 when debugging new features.

- `TRAIN.ADVERSARIAL.ENABLE`

  This controls whether adversarial training is enabled. The default value is `False`; set it to `True` to enable adversarial training.
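A minimal sketch of an override file using the keys above; the nesting is assumed to mirror `./config.py`, so compare with the existing files in `./config` before use:

```yaml
# Illustrative override; key nesting assumed to follow ./config.py
DATA:
  BATCH_SIZE: 8        # smaller batch for GPUs with less than 32GB of memory
  USE_QUANTITY: 0.1    # use 10% of the data while debugging
TRAIN:
  ADVERSARIAL:
    ENABLE: True       # turn on adversarial training
```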
- **Tensorboard**

  Tensorboard logging is enabled. The logs are saved in the `./logs` directory, and you can visualize them using the following command:

  ```shell
  # Launch Tensorboard
  $ tensorboard --logdir logs/
  ```

- **Weights and Biases (WandB)**

  WandB logging is used to sync the training process online and is enabled by default. You need to log in to your WandB account the first time you run the training script. If you don't want to use it, you can disable it in the script files by setting `WANDB.ENABLE` to `False` (see the sketch after this list).
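As mentioned in the WandB item above, here is a minimal sketch of disabling the sync, assuming `WANDB.ENABLE` can be overridden like the other config keys (the repository sets it from the script files, so check those for the exact mechanism):

```yaml
# Illustrative only: disable WandB syncing
WANDB:
  ENABLE: False
```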
```bibtex
@article{zhang2025vmasr,
  author  = {Zhang, Ting-Wei and Ruan, Shanq-Jang},
  year    = {2025},
  month   = {01},
  pages   = {1-12},
  title   = {VM-ASR: A Lightweight Dual-Stream U-Net Model for Efficient Audio Super-Resolution},
  volume  = {PP},
  journal = {IEEE Transactions on Audio, Speech and Language Processing},
  doi     = {10.1109/TASLPRO.2025.3533365}
}
```
If you find any bugs or have any suggestions, please feel free to open an issue or submit a pull request. You can also email us at ghmmqdtg@gmail.com.
We thank the National Center for High-performance Computing (NCHC) of National Applied Research Laboratories (NARLabs) in Taiwan for providing computational and storage resources.