SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model based Text-To-Speech Synthesis
This repository contains inference scripts for SoCodec, an ultra-low-bitrate speech codec designed for speech language models, introduced in the paper SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model based Text-To-Speech Synthesis.
Paper
📈 Demo Site
⚙ Model Weights
👉 With SoCodec, you can compress speech into discrete codes at an ultra-low bitrate of 0.47 kbps with a 120 ms frame shift.
👌 It can serve as a drop-in replacement for EnCodec or other multi-stream codecs in speech language modeling applications.
📚 The released checkpoint currently supports Chinese only; training of a multi-lingual version is in progress.
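The 0.47 kbps figure follows directly from the codec configuration: each 120 ms frame carries 4 streams drawn from a 16384-entry codebook, i.e. 14 bits per token. A quick sanity check (our own arithmetic, not code from this repo):

```python
import math

CODEBOOK_SIZE = 16384   # entries per codebook -> 14 bits per token
NUM_STREAMS = 4         # tokens emitted per frame
FRAME_SHIFT_S = 0.120   # 120 ms frame shift

bits_per_token = math.log2(CODEBOOK_SIZE)      # 14 bits
bits_per_frame = bits_per_token * NUM_STREAMS  # 56 bits every 120 ms
bitrate_kbps = bits_per_frame / FRAME_SHIFT_S / 1000

print(f"{bitrate_kbps:.2f} kbps")  # 0.47 kbps
```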
- Sep 2024 (v1.0):
- We have released the checkpoint and inference code of SoCodec.
Clone the repository and download the pretrained checkpoints:
```shell
git clone https://github.com/hhguo/SoCodec
cd SoCodec
mkdir ckpts && cd ckpts
wget https://huggingface.co/TencentGameMate/chinese-hubert-large/resolve/main/chinese-hubert-large-fairseq-ckpt.pt
wget https://huggingface.co/hhguo/SoCodec/resolve/main/socodec_16384x4_120ms_16khz_chinese.safetensors
wget https://huggingface.co/hhguo/SoCodec/resolve/main/mel_vocoder_80dim_10ms_16khz.safetensors
```
```shell
# For analysis-synthesis
python example.py -i ground_truth.wav -o synthesis.wav

# For speech analysis
python example.py -i ground_truth.wav -o features.pt

# For token-to-audio synthesis
python example.py -i features.pt -o synthesis.wav
```
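For larger corpora, the same CLI can be driven from Python. This is a minimal sketch of our own (the helpers `build_cmd` and `batch_analyze` are not part of this repo; they simply assemble the `example.py` invocation documented above):

```python
import subprocess
from pathlib import Path

def build_cmd(input_path: str, output_path: str) -> list:
    """Assemble the example.py command line documented above."""
    return ["python", "example.py", "-i", input_path, "-o", output_path]

def batch_analyze(wav_dir: str, out_dir: str) -> None:
    """Extract token features (.pt) for every .wav file in wav_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        out = Path(out_dir) / (wav.stem + ".pt")
        subprocess.run(build_cmd(str(wav), str(out)), check=True)

print(build_cmd("ground_truth.wav", "features.pt"))
```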
We provide the pretrained models on Hugging Face Collections.
| Model Name | Frame Shift | Codebook Size | Number of Streams | Dataset |
| --- | --- | --- | --- | --- |
| socodec_16384x4_120ms_16khz_chinese | 120 ms | 16384 | 4 | WenetSpeech4TTS |
We also provide pretrained vocoders that convert the Mel spectrogram produced by SoCodec into a waveform.
| Model Name | Sampling Rate | Mel Bins | fmax | Upsampling Ratio | Dataset |
| --- | --- | --- | --- | --- | --- |
| mel_vocoder_80dim_10ms_16khz | 16 kHz | 80 | 8000 Hz | 160 | WenetSpeech4TTS |
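The 10 ms hop in the vocoder name is implied by the table: an upsampling ratio of 160 at a 16 kHz sampling rate means each Mel frame is expanded into 160 waveform samples. A quick check (our own arithmetic):

```python
SAMPLING_RATE_HZ = 16000
UPSAMPLING_RATIO = 160   # waveform samples generated per Mel frame

frame_shift_ms = UPSAMPLING_RATIO / SAMPLING_RATE_HZ * 1000
print(f"{frame_shift_ms:.0f} ms")  # 10 ms
```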
- Provide the checkpoint and inference code of the multi-stream LLM
- Provide a single-codebook version
- Provide a higher-quality neural vocoder
- Provide a multi-lingual version (Chinese, English, etc.)