Real-time streaming Korean speech-to-text model that can run on a CPU
ASR (Automatic Speech Recognition) is a process that involves two distinct stages:
- Speech Enhancement: In this stage, the incoming audio signal is processed to reduce noise and improve the clarity of the speech. Techniques such as filtering, spectral subtraction, and deep learning-based methods can be used. Deep learning approaches fall into two main groups: waveform-domain processing and spectrogram-domain processing. We process audio in the waveform domain.
- Speech Recognition: Once the speech signal has been enhanced, it is passed to the speech recognition system, which converts the processed audio into text by identifying and transcribing the spoken words. Modern ASR systems typically rely on machine learning models such as deep neural networks to transcribe speech accurately.
Together, these two stages enable ASR systems to convert spoken language into text, making them valuable tools in various applications such as voice assistants, transcription services, and more.
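The two-stage flow described above can be sketched in a few lines. This is only an illustration of the control flow, not the code in this repository: `chunk_stream`, `run_pipeline`, `toy_enhance`, and `toy_recognize` are hypothetical names, and the two toy stages are placeholders (a simple noise gate and a trivial detector) standing in for the denoiser and the Conformer-CTC model.

```python
from typing import Callable, Iterable, Iterator, List

Chunk = List[float]


def chunk_stream(samples: List[float], chunk_size: int) -> Iterator[Chunk]:
    """Yield consecutive fixed-size chunks, as a streaming front end would."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def run_pipeline(
    chunks: Iterable[Chunk],
    enhance: Callable[[Chunk], Chunk],
    recognize: Callable[[Chunk], str],
) -> str:
    """Run stage 1 (enhancement) and stage 2 (recognition) on each chunk."""
    pieces = []
    for chunk in chunks:
        clean = enhance(chunk)      # stage 1: speech enhancement
        text = recognize(clean)     # stage 2: speech recognition
        if text:
            pieces.append(text)
    return " ".join(pieces)


def toy_enhance(chunk: Chunk, threshold: float = 0.1) -> Chunk:
    """Placeholder enhancer: a noise gate that zeroes low-amplitude samples."""
    return [s if abs(s) >= threshold else 0.0 for s in chunk]


def toy_recognize(chunk: Chunk) -> str:
    """Placeholder recognizer: emits a token when the chunk is not silent."""
    return "speech" if any(s != 0.0 for s in chunk) else ""


samples = [0.05, 0.02, 0.5, -0.4, 0.01, 0.0]
print(run_pipeline(chunk_stream(samples, 2), toy_enhance, toy_recognize))
```

A real streaming system would also carry model state from one chunk to the next; the sketch omits that to keep the two stages visible.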
We use the denoiser from @facebook for speech enhancement and the @Nemo framework for the Conformer-CTC speech recognition model.
See pip.txt for the list of required dependencies.
Clone
git clone https://github.com/SUNGBEOMCHOI/Korean-Streaming-ASR.git
cd Korean-Streaming-ASR
File mode
python audio_stream.py --audio_path "./audio_example/0001.wav" --device cpu
Microphone mode
python audio_stream.py --mode microphone --device cpu
Web
flask run
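To transcribe several recordings in file mode, the documented command can be wrapped in a short script. The sketch below is an assumption-laden convenience, not part of the repository: `build_commands` and `transcribe_all` are hypothetical helpers, and they use only the `--audio_path` and `--device` flags shown above. It assumes the script is run from the repository root so that `audio_stream.py` resolves.

```python
import subprocess
from pathlib import Path
from typing import List


def build_commands(audio_dir: str, device: str = "cpu") -> List[List[str]]:
    """Build one file-mode command per .wav file, in sorted order."""
    wav_files = sorted(Path(audio_dir).glob("*.wav"))
    return [
        ["python", "audio_stream.py", "--audio_path", str(wav), "--device", device]
        for wav in wav_files
    ]


def transcribe_all(audio_dir: str, device: str = "cpu") -> None:
    """Run the commands one after another, stopping at the first failure."""
    for command in build_commands(audio_dir, device):
        subprocess.run(command, check=True)


# Example (hypothetical): transcribe_all("./audio_example")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.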
Raw Wave (input)
noise_bigmac.mp4
Clean Wave (enhanced by denoiser)
enhanced_bigmac.mp4
Text (output)
We collected the training and test data from AI Hub.
Stage 1 Speech Enhancement
We initialized the denoiser with the pretrained dns48 model (H = 48, trained on the DNS dataset, 18,867,937 parameters) and let the enhancement module dry output by
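For readers unfamiliar with the term, a "dry" setting usually means mixing some of the original noisy signal back into the enhanced output. The sketch below assumes a linear blend, `dry * noisy + (1 - dry) * enhanced`, which is how the denoiser project's dry option is generally understood to work. The function name `mix_dry_wet` is hypothetical, and the dry values shown are illustrative only, since the value used in this project is not stated above.

```python
from typing import List


def mix_dry_wet(noisy: List[float], enhanced: List[float], dry: float) -> List[float]:
    """Blend the original (dry) signal with the enhanced (wet) signal.

    dry = 0.0 returns the enhanced signal unchanged;
    dry = 1.0 returns the original noisy signal unchanged.
    """
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced must have the same length")
    if not 0.0 <= dry <= 1.0:
        raise ValueError("dry must be between 0 and 1")
    return [dry * n + (1.0 - dry) * e for n, e in zip(noisy, enhanced)]


noisy = [1.0, 0.0]
enhanced = [0.0, 1.0]
print(mix_dry_wet(noisy, enhanced, dry=0.25))  # illustrative dry value, an assumption
```

Keeping a small share of the dry signal can soften artifacts that aggressive enhancement introduces, at the cost of leaving some noise in the output.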
Stage 2 Speech to Text
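| Name | # of Samples (train/test) |
|---|---|
| Customer service speech (고객응대음성) | 2067668/21092 |
| Korean speech (한국어 음성) | 620000/3000 |
| Korean conversational speech (한국인 대화 음성) | 2483570/142399 |
| Free conversation speech, general men and women (자유대화음성(일반남녀)) | 1886882/263371 |
| Welfare-sector call center consultation data (복지 분야 콜센터 상담데이터) | 1096704/206470 |
| In-vehicle conversation data (차량내 대화 데이터) | 2624132/332787 |
| Command speech, elderly men and women (명령어 음성(노인남여)) | 137467/237469 |
| Total | 10916423 (13,946 hours) / 1206588 (1,474 hours) |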
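As a quick sanity check on the table, the per-dataset counts can be summed and compared against the Total row, and the stated hours can be turned into a mean utterance length. The numbers are copied from the table; the helper name `mean_seconds` is introduced here only for illustration.

```python
# (train, test) sample counts, in the same order as the table rows
rows = [
    (2067668, 21092),
    (620000, 3000),
    (2483570, 142399),
    (1886882, 263371),
    (1096704, 206470),
    (2624132, 332787),
    (137467, 237469),
]

train_total = sum(train for train, _ in rows)
test_total = sum(test for _, test in rows)

TRAIN_HOURS = 13946  # from the Total row
TEST_HOURS = 1474    # from the Total row


def mean_seconds(hours: float, samples: int) -> float:
    """Average utterance length in seconds."""
    return hours * 3600 / samples


print(train_total, test_total)
print(round(mean_seconds(TRAIN_HOURS, train_total), 2))
print(round(mean_seconds(TEST_HOURS, test_total), 2))
```

The sums match the Total row, and the averages come out at roughly 4.6 seconds per training utterance and 4.4 seconds per test utterance.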
For more information, see KO STT on Hugging Face.
@inproceedings{defossez2020real,
title={Real Time Speech Enhancement in the Waveform Domain},
author={Defossez, Alexandre and Synnaeve, Gabriel and Adi, Yossi},
booktitle={Interspeech},
year={2020}
}
