
Korean Streaming Automatic Speech Recognition

Real-time streaming Korean speech-to-text model that can run on a CPU

ASR (Automatic Speech Recognition) in this project is a pipeline with two distinct stages:

  1. Speech Enhancement: The incoming audio signal is processed to reduce noise, improve clarity, and enhance the quality of the speech. Various techniques, such as filtering, spectral subtraction, and deep learning-based methods, may be employed. Deep learning approaches fall into two main categories, waveform-domain and spectrogram-domain processing; we process in the waveform domain.

  2. Speech Recognition: Once the speech signal has been enhanced, it is passed through the speech recognition system. In this stage, the system converts the processed audio into text by identifying and transcribing the spoken words. Modern ASR systems typically rely on advanced machine learning algorithms, such as deep neural networks, to accurately recognize and transcribe the speech.

Together, these two stages enable ASR systems to convert spoken language into text, making them valuable tools in various applications such as voice assistants, transcription services, and more.

We use the denoiser from Facebook Research for speech enhancement and the NVIDIA NeMo framework for the Conformer CTC recognizer.
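A minimal end-to-end sketch of the two stages, assuming the facebookresearch/denoiser package and NVIDIA NeMo are installed; the Conformer CTC checkpoint path below is a hypothetical placeholder:

```python
# Two-stage pipeline sketch: waveform-domain enhancement, then CTC decoding.
import torch
import torchaudio
from denoiser import pretrained
from denoiser.dsp import convert_audio
import nemo.collections.asr as nemo_asr

# Stage 1: speech enhancement with the pretrained dns48 denoiser
denoise_model = pretrained.dns48()
wav, sr = torchaudio.load("noisy.wav")
wav = convert_audio(wav, sr, denoise_model.sample_rate, denoise_model.chin)
with torch.no_grad():
    clean = denoise_model(wav[None])[0]
torchaudio.save("clean.wav", clean, denoise_model.sample_rate)

# Stage 2: transcription with a Conformer CTC model
# ("korean_conformer_ctc.nemo" is a placeholder checkpoint name)
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from("korean_conformer_ctc.nemo")
print(asr_model.transcribe(["clean.wav"])[0])
```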

(Figure: model overview)

Requirements

Please refer to pip.txt for the list of required dependencies.
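Assuming pip.txt is a standard requirements file, the dependencies can be installed with:

pip install -r pip.txt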

Clone

git clone https://github.com/SUNGBEOMCHOI/Korean-Streaming-ASR.git
cd Korean-Streaming-ASR

Run

File mode

python audio_stream.py --audio_path "./audio_example/0001.wav" --device cpu

Microphone mode

python audio_stream.py --mode microphone --device cpu

Web

flask run
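With Flask's defaults, this serves the web demo at http://127.0.0.1:5000.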

Example

Raw Wave (input)

noise_bigmac.mp4

Clean Wave (enhanced by denoiser)

enhanced_bigmac.mp4

Text (output)

Streaming GIF


Datasets

We collected data from AI Hub, a Korean public AI-data platform.

Stage 1 Speech Enhancement

We initialize the denoiser from the dns48 checkpoint (H = 48, trained on the DNS dataset, 18,867,937 parameters) and let the enhancement module blend the unprocessed (dry) input $x$ with the enhanced output $\hat{y}$ as $\text{dry} \cdot x + (1-\text{dry}) \cdot \hat{y}$. We also apply an STFT loss when training the speech enhancement model. We train on the cafe/restaurant noise (카페, 음식점 소음) and market/shopping-mall noise (시장, 쇼핑몰 소음) subsets of the noisy-environment speech recognition dataset (소음환경음성인식데이터).
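A minimal sketch of the dry/wet blend and an STFT loss of the kind described above; the actual training code uses the multi-resolution variant from facebookresearch/denoiser, and the FFT/hop sizes here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def stft_mag(x, n_fft=1024, hop=256):
    # magnitude spectrogram of a 1-D waveform
    window = torch.hann_window(n_fft)
    return torch.stft(x, n_fft, hop, n_fft, window, return_complex=True).abs()

def stft_loss(y_hat, y):
    S_hat, S = stft_mag(y_hat), stft_mag(y)
    sc = torch.norm(S - S_hat, p="fro") / torch.norm(S, p="fro")   # spectral convergence
    mag = F.l1_loss(torch.log(S_hat + 1e-7), torch.log(S + 1e-7))  # log-magnitude L1
    return sc + mag

def mix_dry(x, y_hat, dry=0.04):
    # dry = 0 returns the fully enhanced signal, dry = 1 the raw input
    return dry * x + (1 - dry) * y_hat
```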

Stage 2 Speech to Text

| Name | # of Samples (train/test) |
| --- | --- |
| 고객응대음성 (customer service speech) | 2,067,668 / 21,092 |
| 한국어 음성 (Korean speech) | 620,000 / 3,000 |
| 한국인 대화 음성 (Korean conversation speech) | 2,483,570 / 142,399 |
| 자유대화음성(일반남녀) (free conversation, general adults) | 1,886,882 / 263,371 |
| 복지 분야 콜센터 상담데이터 (welfare call-center counseling data) | 1,096,704 / 206,470 |
| 차량내 대화 데이터 (in-vehicle conversation data) | 2,624,132 / 332,787 |
| 명령어 음성(노인남여) (command speech, elderly) | 137,467 / 237,469 |
| Total | 10,916,423 (13,946 hours) / 1,206,588 (1,474 hours) |

For more information, see KO STT on Hugging Face.


References

@inproceedings{defossez2020real,
  title={Real Time Speech Enhancement in the Waveform Domain},
  author={Defossez, Alexandre and Synnaeve, Gabriel and Adi, Yossi},
  booktitle={Interspeech},
  year={2020}
}
