Korean Streaming Automatic Speech Recognition

Real-time streaming Korean speech-to-text model that can run on a CPU

In this project, ASR (Automatic Speech Recognition) is performed in two distinct stages:

Speech Enhancement: In this stage, the incoming speech signal is processed to reduce noise and improve clarity. Techniques such as filtering, spectral subtraction, and deep learning-based methods can be used for this. Deep learning approaches fall into two main groups: waveform domain processing and spectrogram domain processing. This project processes audio in the waveform domain.
Speech Recognition: Once the speech signal has been enhanced, it is passed through the speech recognition system. In this stage, the system converts the processed audio into text by identifying and transcribing the spoken words. Modern ASR systems typically rely on advanced machine learning algorithms, such as deep neural networks, to accurately recognize and transcribe the speech.

Together, these two stages enable ASR systems to convert spoken language into text, making them valuable tools in various applications such as voice assistants, transcription services, and more.
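The two stages above can be sketched as a simple chunked loop. This is only an illustration, not the code in audio_stream.py: the names `iter_chunks`, `stream_transcribe`, `identity_enhance`, and `length_recognize` are hypothetical, the chunk size of 16000 samples is an assumption, and the two stages are stand-in functions so the sketch runs without model weights.

```python
from typing import Callable, Iterator, List

import numpy as np


def iter_chunks(wave: np.ndarray, chunk_size: int) -> Iterator[np.ndarray]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(wave), chunk_size):
        yield wave[start:start + chunk_size]


def stream_transcribe(
    wave: np.ndarray,
    enhance: Callable[[np.ndarray], np.ndarray],
    recognize: Callable[[np.ndarray], str],
    chunk_size: int = 16000,  # assumed value, not taken from the repository
) -> List[str]:
    """Run stage 1 (enhancement), then stage 2 (recognition), on each chunk."""
    texts = []
    for chunk in iter_chunks(wave, chunk_size):
        clean = enhance(chunk)          # stage 1: waveform in, waveform out
        texts.append(recognize(clean))  # stage 2: waveform in, text out
    return texts


# Stand-in stages (hypothetical) so the sketch runs on its own.
def identity_enhance(chunk: np.ndarray) -> np.ndarray:
    return chunk


def length_recognize(chunk: np.ndarray) -> str:
    return f"<{len(chunk)} samples>"


wave = np.zeros(40000, dtype=np.float32)
print(stream_transcribe(wave, identity_enhance, length_recognize))
# ['<16000 samples>', '<16000 samples>', '<8000 samples>']
```

In a real setup the stand-ins would be replaced by the denoiser and the Conformer-CTC model, and a streaming recognizer would typically carry context across chunks rather than decoding each chunk independently.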

We use the denoiser from @facebook for speech enhancement and the @NeMo framework for the Conformer-CTC speech recognition model.

Requirements

Please refer to pip.txt for the list of required dependencies.

Clone

git clone https://github.com/SUNGBEOMCHOI/Korean-Streaming-ASR.git
cd Korean-Streaming-ASR

Run

File mode

python audio_stream.py --audio_path "./audio_example/0001.wav" --device cpu

Microphone mode

python audio_stream.py --mode microphone --device cpu

Web

flask run

Example

Raw Wave (input)

noise_bigmac.mp4

Clean Wave (enhanced by denoiser)

enhanced_bigmac.mp4

Text (output)

Datasets

We collected the training data from AI Hub.

Stage 1 Speech Enhancement

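We initialized the denoiser from dns48 (H = 48, trained on the DNS dataset, 18,867,937 parameters). The enhancement module mixes the noisy input $x$ with the denoised estimate $\hat y$ using a dry coefficient, so its output is $\text{dry} \cdot x + (1-\text{dry}) \cdot \hat y$. We also apply an STFT loss when training the speech enhancement model. We trained the model on the "Cafe, Restaurant Noise" (카페,음식점 소음) and "Market, Shopping Mall Noise" (시장, 쇼핑몰 소음) subsets of the Noisy Environment Speech Recognition Data (소음환경음성인식데이터).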
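The dry mixing formula above can be written in a few lines of NumPy. This is a minimal sketch of the formula only, not the denoiser's own code: the function name `dry_wet_mix` is hypothetical, and the sample values are made up for illustration.

```python
import numpy as np


def dry_wet_mix(noisy: np.ndarray, enhanced: np.ndarray, dry: float) -> np.ndarray:
    """Return dry * noisy + (1 - dry) * enhanced."""
    if not 0.0 <= dry <= 1.0:
        raise ValueError("dry must be in [0, 1]")
    return dry * noisy + (1.0 - dry) * enhanced


x = np.array([1.0, -1.0, 0.5])      # noisy input x (made-up values)
y_hat = np.array([0.8, -0.6, 0.1])  # denoised estimate (made-up values)

print(dry_wet_mix(x, y_hat, dry=0.0))  # only the denoised signal
print(dry_wet_mix(x, y_hat, dry=1.0))  # only the original input
```

With dry = 0 the output is the denoised signal alone, and with dry = 1 it is the untouched input; values in between keep some of the original signal in the output.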

Stage 2 Speech to Text

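Dataset	Train samples	Test samples
Customer Service Speech (고객응대음성)	2067668	21092
Korean Speech (한국어 음성)	620000	3000
Korean Conversational Speech (한국인 대화 음성)	2483570	142399
Free Conversation Speech, General Male/Female (자유대화음성(일반남녀))	1886882	263371
Welfare Call Center Consultation Data (복지 분야 콜센터 상담데이터)	1096704	206470
In-Vehicle Conversation Data (차량내 대화 데이터)	2624132	332787
Command Speech, Elderly Male/Female (명령어 음성(노인남여))	137467	237469
Total	10916423 (13946 hours)	1206588 (1474 hours)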
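As a quick sanity check on the totals, the average sample length can be derived from the sample counts and hours in the table. The helper name `mean_seconds` is hypothetical; the numbers are the totals listed above.

```python
# Totals taken from the dataset table.
train_samples, train_hours = 10_916_423, 13_946
test_samples, test_hours = 1_206_588, 1_474


def mean_seconds(samples: int, hours: int) -> float:
    """Average duration of one sample in seconds."""
    return hours * 3600 / samples


print(round(mean_seconds(train_samples, train_hours), 2))
print(round(mean_seconds(test_samples, test_hours), 2))
```

Both splits come out at roughly 4.4 to 4.6 seconds per sample, which suggests the data consists mostly of short utterances.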

For more information, see KO STT on Hugging Face.

References

@inproceedings{defossez2020real,
  title={Real Time Speech Enhancement in the Waveform Domain},
  author={Defossez, Alexandre and Synnaeve, Gabriel and Adi, Yossi},
  booktitle={Interspeech},
  year={2020}
}
