nipponjo/arabic-speech-to-text

arabic-speech-to-text

This repository contains the code for training the QuartzNet ASR model (NeMo) on the QCRI-AL Jazeera Corpus.

Data preprocessing

Download the QCRI-AL Jazeera Corpus, then run the three preprocessing scripts in order. The script a_preprocess_xml.py extracts the text segments from the XML files. The script b_filter_ds.py removes segments that contain Latin script or numerals. The script c_split_ds.py splits the remaining segments into a training set and a test set.
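
The filtering and splitting steps can be illustrated with a minimal sketch. This is not the code from b_filter_ds.py or c_split_ds.py; the function names, the 10% test fraction, the fixed seed, and the choice to treat both ASCII and Arabic-Indic digits as "numerals" are all assumptions made for illustration.

```python
import random
import re

# Assumption: "Latin script or numerals" means ASCII letters, ASCII digits,
# and Arabic-Indic / Extended Arabic-Indic digits. The real script may differ.
_LATIN_OR_DIGIT = re.compile(r"[A-Za-z0-9\u0660-\u0669\u06F0-\u06F9]")


def keep_segment(text: str) -> bool:
    """Return True if the segment has no Latin letters and no digits."""
    return _LATIN_OR_DIGIT.search(text) is None


def filter_segments(segments):
    """Drop every segment that contains Latin script or numerals."""
    return [s for s in segments if keep_segment(s)]


def split_segments(segments, test_fraction=0.1, seed=0):
    """Shuffle reproducibly and split into (train, test) lists.

    test_fraction and seed are hypothetical defaults, not values
    taken from the repository.
    """
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

A fixed seed keeps the split reproducible across runs, so the test set stays the same when the model is retrained.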

TODO

  • Upload pretrained model
  • ...