Automatic Arabic Dialect Detection Task

This code reflects the work described in the InterSpeech'2016 paper on Automatic Dialect Detection in Arabic Broadcast Speech.

It also contains a baseline system for the VarDial'2017 shared task on Arabic Dialect Identification.

Requirements

Python (tested with v.2.7.5)
Multi-class SVM (http://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html)

Provided data:

We provide data for five Arabic dialects: Egyptian (EGY), Levantine (LAV), Gulf (GLF), North African (NOR), and Modern Standard Arabic (MSA).
The data comes from broadcast news.

VarDial'2017 shared task: shared data and features.

By default, the VarDial'2017 baseline uses data/train.vardial2017/ for training and data/dev.vardial2017/ for development.
For each dialect, there are two feature files, plus a list of the audio files:
$dialect.words -- lexical features generated using LVCSR, specifically the QCRI MGB-2 submission.
$dialect.ivec -- i-vector based on bottleneck features, with a fixed length of 400 per utterance.
wav.lst -- link to the original audio files; WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz.
Baseline: bottleneck i-vectors reach 57.28% accuracy and lexical features reach 48.43%.
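
Since the baseline relies on the multi-class SVM listed under Requirements, it may help to see how a dense feature vector such as an i-vector maps onto the SVM-light input format (`<class> <index>:<value> ...`). The following is a minimal sketch, not the code used by the scripts in this repository: the function name `to_svm_line` and the dialect-to-class numbering are assumptions made for illustration.

```python
# Hypothetical label mapping (an assumption, not taken from the repository).
# svm_multiclass expects integer class labels starting at 1.
DIALECT_TO_CLASS = {"EGY": 1, "GLF": 2, "LAV": 3, "MSA": 4, "NOR": 5}


def to_svm_line(dialect, vector):
    """Format one dense feature vector as a line in SVM-light format."""
    target = DIALECT_TO_CLASS[dialect]
    # Feature indices are 1-based and must appear in increasing order.
    feats = " ".join("%d:%.6g" % (i, v) for i, v in enumerate(vector, start=1))
    return "%d %s" % (target, feats)


# A toy 3-dimensional vector; the real i-vectors have 400 dimensions.
print(to_svm_line("EGY", [0.5, -1.25, 0.0]))
```

One such line per utterance would make up a training or test file for svm_multiclass.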

InterSpeech'2016 paper shared data.

To reproduce the results in InterSpeech'2016, the script should point to data/train.IS2016/ and data/test.IS2016/ for training and testing.
$dialect.words -- lexical features generated using LVCSR.
$dialect.ivec -- i-vector based on bottleneck features, with a fixed length of 400 per utterance.
$dialect.phones -- phoneme sequence from an automatic phoneme recognition system.
$dialect.phone_duration -- phoneme sequence, and the duration in milliseconds for each phone, e.g., w_030 means phone w for 30 milliseconds.
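
The phone_duration tokens can be split back into a phone label and a duration. The sketch below follows the `w_030` example above. It assumes tokens are separated by whitespace and that each token ends in `_<milliseconds>`; the function names are hypothetical, and any utterance identifier or other extra field on a line would need to be stripped first.

```python
def parse_phone_duration(token):
    """Split a token such as 'w_030' into ('w', 30)."""
    # rpartition splits on the last underscore, in case a phone label
    # itself contains one.
    phone, _, duration = token.rpartition("_")
    return phone, int(duration)


def parse_utterance(line):
    """Parse a whitespace-separated sequence of phone_duration tokens."""
    return [parse_phone_duration(tok) for tok in line.split()]


print(parse_utterance("w_030 a_120"))
```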

Sample code

Run 'run.sh' for an example of how to use the code and the data.

features=phones -- you can use words, phones or ivectors;
context=6 -- for some features, less context might be enough;
NOTE 1: The regularization parameters can be optimized for better performance.
NOTE 2: System combination can be explored as well.
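
To illustrate what the context setting might control, here is a minimal sketch that counts all n-grams up to a maximum length over a token sequence. It assumes that `context` is the maximum n-gram order, which is an interpretation rather than something stated in this README, and `ngram_counts` is a hypothetical helper, not a function from the repository's scripts.

```python
from collections import Counter


def ngram_counts(tokens, context):
    """Count every n-gram of length 1..context in a token sequence."""
    counts = Counter()
    for n in range(1, context + 1):
        # Slide a window of width n across the sequence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Toy phone sequence with a small context for readability.
print(ngram_counts(["a", "b", "a"], 2))
```

Under this reading, a larger context yields a much larger and sparser feature space, which is one reason a smaller value might be enough for word features.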

Citing

This data and the baseline system are described in this paper:

@inproceedings{ali2016automatic,
  author={Ali, Ahmed and Dehak, Najim and Cardinal, Patrick and Khurana, Sameer and Yella, Sree Harsha and Glass, James and Bell, Peter and Renals, Steve},
  title={Automatic Dialect Detection in Arabic Broadcast Speech},
  booktitle={Interspeech},
  address={San Francisco, CA, USA},
  pages={2934--2938},
  year={2016}
}
