Code and data for: Acoustic distances of 300 core words imply Indo-European phylogeny and chronology

Code for acoustic distance (AD) calculation between languages, AD-based evolutionary tree generation, dissimilarity metric calculation between trees, and AD-based chronology estimation; as well as data generated from these calculations on 42 Indo-European languages.

Audio dataset

The example dataset used here, synthesized audio of 300 core words in 42 Indo-European languages, can be retrieved from GitHub or Zenodo. Place the dataset folder acoustic-dist-ie-audio in the parent directory of this repository.

For convenience, we highly recommend downloading the single-file archived dataset from GitHub or Zenodo instead of the original dataset. Place the archive file audios.joblib in the root folder of this repository.
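As a quick sanity check of the layout described above, the following sketch looks for the archive inside the repository folder and for the dataset folder next to it. The function name find_audio_source is hypothetical and not part of this repository; only the two path names come from the text above.

```python
from pathlib import Path


def find_audio_source(repo_dir):
    """Return ("archive", path), ("folder", path), or (None, None).

    Hypothetical helper: checks the two locations named in this README.
    """
    repo_dir = Path(repo_dir)
    archive = repo_dir / "audios.joblib"                  # inside the repository
    folder = repo_dir.parent / "acoustic-dist-ie-audio"   # next to the repository
    if archive.is_file():
        return "archive", archive
    if folder.is_dir():
        return "folder", folder
    return None, None


# Run from the repository folder to see which source would be found.
print(find_audio_source(Path.cwd()))
```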

You can also create new datasets to conduct AD-based investigations of phylogeny and chronology for other languages.

Code

Some Python libraries need to be installed for acoustic feature (AF) extraction, dynamic time warping (DTW) calculation, and clustering.

View libraries

Acoustic distance calculation

Run acoustic_distance.py to extract AFs and compute ADs. The AF, AF normalization method, DTW method, and AD normalization method can be selected via function parameters. Audio metadata is given in audio_metadata.py. The resulting distance matrices are saved as .csv files in acoustic_distances. On the first run, the archived audio dataset file audios.joblib mentioned above is generated (if it does not already exist) to speed up the calculation.
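The generate-once behaviour of audios.joblib follows a common load-or-build caching pattern. The sketch below illustrates that pattern only: it uses the standard-library pickle module as a stand-in for joblib, and the names load_or_build and slow_build are hypothetical rather than functions from this repository.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build):
    """Return cached data if cache_path exists; otherwise build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    data = build()  # the slow step, e.g. reading every audio file
    with cache_path.open("wb") as f:
        pickle.dump(data, f)
    return data


# Demo in a temporary directory: the builder runs only on the first call.
calls = []


def slow_build():
    calls.append(1)
    return {"word": [0.1, 0.2, 0.3]}  # placeholder values, not real audio data


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "audios.pkl"
    first = load_or_build(cache, slow_build)
    second = load_or_build(cache, slow_build)

print(first == second, len(calls))  # True 1
```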

Because the calculation is time-consuming, comment out the parameters you do not need in acoustic_distance.py (Lines 36–46) to save time. AD matrices for all parameter combinations are also stored in this repository in acoustic_distances, so you can skip the AD calculation entirely and go straight to the following steps.
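To get a feel for how quickly the work grows, the sketch below enumerates the options listed in this section as a Cartesian product. The option names are taken from the lists in this section; the function name parameter_grid, and the assumption that every option is combined with every other, are illustrative and not taken from acoustic_distance.py.

```python
from itertools import product

AFS = ["BFCC", "CQCC", "GFCC", "IMFCC", "LFCC", "LPCC", "MFCC",
       "MSRCC", "NGCC", "PNCC", "PSRCC", "RPLP", "logFBank"]
AF_NORMS = ["CMVN", "CMN", "MMVN", "MMN", "none"]
DTW_METHODS = ["DTW-D", "DTW-I"]
AD_NORMS = ["by-sum", "by-max", "none"]


def parameter_grid(afs=AFS, af_norms=AF_NORMS,
                   dtw_methods=DTW_METHODS, ad_norms=AD_NORMS):
    """Hypothetical helper: every (AF, AF norm, DTW, AD norm) combination."""
    return list(product(afs, af_norms, dtw_methods, ad_norms))


full = parameter_grid()
print(len(full))    # 390 = 13 * 5 * 2 * 3

# Restricting the AF and its normalization shrinks the grid considerably.
subset = parameter_grid(afs=["MFCC"], af_norms=["CMVN"])
print(len(subset))  # 6
```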

View available AFs

BFCC: Bark-frequency cepstral coefficients
CQCC: Constant Q cepstral coefficients
GFCC: Gammatone frequency cepstral coefficients
IMFCC: Inverse mel-frequency cepstral coefficients
LFCC: Linear frequency cepstral coefficients
LPCC: Linear predictive cepstral coefficients
MFCC: Mel-frequency cepstral coefficients
MSRCC: Magnitude-based spectral root cepstral coefficients
NGCC: Normalized gammachirp cepstral coefficients
PNCC: Power-normalized cepstral coefficients
PSRCC: Phase-based spectral root cepstral coefficients
RPLP: Relative spectra perceptual linear prediction coefficients
logFBank: Logarithmic filter bank energies

View available AF normalization methods

CMVN: Cepstral mean and variance normalization
CMN: Cepstral mean normalization
MMVN: Matrix mean and variance normalization
MMN: Matrix mean normalization
none: No normalization
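The sketch below shows one plausible reading of these normalization methods on a feature matrix stored as a list of frames. CMN and CMVN are computed per coefficient across frames, which is their usual definition. Treating MMN and MMVN as the same operations over all values of the matrix at once is an assumption based on their names, and normalize_features is a hypothetical function, not the one in acoustic_distance_lib.py.

```python
from statistics import fmean, pstdev


def normalize_features(frames, method="CMVN"):
    """frames: list of frames, each a list of cepstral coefficients."""
    if method == "none":
        return [list(frame) for frame in frames]
    if method in ("CMN", "CMVN"):
        # Statistics per coefficient (column), taken across all frames.
        columns = list(zip(*frames))
        means = [fmean(col) for col in columns]
        if method == "CMVN":
            scales = [pstdev(col) or 1.0 for col in columns]  # guard zero variance
        else:
            scales = [1.0] * len(columns)
        return [[(x - m) / s for x, m, s in zip(frame, means, scales)]
                for frame in frames]
    if method in ("MMN", "MMVN"):
        # Assumed reading: one mean (and variance) for the whole matrix.
        values = [x for frame in frames for x in frame]
        mean = fmean(values)
        scale = (pstdev(values) or 1.0) if method == "MMVN" else 1.0
        return [[(x - mean) / scale for x in frame] for frame in frames]
    raise ValueError(f"unknown normalization method: {method}")


toy = [[1.0, 10.0], [3.0, 30.0]]  # two frames, two coefficients (toy values)
print(normalize_features(toy, "CMN"))   # [[-1.0, -10.0], [1.0, 10.0]]
print(normalize_features(toy, "MMN"))   # [[-10.0, -1.0], [-8.0, 19.0]]
```

With CMVN each coefficient ends up on the same scale, whereas the matrix-wide variants keep the relative magnitudes of the coefficients.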

View available DTW methods

DTW-D: Dependent multi-dimensional DTW
DTW-I: Independent multi-dimensional DTW
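The difference between the two DTW methods can be sketched in a few lines of plain Python. DTW-D finds a single warping path for all feature dimensions together, while DTW-I warps each dimension on its own and sums the results. The squared-difference local cost used here is an assumption, and dtw, dtw_d and dtw_i are illustrative names rather than functions from this repository.

```python
import math


def dtw(a, b, cost):
    """Classic DTW with steps (i-1, j), (i, j-1), (i-1, j-1); two rolling rows."""
    prev = [0.0] + [math.inf] * len(b)
    for x in a:
        cur = [math.inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cur[j] = cost(x, y) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[-1]


def dtw_d(x, y):
    """Dependent DTW: one path, local cost summed over all dimensions."""
    return dtw(x, y, lambda p, q: sum((u - v) ** 2 for u, v in zip(p, q)))


def dtw_i(x, y):
    """Independent DTW: a separate path per dimension, distances summed."""
    return sum(dtw(xd, yd, lambda u, v: (u - v) ** 2)
               for xd, yd in zip(zip(*x), zip(*y)))


# Toy sequences: the two dimensions peak in opposite order in x and y.
x = [[0, 0], [1, 0], [0, 1], [0, 0]]
y = [[0, 0], [0, 1], [1, 0], [0, 0]]
print(dtw_d(x, y))  # 2.0
print(dtw_i(x, y))  # 0.0
```

In this toy case each dimension can be aligned perfectly on its own, so DTW-I is zero, while no single shared path aligns both dimensions at once.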

View available AD normalization methods

by-sum: DTW distance divided by the sum of the lengths of two samples
by-max: DTW distance divided by the length of the longer sample
none: No normalization
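These normalizations compensate for the fact that longer recordings accumulate larger DTW distances. A minimal sketch follows, assuming that "length" means the number of feature frames in each sample; normalize_distance is a hypothetical name, not the function in this repository.

```python
def normalize_distance(dtw_distance, len_a, len_b, method="by-sum"):
    """Normalize a raw DTW distance by the lengths of the two samples."""
    if method == "by-sum":
        return dtw_distance / (len_a + len_b)
    if method == "by-max":
        return dtw_distance / max(len_a, len_b)
    if method == "none":
        return dtw_distance
    raise ValueError(f"unknown AD normalization method: {method}")


# Toy numbers: raw distance 12.0 between samples of 40 and 60 frames.
print(normalize_distance(12.0, 40, 60, "by-sum"))  # 0.12
print(normalize_distance(12.0, 40, 60, "by-max"))  # 0.2
```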

Evolutionary tree generation

Run clustering.py to generate AD-based evolutionary trees from the AD matrices in acoustic_distances. The clustering method can be selected via function parameters. Results in Newick format are saved in trees/newicks.tsv.

View available clustering methods

Complete: Complete-linkage clustering
UPGMA: Unweighted pair group method with arithmetic mean
WPGMA: Weighted pair group method with arithmetic mean
UPGMC: Unweighted pair group method with centroid
WPGMC: Weighted pair group method with centroid
Ward: Ward’s minimum variance method
NJ: Neighbor joining
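The agglomerative methods in this list differ mainly in how the distance from a newly merged cluster to every other cluster is updated. The sketch below covers three of them (Complete, UPGMA, WPGMA) in plain Python and emits the tree topology as a Newick string without branch lengths. It is a simplified illustration with a hypothetical function name and toy distances, not the code in clustering_lib.py; UPGMC, WPGMC, Ward and NJ are not covered.

```python
def cluster_to_newick(names, dist, method="UPGMA"):
    """names: leaf labels; dist: symmetric distance matrix (list of lists)."""
    if method not in ("Complete", "UPGMA", "WPGMA"):
        raise ValueError(f"unsupported method in this sketch: {method}")
    # Active clusters: id -> (Newick fragment, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        (i, j), _ = min(d.items(), key=lambda item: item[1])  # closest pair
        (newick_i, n_i), (newick_j, n_j) = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            d_ki = d.pop((min(k, i), max(k, i)))
            d_kj = d.pop((min(k, j), max(k, j)))
            if method == "Complete":
                new = max(d_ki, d_kj)
            elif method == "UPGMA":   # average weighted by cluster size
                new = (n_i * d_ki + n_j * d_kj) / (n_i + n_j)
            else:                     # WPGMA: plain average of the two
                new = (d_ki + d_kj) / 2
            d[(k, next_id)] = new
        del d[(i, j)]
        clusters[next_id] = (f"({newick_i},{newick_j})", n_i + n_j)
        next_id += 1
    newick, _ = next(iter(clusters.values()))
    return newick + ";"


# Toy distance matrix for four placeholder taxa.
names = ["A", "B", "C", "D"]
dist = [[0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 10],
        [10, 10, 10, 0]]
print(cluster_to_newick(names, dist, "UPGMA"))  # (D,(C,(A,B)));
```

For reference, SciPy's scipy.cluster.hierarchy.linkage exposes the same family under the method names complete, average (UPGMA), weighted (WPGMA), centroid (UPGMC), median (WPGMC) and ward.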

Dissimilarity metric calculation

The R packages Quartet and TreeDist need to be installed for dissimilarity metric calculation.

Run dissimilarity_metric.r in R to compute the Steel–Penny metric and the Robinson–Foulds metric between AD-based trees in trees/newicks.tsv and the reference tree trees/reference_tree.nwk. Results will be saved as trees/dissimilarity_metrics.csv.

The reference tree is a hierarchy of the 42 Indo-European languages, sourced from Glottolog and accompanying the dataset.
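To make the Robinson–Foulds metric concrete, the sketch below computes a simplified rooted variant in Python: each tree is reduced to its set of clades, and the distance is the number of clades found in one tree but not the other. Trees are written as nested tuples instead of Newick strings to keep the example short. This illustrates the idea only; it is not the TreeDist implementation used by dissimilarity_metric.r, and the function names are hypothetical.

```python
def clades(tree):
    """Return the non-trivial clades of a nested-tuple tree as frozensets."""
    found = set()

    def leaves(node):
        if isinstance(node, str):          # a leaf label
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)              # the root clade carries no information
    return found


def robinson_foulds(tree_a, tree_b):
    """Count clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


# Toy trees on four placeholder taxa.
t1 = ("D", ("C", ("A", "B")))
t2 = (("A", "B"), ("C", "D"))
t3 = (("A", "C"), ("B", "D"))
print(robinson_foulds(t1, t1))  # 0
print(robinson_foulds(t1, t2))  # 2
print(robinson_foulds(t1, t3))  # 4
```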

Chronology estimation

Run fitting.py to fit the relationship between AD and date, and to estimate chronology from the AD. Branch and date data for the calibration and prediction points are given in fitting_data.py. Results are printed to the console.
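The calibrate-then-predict workflow can be sketched as follows. This README does not state the functional form used in fitting.py, so the straight-line least-squares fit below is purely an assumption for illustration, and the calibration numbers are made-up toy values rather than data from fitting_data.py.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(ad, slope, intercept):
    """Estimate a date from an acoustic distance using the fitted line."""
    return slope * ad + intercept


# Toy calibration points: (AD, years before present). Not real data.
calibration_ad = [0.2, 0.4, 0.6]
calibration_date = [1000.0, 2000.0, 3000.0]

slope, intercept = fit_line(calibration_ad, calibration_date)
estimate = predict(0.5, slope, intercept)
print(round(estimate))  # 2500 for these toy numbers
```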

Data

acoustic_distances/: AD matrices under all parameter combinations
trees/newicks.tsv: Strings of all AD-based clustering trees in Newick format
trees/reference_tree.nwk: The reference tree in Newick format
trees/dissimilarity_metrics.csv: Dissimilarity metrics of all trees

EL-CL/acoustic-dist-ie
