Code and data for: Acoustic distances of 300 core words imply Indo-European phylogeny and chronology
Code for acoustic distance (AD) calculation between languages, AD-based evolutionary tree generation, dissimilarity metric calculation between trees, and AD-based chronology estimation; as well as data generated from these calculations on 42 Indo-European languages.
The dataset used as the example here, synthesized audio of 300 core words in 42 Indo-European languages, can be retrieved from GitHub or Zenodo. Please place the dataset folder acoustic-dist-ie-audio in the parent directory of this repository.
For convenience, we strongly recommend downloading the single-file archive of the dataset from GitHub or Zenodo instead of the original dataset. Please place the archive file audios.joblib in the folder of this repository.
You can also create new datasets to conduct AD-based investigations of phylogeny and chronology for other languages.
Some Python libraries need to be installed for acoustic feature (AF) extraction, DTW calculation, and clustering.
Run acoustic_distance.py to extract AF and compute AD. The AF, AF normalization method, DTW method, and AD normalization method can be selected via function parameters. Audio metadata is given in audio_metadata.py. The resulting distance matrices will be saved as .csv files in acoustic_distances. On the first run, the archived audio dataset file audios.joblib mentioned above will be generated (if it does not exist) to speed up the calculation.
Since the calculation is quite time-consuming, please comment out the parameter values you do not need in acoustic_distance.py (Lines 36–46) to save time. Alternatively, AD matrices for all parameter combinations are already stored in this repository in acoustic_distances, so you can skip the AD calculation and run the following steps directly.
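To reuse the stored matrices, a reader could load one of the .csv files along the following lines. This is a minimal sketch using only the standard library. The layout it assumes (a header row of language labels, then one labelled row per language) is a guess and should be checked against the actual files. The function name `read_distance_matrix` and the sample values are hypothetical.

```python
import csv
import io


def read_distance_matrix(handle):
    """Parse a labelled square matrix: a header row of names, then one row per name."""
    rows = list(csv.reader(handle))
    names = rows[0][1:]  # skip the empty top-left corner cell
    matrix = [[float(value) for value in row[1:]] for row in rows[1:]]
    return names, matrix


# Stand-in for a file in acoustic_distances; labels and values are made up.
sample = io.StringIO(",A,B,C\nA,0.0,0.4,0.9\nB,0.4,0.0,0.7\nC,0.9,0.7,0.0\n")
names, matrix = read_distance_matrix(sample)

# Sanity check: a distance matrix is expected to be symmetric.
is_symmetric = all(
    matrix[i][j] == matrix[j][i]
    for i in range(len(names))
    for j in range(len(names))
)
```

For a real file, the same function can be called inside `with open(path, newline="") as handle:`.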
Available AFs:
BFCC: Bark-frequency cepstral coefficients
CQCC: Constant Q cepstral coefficients
GFCC: Gammatone frequency cepstral coefficients
IMFCC: Inverse mel-frequency cepstral coefficients
LFCC: Linear frequency cepstral coefficients
LPCC: Linear predictive cepstral coefficients
MFCC: Mel-frequency cepstral coefficients
MSRCC: Magnitude-based spectral root cepstral coefficients
NGCC: Normalized gammachirp cepstral coefficients
PNCC: Power-normalized cepstral coefficients
PSRCC: Phase-based spectral root cepstral coefficients
RPLP: Relative spectra perceptual linear prediction coefficients
logFBank: Logarithmic filter bank energies
Available AF normalization methods:
CMVN: Cepstral mean and variance normalization
CMN: Cepstral mean normalization
MMVN: Matrix mean and variance normalization
MMN: Matrix mean normalization
none: No normalization
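As an illustration of the two cepstral variants, the sketch below applies CMN and CMVN to a small feature matrix stored as a list of frames. It follows the textbook definitions (per-coefficient statistics taken over time) and is not taken from acoustic_distance.py. The function names, the `eps` guard, and the toy values are assumptions.

```python
import math


def cmn(frames):
    """Cepstral mean normalization: subtract each coefficient's mean over time."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]


def cmvn(frames, eps=1e-10):
    """CMVN: CMN followed by scaling each coefficient to unit variance."""
    centered = cmn(frames)
    n = len(centered)
    dims = len(centered[0])
    # Population standard deviation of each (already centered) coefficient.
    stds = [
        math.sqrt(sum(frame[d] ** 2 for frame in centered) / n)
        for d in range(dims)
    ]
    # eps (an assumption) avoids division by zero for constant coefficients.
    return [[frame[d] / (stds[d] + eps) for d in range(dims)] for frame in centered]


# Toy input: 3 frames x 2 coefficients (hypothetical values).
features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
centered = cmn(features)
normalized = cmvn(features)
```

After CMVN, each coefficient has zero mean and approximately unit variance across frames, so coefficients with large raw ranges no longer dominate the frame-to-frame distance.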
Available DTW methods:
DTW-D: Dependent multi-dimensional DTW
DTW-I: Independent multi-dimensional DTW
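The difference between the two variants can be sketched in a few lines of plain Python. DTW-D finds a single warping path using a distance over all dimensions at once, while DTW-I warps each dimension separately and sums the results. The local costs chosen here (Euclidean distance for DTW-D, absolute difference for DTW-I) and the function names are assumptions, and the repository's implementation may differ.

```python
import math


def dtw(a, b, cost):
    """Classic dynamic-programming DTW between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Cheapest way to reach cell (i, j): insertion, deletion, or match.
            step = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = cost(a[i - 1], b[j - 1]) + step
    return acc[n][m]


def dtw_d(x, y):
    """Dependent DTW: one warping path, Euclidean cost over all dimensions."""
    return dtw(x, y, lambda u, v: math.dist(u, v))


def dtw_i(x, y):
    """Independent DTW: warp each dimension separately and sum the distances."""
    total = 0.0
    for d in range(len(x[0])):
        x_d = [frame[d] for frame in x]
        y_d = [frame[d] for frame in y]
        total += dtw(x_d, y_d, lambda u, v: abs(u - v))
    return total


# Toy sequences of 2-dimensional frames (hypothetical values).
x = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
y = [[0.0, 0.0], [2.0, 2.0]]
distance_d = dtw_d(x, y)
distance_i = dtw_i(x, y)
```

Both functions return 0 for identical sequences. In general they give different values because DTW-I allows each dimension its own alignment.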
Available AD normalization methods:
by-sum: DTW distance divided by the sum of the lengths of the two samples
by-max: DTW distance divided by the length of the longer sample
none: No normalization
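The three options amount to the following arithmetic. This is a minimal sketch: the function name `normalize_distance` is hypothetical, and "length" is assumed to mean the number of feature frames in each sample.

```python
def normalize_distance(dist, len_a, len_b, method="by-sum"):
    """Normalize a raw DTW distance by the sample lengths (assumed: frame counts)."""
    if method == "by-sum":
        return dist / (len_a + len_b)
    if method == "by-max":
        return dist / max(len_a, len_b)
    if method == "none":
        return dist
    raise ValueError(f"unknown AD normalization method: {method!r}")


# A raw distance of 12.0 between samples of 3 and 5 frames (made-up numbers).
by_sum = normalize_distance(12.0, 3, 5, "by-sum")  # 12 / (3 + 5)
by_max = normalize_distance(12.0, 3, 5, "by-max")  # 12 / 5
raw = normalize_distance(12.0, 3, 5, "none")
```

Length normalization keeps long words from appearing more distant merely because their warping paths contain more steps.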
Run clustering.py to generate AD-based evolutionary trees from the AD matrices in acoustic_distances. The clustering method can be selected via function parameters. Results in Newick format will be saved in trees/newicks.tsv.
Available clustering methods:
Complete: Complete-linkage clustering
UPGMA: Unweighted pair group method with arithmetic mean
WPGMA: Weighted pair group method with arithmetic mean
UPGMC: Unweighted pair group method with centroid
WPGMC: Weighted pair group method with centroid
Ward: Ward’s minimum variance method
NJ: Neighbor joining
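To make the step from a distance matrix to a Newick string concrete, here is a small pure-Python UPGMA sketch that outputs topology only, without branch lengths. It is an illustration under stated assumptions, not the code in clustering.py, which may rely on a library. The function name `upgma`, the labels, and the distances are hypothetical.

```python
def upgma(names, dist):
    """Average-linkage (UPGMA) clustering; returns a topology-only Newick string."""
    # Active clusters: id -> (Newick fragment, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {
        (i, j): dist[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        (newick_a, size_a), (newick_b, size_b) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # Size-weighted mean equals the mean over all leaf pairs.
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del d[(a, b)]
        clusters[next_id] = (f"({newick_a},{newick_b})", size_a + size_b)
        next_id += 1
    ((newick, _),) = clusters.values()
    return newick + ";"


# Toy symmetric distance matrix over four labels (made-up values).
labels = ["A", "B", "C", "D"]
distances = [
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 6.0, 10.0],
    [6.0, 6.0, 0.0, 10.0],
    [10.0, 10.0, 10.0, 0.0],
]
tree = upgma(labels, distances)
```

With these values the closest pair A and B merges first, then C joins that cluster, and D attaches last.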
R packages Quartet and TreeDist need to be installed for dissimilarity metric calculation.
Run dissimilarity_metric.r in R to compute the Steel–Penny metric and the Robinson–Foulds metric between the AD-based trees in trees/newicks.tsv and the reference tree trees/reference_tree.nwk. Results will be saved as trees/dissimilarity_metrics.csv.
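For readers unfamiliar with the Robinson–Foulds metric, the sketch below shows what it counts: the non-trivial leaf bipartitions (splits) present in one tree but not the other. It is written in Python purely for illustration and is not the TreeDist implementation used by dissimilarity_metric.r. It assumes topology-only Newick strings (no branch lengths) over the same leaf set, and all function names are hypothetical.

```python
def clades(newick):
    """Return (all leaves, set of clades) for a topology-only Newick string."""
    s = newick.strip().rstrip(";")
    found = set()
    pos = 0

    def parse():
        nonlocal pos
        if s[pos] == "(":
            leaves = frozenset()
            pos += 1  # skip "("
            while True:
                leaves |= parse()
                if s[pos] == ",":
                    pos += 1
                else:
                    break
            pos += 1  # skip ")"
            found.add(leaves)
            return leaves
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return frozenset([s[start:pos]])

    return parse(), found


def splits(newick):
    """Non-trivial bipartitions, each stored as the side without a reference leaf."""
    leaves, found = clades(newick)
    ref = min(leaves)
    result = set()
    for clade in found:
        side = leaves - clade if ref in clade else clade
        if 1 < len(side) < len(leaves) - 1:  # drop trivial splits
            result.add(side)
    return result


def robinson_foulds(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


t1 = "((A,B),(C,D),E);"
t2 = "((A,C),(B,D),E);"
t1_rerooted = "(((A,B),(C,D)),E);"  # same unrooted topology as t1
rf_different = robinson_foulds(t1, t2)
rf_same = robinson_foulds(t1, t1_rerooted)
```

Because splits ignore the root position, `t1` and `t1_rerooted` are at distance 0, while `t1` and `t2` share no non-trivial split.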
The reference tree is a hierarchy of the 42 Indo-European languages, sourced from Glottolog, that accompanies the dataset.
Run fitting.py to fit the relationship between AD and date and to estimate chronology from AD. Branch and date data for the calibration and prediction points are given in fitting_data.py. Results will be printed.
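The general idea of calibrating on known points and then predicting dates for others can be sketched with an ordinary least-squares line. The functional form actually used in fitting.py is not stated here, so the straight line is only an assumption, and the function name and all numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


# Hypothetical calibration points: AD of a branch vs. its known age in years.
calibration_ad = [0.10, 0.20, 0.30]
calibration_age = [500.0, 1000.0, 1500.0]
slope, intercept = fit_line(calibration_ad, calibration_age)

# Estimate the age of a branch from its AD (hypothetical prediction point).
predicted_age = slope * 0.25 + intercept
```

Any other monotonic model could be substituted for `fit_line` with the same calibrate-then-predict structure.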
acoustic_distances/: AD matrices under all parameter combinations
trees/newicks.tsv: Strings of all AD-based clustering trees in Newick format
trees/reference_tree.nwk: The reference tree in Newick format
trees/dissimilarity_metrics.csv: Dissimilarity metrics of all trees