
Code and data for: Acoustic distances of 300 core words imply Indo-European phylogeny and chronology





Code for acoustic distance (AD) calculation between languages, AD-based evolutionary tree generation, dissimilarity metric calculation between trees, and AD-based chronology estimation; as well as data generated from these calculations on 42 Indo-European languages.

Audio dataset

The dataset used as the example here, synthesized audio of 300 core words in 42 Indo-European languages, can be retrieved from GitHub or Zenodo. Place the dataset folder acoustic-dist-ie-audio in the parent directory of this repository.

For convenience, we highly recommend downloading the single-file archived dataset from GitHub or Zenodo instead of the original dataset. Put the archive file audios.joblib in the folder of this repository.

You can also create new datasets to conduct AD-based investigations of phylogeny and chronology for other languages.


Some Python libraries need to be installed for acoustic feature (AF) extraction, DTW calculation, and clustering.


Acoustic distance calculation

Run the AD calculation script to extract AFs and compute ADs. The AF, AF normalization method, DTW method, and AD normalization method can be selected via function parameters. Audio metadata is given in the accompanying metadata file. The resulting distance matrices are saved as .csv files in acoustic_distances. On the first run, the archived audio dataset file audios.joblib mentioned above is generated (if it does not already exist) to speed up later calculations.

Since the calculation is time-consuming, comment out the parameters you do not want to extract (Lines 36–46 of the script) to save time. AD matrices for all parameter combinations are also stored in this repository in acoustic_distances, so you can skip the AD calculation and go directly to the following steps.

View available AFs
  • BFCC: Bark-frequency cepstral coefficients
  • CQCC: Constant Q cepstral coefficients
  • GFCC: Gammatone frequency cepstral coefficients
  • IMFCC: Inverse mel-frequency cepstral coefficients
  • LFCC: Linear frequency cepstral coefficients
  • LPCC: Linear predictive cepstral coefficients
  • MFCC: Mel-frequency cepstral coefficients
  • MSRCC: Magnitude-based spectral root cepstral coefficients
  • NGCC: Normalized gammachirp cepstral coefficients
  • PNCC: Power-normalized cepstral coefficients
  • PSRCC: Phase-based spectral root cepstral coefficients
  • RPLP: Relative spectra perceptual linear prediction coefficients
  • logFBank: Logarithmic filter bank energies
View available AF normalization methods
  • CMVN: Cepstral mean and variance normalization
  • CMN: Cepstral mean normalization
  • MMVN: Matrix mean and variance normalization
  • MMN: Matrix mean normalization
  • none: No normalization
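As a rough illustration of the two cepstral options above, the following NumPy sketch applies CMN and CMVN to a feature matrix. It assumes features are stored as an array of shape (frames, coefficients); the function names `cmn` and `cmvn` are hypothetical and are not taken from this repository. MMN and MMVN are left out because this README does not define them beyond their names.

```python
import numpy as np

def cmn(features):
    """Cepstral mean normalization: subtract each coefficient's mean over time."""
    return features - features.mean(axis=0, keepdims=True)

def cmvn(features, eps=1e-10):
    """Cepstral mean and variance normalization: zero mean and unit variance
    per coefficient, computed over the time axis."""
    centered = features - features.mean(axis=0, keepdims=True)
    return centered / (features.std(axis=0, keepdims=True) + eps)

# Synthetic stand-in for an extracted feature matrix:
# 50 frames, 13 coefficients (assumed layout: frames x coefficients).
rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(50, 13))

normalized = cmvn(features)
print(normalized.mean(axis=0).round(6))  # per-coefficient means, close to zero
```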
View available DTW methods
  • DTW-D: Dependent multi-dimensional DTW
  • DTW-I: Independent multi-dimensional DTW
View available AD normalization methods
  • by-sum: DTW distance divided by the sum of the lengths of two samples
  • by-max: DTW distance divided by the length of the longer sample
  • none: No normalization
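The DTW variants and AD normalization options listed above can be sketched in plain NumPy as follows. This is a minimal illustration, not the repository's implementation: it assumes squared differences as the local cost and no warping window, and the function names (`dtw_cost`, `dtw_d`, `dtw_i`, `normalize_distance`) are hypothetical. DTW-D warps all feature dimensions along one shared path, while DTW-I warps each dimension separately and sums the results.

```python
import numpy as np

def dtw_cost(cost):
    """Accumulated DTW cost for a precomputed local cost matrix of shape (n, m)."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return acc[n, m]

def dtw_d(a, b):
    """Dependent DTW: one warping path shared by all dimensions.
    a has shape (n, d), b has shape (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return dtw_cost((diff ** 2).sum(axis=2))

def dtw_i(a, b):
    """Independent DTW: a separate warping path per dimension, costs summed."""
    return sum(
        dtw_cost((a[:, None, k] - b[None, :, k]) ** 2)
        for k in range(a.shape[1])
    )

def normalize_distance(dist, len_a, len_b, method="by-sum"):
    """Apply one of the AD normalization options named above."""
    if method == "by-sum":
        return dist / (len_a + len_b)
    if method == "by-max":
        return dist / max(len_a, len_b)
    if method == "none":
        return dist
    raise ValueError(f"unknown normalization method: {method}")

# Two synthetic feature sequences of different lengths (frames x coefficients).
rng = np.random.default_rng(1)
x = rng.normal(size=(20, 13))
y = rng.normal(size=(25, 13))
print(normalize_distance(dtw_d(x, y), len(x), len(y), "by-sum"))
print(normalize_distance(dtw_i(x, y), len(x), len(y), "by-max"))
```

With this squared-difference cost, DTW-I never exceeds DTW-D for the same pair, because each dimension is free to choose its own optimal path.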

Evolutionary tree generation

Run the tree generation script to generate AD-based evolutionary trees from the AD matrices in acoustic_distances. The clustering method can be selected via function parameters. Results in Newick format are saved in trees/newicks.tsv.

View available clustering methods
  • Complete: Complete-linkage clustering
  • UPGMA: Unweighted pair group method with arithmetic mean
  • WPGMA: Weighted pair group method with arithmetic mean
  • UPGMC: Unweighted pair group method with centroid
  • WPGMC: Weighted pair group method with centroid
  • Ward: Ward’s minimum variance method
  • NJ: Neighbor joining
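For orientation, the hierarchical methods above map onto SciPy's `linkage` methods, and the resulting tree can be serialized to a topology-only Newick string. The sketch below is an assumption about how this could be done, not the repository's code: the names `SCIPY_METHODS`, `to_newick`, and `cluster_to_newick` are hypothetical, the distance matrix is synthetic, and branch lengths are omitted. Neighbor joining is not available in SciPy and is not covered here. Note also that SciPy documents the centroid, median, and Ward methods as correctly defined only for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

# Mapping from the method names used above to SciPy's linkage method names.
SCIPY_METHODS = {
    "Complete": "complete",
    "UPGMA": "average",
    "WPGMA": "weighted",
    "UPGMC": "centroid",
    "WPGMC": "median",
    "Ward": "ward",
}

def to_newick(node, labels):
    """Recursively serialize a SciPy ClusterNode as a topology-only Newick string."""
    if node.is_leaf():
        return labels[node.id]
    left = to_newick(node.get_left(), labels)
    right = to_newick(node.get_right(), labels)
    return f"({left},{right})"

def cluster_to_newick(matrix, labels, method="UPGMA"):
    """Cluster a square, symmetric distance matrix and return a Newick string."""
    condensed = squareform(matrix, checks=True)
    merges = linkage(condensed, method=SCIPY_METHODS[method])
    return to_newick(to_tree(merges), labels) + ";"

# Synthetic 4 x 4 distance matrix with placeholder labels (not real AD values).
labels = ["A", "B", "C", "D"]
matrix = np.array([
    [0.0, 1.0, 5.0, 5.0],
    [1.0, 0.0, 5.0, 5.0],
    [5.0, 5.0, 0.0, 2.0],
    [5.0, 5.0, 2.0, 0.0],
])
print(cluster_to_newick(matrix, labels, "UPGMA"))
```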

Dissimilarity metric calculation

R packages Quartet and TreeDist need to be installed for dissimilarity metric calculation.

Run dissimilarity_metric.r in R to compute the Steel–Penny metric and the Robinson–Foulds metric between AD-based trees in trees/newicks.tsv and the reference tree trees/reference_tree.nwk. Results will be saved as trees/dissimilarity_metrics.csv.

The reference tree here is a hierarchy of the 42 Indo-European languages, sourced from Glottolog and provided with the dataset.

Chronology estimation

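Run the chronology estimation script to fit the relationship between AD and date, and to estimate chronology from the AD. Branch and date data for the calibration and prediction points are given in the accompanying data file. Results are printed to the console.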
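The general idea of fitting AD against date on calibration points and then predicting dates for other branches can be sketched as below. Everything specific here is an assumption: this README does not state the functional form used, so a straight line fitted with `numpy.polyfit` stands in for it, and the calibration values are synthetic placeholders, not data from the paper or this repository. The name `estimate_years` is hypothetical.

```python
import numpy as np

# Synthetic placeholder calibration points (NOT values from the paper or repository):
# acoustic distance at a branch point, and that branch point's age in years.
calib_ad = np.array([0.10, 0.20, 0.30, 0.40])
calib_years = np.array([1000.0, 2000.0, 3000.0, 4000.0])

# Least-squares straight-line fit (assumed form): years = slope * AD + intercept.
slope, intercept = np.polyfit(calib_ad, calib_years, deg=1)

def estimate_years(ad):
    """Predict an age from an acoustic distance using the fitted line."""
    return slope * ad + intercept

# Predict ages for hypothetical prediction points.
for ad in (0.15, 0.25):
    print(f"AD = {ad:.2f} -> estimated age = {estimate_years(ad):.0f} years")
```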


