This is an implementation of the 2D Mixture of Gaussians (MOG) model from Toscano & McMurray (2010), which was used in my Master's thesis (Differential Cue Weighting in Sibilants: A Case Study of Two Sinitic Languages). For more detailed information about the model and the simulation process, see the original papers listed in the references.
To start off a simulation, first create a GMM_SGD object:
import numpy as np
from GMM_SGD import GMM_SGD  # assuming the class is defined in GMM_SGD.py

gmm = GMM_SGD(
    k=20,
    init_mu_means=np.array([0, 0], dtype=float),
    init_mu_covariances=np.array([[1.5**2, 0], [0, 1.5**2]], dtype=float),
    init_covariances=np.array([[0.5**2, 0], [0, 0.5**2]], dtype=float)
)
This will immediately generate the initial distributions based on the specified parameters. The parameter values, as well as a figure of the initial Gaussians, will be shown on the control panel:
There are 20 clusters.
mu_x = [ 1.701 -2.322 -1.876 1.044 0.416 -2.034 -1.759 1.301 1.981 1.65 2.738 1.936 1.089 1.683 -1.146 -0.997 0.188 0.424 -0.534 -1.647]
mu_y = [-1.074 -0.332 1.966 1.607 0.226 1.551 2.022 -0.972 -3.033 0.655 -2.275 2.038 -0.266 -2.438 -0.81 2.201 2.15 2.894 1.218 1.282]
sd_x = [0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
sd_y = [0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
rho = [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
pi = [0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05]
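For reference, the initialization amounts to something like the following (a minimal sketch based on my reading of the parameters and the output above, not the package's actual code): each cluster mean is drawn from a Gaussian centred at init_mu_means with covariance init_mu_covariances, and every cluster starts with the standard deviations implied by init_covariances, zero correlation, and a uniform prior of 1/k.
import numpy as np

k = 20
rng = np.random.default_rng()
# Cluster means are scattered around init_mu_means with spread init_mu_covariances.
mu = rng.multivariate_normal([0.0, 0.0], [[1.5**2, 0], [0, 1.5**2]], size=k)
# Every cluster starts with the same diagonal covariance, hence sd = 0.5 and rho = 0.
sd = np.full((k, 2), 0.5)
rho = np.zeros(k)
# Uniform priors: each of the k Gaussians is initially equally likely.
pi = np.full(k, 1 / k)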
Next, we generate the training distributions by inputting a dataframe containing the real-world dataset. The training distributions are then fitted to the provided dataset, which should look something like the following:
import pandas as pd
from scipy.stats import zscore

data = pd.read_excel('data_all.xlsx', sheet_name='4.0')

### Data cleaning ###
data = data[~data['Onset'].isin(['ʒ'])]
data = data[~data['Vowel'].isin(['ai', 'iu', 'e', 'o'])]
data = data[~data['IPA'].isin(['ʃɿ'])]
df_TM = data[data['Language'] == 'TM']
df_SH = data[data['Language'] == 'SH']
df_NH = data[data['Language'] == 'NH']
# Remove outliers (absolute z-score > 3 on any acoustic measure, i.e. columns cog through F3_intercept).
df_TM = df_TM[(np.abs(zscore(df_TM.iloc[:, 7:18], ddof=1)) < 3).all(axis=1)]
df_SH = df_SH[(np.abs(zscore(df_SH.iloc[:, 7:18], ddof=1)) < 3).all(axis=1)]
df_NH = df_NH[(np.abs(zscore(df_NH.iloc[:, 7:18], ddof=1)) < 3).all(axis=1)]
df_TM.head() # Here we use the TM data as an example.
Language Speaker Token Repetition IPA Onset Vowel cog sd skewness kurtosis peak F1_slope F1_intercept F2_slope F2_intercept F3_slope F3_intercept HF
0 TM CYT RND001 1.0 ɕou ɕ ou 5228 2138 0.19 -1.24 3086 0.906794 -0.967879 -0.900481 -0.373417 1.119548 -0.560756 N
1 TM CYT RND002 1.0 ʂa ʂ a 5135 1635 1.31 1.07 3882 1.307575 0.579932 -0.471833 -0.047246 -0.030565 -0.984602 N
2 TM CYT RND003 1.0 ɕa ɕ a 6073 1146 0.86 2.74 5973 1.575011 0.297072 -0.822768 0.136257 -1.761003 -0.537840 N
3 TM CYT RND004 1.0 ɕi ɕ i 6146 1434 0.70 1.43 6158 -0.172154 -1.031815 0.462426 2.188492 0.599728 1.267603 Y
4 TM CYT RND005 1.0 ʂɿ ʂ ɿ 4787 1869 0.78 -0.48 3086 0.035575 -1.136136 -1.987920 0.826618 0.098790 -0.033738 N
The input dataframe should contain only the relevant columns, which in our case are the category label and the feature values for each observation. For example:
# This corresponds to the third set of simulations for learning TM (refer to page 38 of the thesis).
# Note that the first column should be the labels.
df = df_TM.loc[:, ['Onset', 'cog', 'F2_intercept', 'skewness']]
df.head(10)
Onset cog F2_intercept skewness
0 ɕ 5228 -0.373417 0.19
1 ʂ 5135 -0.047246 1.31
2 ɕ 6073 0.136257 0.86
3 ɕ 6146 2.188492 0.70
4 ʂ 4787 0.826618 0.78
5 s 8979 -0.013218 -0.58
7 ʂ 4938 -0.616531 0.62
8 s 9001 0.085482 -0.62
9 ɕ 5353 -0.183715 0.14
10 ʂ 4897 -0.773203 1.01
To generate the training distributions, use the generate_training_distributions() method:
gmm.generate_training_distributions(df)
Note that if there are more than two input features, principal component analysis (PCA) is first applied to reduce the feature vector to two dimensions (since the model is 2D); a sketch of this step is given after the output below. The fitted distributions for the individual categories will be shown on the control panel:
There are 3 training categories: ['s', 'ɕ', 'ʂ']
mu_x = [ 1.66 -0.782 -0.9 ]
mu_y = [-0.131 0.667 -0.556]
sd_x = [0.738 0.678 0.769]
sd_y = [0.737 0.979 0.745]
rho = [ 0.56 -0.125 0.066]
pi = [0.333 0.333 0.333]
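Roughly speaking, the fitting step boils down to the following sketch. It assumes the features are standardized before PCA and that each category is summarized by its sample means, standard deviations, and correlation; the actual implementation may differ in detail.
import numpy as np
from sklearn.decomposition import PCA

labels = df.iloc[:, 0]                            # first column holds the category labels
features = df.iloc[:, 1:].to_numpy(dtype=float)   # remaining columns are the cues

# Reduce to two dimensions when more than two cues are supplied.
if features.shape[1] > 2:
    features = (features - features.mean(axis=0)) / features.std(axis=0, ddof=1)
    features = PCA(n_components=2).fit_transform(features)

# Fit one bivariate Gaussian per category.
for cat in sorted(labels.unique()):
    x, y = features[(labels == cat).to_numpy()].T
    print(cat, x.mean(), y.mean(), x.std(ddof=1), y.std(ddof=1), np.corrcoef(x, y)[0, 1])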
To start the simulation, use the sim() method and specify the values for each argument:
gmm.sim(n_generation=1,
        n_data=100000,
        lr_pi=0.001,
        lr_mu=0.01,
        lr_sd=0.001,
        lr_rho=0.001,
        threshold=0.01)
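The lr_* arguments are the learning rates for the mixture weights, means, standard deviations, and correlations, and n_data is the number of training tokens sampled per generation. The flavour of the per-token update follows Toscano & McMurray (2010): each incoming observation nudges every Gaussian in proportion to its posterior responsibility. A schematic sketch is given below (illustrative only, not the exact update rules implemented here; the sd and rho updates follow the same pattern as the means):
import numpy as np
from scipy.stats import multivariate_normal

def sketch_update(x, pi, mu, cov, lr_pi=0.001, lr_mu=0.01):
    # Likelihood of the observation under each of the k Gaussians.
    dens = np.array([multivariate_normal.pdf(x, mean=m, cov=c) for m, c in zip(mu, cov)])
    # Posterior responsibility of each Gaussian for this observation.
    r = pi * dens / np.sum(pi * dens)
    # Mixture weights drift toward the responsibilities.
    pi = pi + lr_pi * (r - pi)
    # Means move toward the observation, weighted by responsibility.
    mu = mu + lr_mu * r[:, None] * (x - mu)
    return pi, mu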
After the simulation completes, only the Gaussians whose prior probability (pi) exceeds the threshold value are counted as stable categories and shown in the results. The model is considered successful if it converges to the same number of categories as the training distributions (in this case, 3 categories).
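In code, the check amounts to something like this (gmm.pi is a hypothetical attribute name used only for illustration):
n_stable = int(np.sum(gmm.pi > 0.01))   # count Gaussians whose weight exceeds the threshold
success = n_stable == 3                 # 3 training categories in this example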
['s', 'ɕ', 'ʂ']
Generation 1:
100000it [07:57, 209.47it/s]
Success!
training_mu_x = [ 1.66 -0.782 -0.9 ]
training_mu_y = [-0.131 0.667 -0.556]
training_sd_x = [0.738 0.678 0.769]
training_sd_y = [0.737 0.979 0.745]
training_rho = [ 0.56 -0.125 0.066]
training_pi = [0.333 0.333 0.333]
resulting_mu_x = [ 1.494 -0.807 -0.939]
resulting_mu_y = [-0.336 0.736 -0.468]
resulting_sd_x = [0.753 0.651 0.722]
resulting_sd_y = [0.77 0.956 0.764]
resulting_rho = [ 0.547 -0.145 0.089]
resulting_pi = [0.359 0.315 0.327]
Training Bhattacharyya distances: {'s-ɕ': 1.896, 's-ʂ': 1.52, 'ɕ-ʂ': 0.28}
Mean training Bhattacharyya distance: 1.232
Resulting Bhattacharyya distances: {'s-ɕ': 1.868, 's-ʂ': 1.507, 'ɕ-ʂ': 0.271}
Mean resulting Bhattacharyya distance: 1.215
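The Bhattacharyya distances reported above quantify how well separated each pair of categories is (larger values mean less overlap). For two bivariate Gaussians they follow the standard closed-form expression; the sketch below reproduces the training 's-ʂ' value from the parameters printed earlier.
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    # Standard closed-form Bhattacharyya distance between two Gaussians.
    cov = (np.asarray(cov1) + np.asarray(cov2)) / 2
    diff = np.asarray(mu1) - np.asarray(mu2)
    term1 = diff @ np.linalg.inv(cov) @ diff / 8
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

# Covariances rebuilt from sd_x, sd_y, and rho of the training 's' and 'ʂ' categories.
cov_s = [[0.738**2, 0.56 * 0.738 * 0.737], [0.56 * 0.738 * 0.737, 0.737**2]]
cov_sr = [[0.769**2, 0.066 * 0.769 * 0.745], [0.066 * 0.769 * 0.745, 0.745**2]]
print(bhattacharyya([1.66, -0.131], cov_s, [-0.9, -0.556], cov_sr))   # ≈ 1.52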
To manually specify the weight for each training category, simply use the training_pi argument, for example:
training_pi = np.array([0.45, 0.1, 0.45])
gmm.generate_training_distributions(df, training_pi=training_pi)
There are 3 training categories: ['s', 'ɕ', 'ʂ']
mu_x = [ 1.66 -0.782 -0.9 ]
mu_y = [-0.131 0.667 -0.556]
sd_x = [0.738 0.678 0.769]
sd_y = [0.737 0.979 0.745]
rho = [ 0.56 -0.125 0.066]
pi = [0.45 0.1 0.45]