A visualization tool based on phonetic articulatory feature probing of the wav2vec 2.0 base model.
Patrick Cormac English, Erfan A. Shams, John D. Kelleher, Julie Carson-Berndsen
Following the Embedding: Identifying Transition Phenomena in Wav2vec 2.0 Representations of Speech Audio
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Clone the repository using the command below:
git clone https://github.com/erfanashams/w2v2viz.git
Navigate to the w2v2viz folder and install the requirements:
cd w2v2viz
pip install -r requirements.txt
Import the extended version of w2v2viz:
from w2v2viz_ext import w2v2viz_ext
Run w2v2viz on the provided sample wav file:
file = r"TIMIT_sample/LDC93S1.wav"
w2v2viz_ext(filename=file)
Alternatively, run test.py for an example of how to use the module with an existing .wav file:
python test.py
We processed audio data from the TIMIT dataset using the wav2vec 2.0 model to generate embeddings for computational phonology analysis. The following steps were taken:
- **Embedding Generation:** We ran the wav2vec 2.0 model on each .wav file from the TIMIT training and test sets to produce embeddings. The model was configured to output hidden states of shape [13 × N × 768] for each file, where N is the number of frames determined by the audio length and 768 is the dimensionality of each embedding.
- **Time-Step Representation:** Each [1 × 768] slice within the tensor corresponds to a 25 ms frame of the speech signal with a 20 ms stride. These time-step representations are consistent across all 12 transformer layers and the initial CNN output (layer 0).
- **Phone Labelling:** Using TIMIT's time-aligned phonetic annotations, we mapped each time-step representation to its corresponding phone label by aligning the time-step positions in the [N × 768] sequence with the TIMIT annotations (a minimal extraction and alignment sketch is given after this list).
- **Test Set Separation:** We set aside 258,040 time-step samples from the TIMIT test set for evaluation purposes, as detailed in the probing task section below.
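The extraction and alignment steps above can be sketched as follows. This is a minimal illustration, assuming the Hugging Face `facebook/wav2vec2-base` checkpoint and the `transformers`, `torch`, and `soundfile` packages; the helper functions (`frame_to_time`, `label_frames`) are illustrative and not part of w2v2viz.

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load the pre-trained wav2vec 2.0 base model (CNN output + 12 transformer layers).
model_name = "facebook/wav2vec2-base"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
model.eval()

# Read a 16 kHz TIMIT wav file.
speech, sr = sf.read("TIMIT_sample/LDC93S1.wav")
inputs = feature_extractor(speech, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Stack layer 0 (CNN output) and the 12 transformer layers: shape [13, N, 768].
hidden_states = torch.stack(outputs.hidden_states).squeeze(1)

# Each frame covers roughly 25 ms of audio with a 20 ms stride.
def frame_to_time(i, stride=0.020, window=0.025):
    return i * stride, i * stride + window

# Map each frame to the TIMIT phone whose interval contains the frame centre.
# phn_intervals is a list of (start_sample, end_sample, phone) tuples from a .PHN file;
# frames outside any annotated interval fall back to the silence label "h#".
def label_frames(phn_intervals, num_frames, sr=16000):
    labels = []
    for i in range(num_frames):
        start, end = frame_to_time(i)
        centre = (start + end) / 2
        label = next((p for s, e, p in phn_intervals if s / sr <= centre < e / sr), "h#")
        labels.append(label)
    return labels
```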
For the training set, we implemented the following procedure:
- **Phone-Averaged Representations:** Following a modified approach from Shah et al. (2021), we averaged the embeddings of each phone occurrence in the audio to create phone-averaged representations. This process resulted in 13 datasets (one for each layer) with 175,232 phone-averaged representations; a sketch of this step is given below.
- **Feature Annotation:** Each phone-averaged representation was annotated with phonetic articulatory feature labels. These annotations were derived directly from the phone labels.
The same feature annotation method was applied to the 258,040 time-step representations from the TIMIT test set.
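A minimal sketch of the phone-averaging step, assuming frame-level embeddings of shape [13 × N × 768] and aligned frame labels as produced above; the function name `phone_average` is illustrative and not part of the package.

```python
import numpy as np

def phone_average(frame_embeddings, frame_labels):
    """Average consecutive frames sharing a phone label, separately for each layer.

    frame_embeddings: array of shape [13, N, 768] (layers x frames x dimensions)
    frame_labels:     list of N phone labels aligned with the frames
    Returns an array of shape [num_phones, 13, 768] and the per-occurrence labels.
    """
    averaged, labels = [], []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current segment when the label changes or the input ends.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            averaged.append(frame_embeddings[:, start:i, :].mean(axis=1))
            labels.append(frame_labels[start])
            start = i
    return np.stack(averaged), labels
```

Applying this per utterance and concatenating the results layer by layer yields phone-averaged datasets of the kind described above.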
Aggregation is performed by averaging the embeddings of each phone occurrence (phone-averaging), as described above.
The phonetic articulatory features are based on the IPA charts (with ARPAbet notation) and are defined in the table below, where cat is the phone category (con = consonant, vow = vowel, sil = silence/closure), poa and moa are the place and manner of articulation, and back, height, and rounding describe vowel quality:
phone | cat | poa | moa | voicing | back | height | rounding |
---|---|---|---|---|---|---|---|
b | con | 0 | 0 | 1 | - | - | - |
d | con | 3 | 0 | 1 | - | - | - |
g | con | 7 | 0 | 1 | - | - | - |
p | con | 0 | 0 | 0 | - | - | - |
t | con | 3 | 0 | 0 | - | - | - |
k | con | 7 | 0 | 0 | - | - | - |
dx | con | 3 | 3 | 1 | - | - | - |
q | con | 10 | 0 | 0 | - | - | - |
bcl | sil | - | - | - | - | - | - |
dcl | sil | - | - | - | - | - | - |
gcl | sil | - | - | - | - | - | - |
pcl | sil | - | - | - | - | - | - |
tcl | sil | - | - | - | - | - | - |
kcl | sil | - | - | - | - | - | - |
tck | sil | - | - | - | - | - | - |
jh | con | 4 | 4 | 1 | - | - | - |
ch | con | 4 | 4 | 0 | - | - | - |
s | con | 3 | 4 | 0 | - | - | - |
sh | con | 4 | 4 | 0 | - | - | - |
z | con | 3 | 4 | 1 | - | - | - |
zh | con | 4 | 4 | 1 | - | - | - |
f | con | 1 | 4 | 0 | - | - | - |
th | con | 2 | 4 | 0 | - | - | - |
v | con | 1 | 4 | 1 | - | - | - |
dh | con | 2 | 4 | 1 | - | - | - |
m | con | 0 | 1 | 1 | - | - | - |
n | con | 3 | 1 | 1 | - | - | - |
ng | con | 7 | 1 | 1 | - | - | - |
em | con | 0 | 1 | 1 | - | - | - |
en | con | 3 | 1 | 1 | - | - | - |
eng | con | 7 | 1 | 1 | - | - | - |
nx | con | 3 | 1 | 1 | - | - | - |
l | con | 3 | 7 | 1 | - | - | - |
r | con | 3 | 6 | 1 | - | - | - |
w | con | 1 | 6 | 1 | - | - | - |
y | con | 6 | 6 | 1 | - | - | - |
hh | con | 10 | 4 | 0 | - | - | - |
hv | con | 10 | 4 | 1 | - | - | - |
el | con | 3 | 7 | 0 | - | - | - |
iy | vow | - | - | - | 0 | 0 | 0 |
ih | vow | - | - | - | 1 | 1 | 0 |
eh | vow | - | - | - | 0 | 4 | 0 |
ey | vow | - | - | - | 0 | 2 | 0 |
ae | vow | - | - | - | 0 | 5 | 0 |
aa | vow | - | - | - | 0 | 6 | 0 |
aw | vow | - | - | - | 0 | 6 | 1 |
ay | vow | - | - | - | 0 | 6 | 0 |
ah | vow | - | - | - | 4 | 4 | 0 |
ao | vow | - | - | - | 4 | 4 | 1 |
oy | vow | - | - | - | 1 | 1 | 0 |
ow | vow | - | - | - | 3 | 1 | 1 |
uh | vow | - | - | - | 3 | 1 | 1 |
uw | vow | - | - | - | 4 | 0 | 1 |
ux | vow | - | - | - | 2 | 0 | 1 |
er | vow | - | - | - | 1 | 1 | 0 |
ax | vow | - | - | - | 2 | 3 | 0 |
ix | vow | - | - | - | 2 | 0 | 0 |
axr | vow | - | - | - | 1 | 1 | 0 |
ax-h | vow | - | - | - | 2 | 3 | 0 |
pau | sil | - | - | - | - | - | - |
epi | sil | - | - | - | - | - | - |
h# | sil | - | - | - | - | - | - |
1 | sil | - | - | - | - | - | - |
2 | sil | - | - | - | - | - | - |
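As a rough illustration, the table can be read as a lookup keyed by ARPAbet symbol. The entries below are copied from the rows above; the `FEATURES` dict and `features_for` helper are illustrative and not part of the package.

```python
# Articulatory feature lookup keyed by ARPAbet phone symbol.
# Values are copied from the table above; '-' entries become None.
FEATURES = {
    "b":  {"cat": "con", "poa": 0, "moa": 0, "voicing": 1,
           "back": None, "height": None, "rounding": None},
    "s":  {"cat": "con", "poa": 3, "moa": 4, "voicing": 0,
           "back": None, "height": None, "rounding": None},
    "iy": {"cat": "vow", "poa": None, "moa": None, "voicing": None,
           "back": 0, "height": 0, "rounding": 0},
    "h#": {"cat": "sil", "poa": None, "moa": None, "voicing": None,
           "back": None, "height": None, "rounding": None},
}

def features_for(phone: str) -> dict:
    """Return the articulatory feature annotation for an ARPAbet phone label."""
    return FEATURES[phone]

print(features_for("iy"))  # {'cat': 'vow', ..., 'back': 0, 'height': 0, 'rounding': 0}
```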
@INPROCEEDINGS{10446494,
author={English, Patrick Cormac and Shams, Erfan A. and Kelleher, John D. and Carson-Berndsen, Julie},
booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Following the Embedding: Identifying Transition Phenomena in Wav2vec 2.0 Representations of Speech Audio},
year={2024},
volume={},
number={},
pages={6685-6689},
keywords={Training;Data visualization;Speech recognition;Signal processing;Transformers;Feature extraction;Vectors;Speech Recognition;Phonetic Representations;Probing;Explainable AI},
doi={10.1109/ICASSP48485.2024.10446494}}