The Plenary Sessions of the Parliament of Finland dataset is a sizable corpus of transcribed Finnish audio. The transcriptions consist of word-aligned annotations in EAF files. To convert them into a more convenient form for speaker verification tasks, the annotations need to be grouped into utterances and the utterances split out of the larger WAV files.
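To make the grouping step concrete, here is a minimal sketch of merging word-level annotations into utterances based on the silence between consecutive words. It is not the logic used by the script in this repository: the `Word` and `Utterance` types, the `group_words` function and the 500 ms `max_gap_ms` threshold are all assumptions made for illustration, and the words are assumed to have been read from a single tier beforehand.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    start_ms: int
    end_ms: int
    text: str


@dataclass
class Utterance:
    start_ms: int
    end_ms: int
    text: str


def group_words(words: Iterable[Word], max_gap_ms: int = 500) -> List[Utterance]:
    """Merge consecutive words whose silence gap is at most max_gap_ms.

    The threshold is a hypothetical default, not a value taken from the dataset.
    """
    utterances: List[Utterance] = []
    for word in sorted(words, key=lambda w: w.start_ms):
        last = utterances[-1] if utterances else None
        if last is not None and word.start_ms - last.end_ms <= max_gap_ms:
            # Close enough to the previous word: extend the current utterance.
            last.end_ms = max(last.end_ms, word.end_ms)
            last.text = f"{last.text} {word.text}"
        else:
            # Long pause (or first word): start a new utterance.
            utterances.append(Utterance(word.start_ms, word.end_ms, word.text))
    return utterances


# Made-up word timings for illustration only.
words = [
    Word(0, 300, "arvoisa"),
    Word(320, 800, "puhemies"),
    Word(2500, 2900, "kiitos"),
]
utterances = group_words(words)
for u in utterances:
    print(u.start_ms, u.end_ms, u.text)
```

A pure gap threshold is the simplest possible rule; it ignores punctuation and speaker changes, so it would need to be applied per tier.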
1. Samples have not been verified for overlapping speech or misaligned timestamps.
2. The word grouping logic is quite rough and likely has a fair bit of room for improvement.
3. Hashing of speaker ids (tier ids) is currently quite fragile: differences in encoding and whitespace affect the results.
4. Some speakers have multiple tier ids due to additional prefixes (such as a ministerial portfolio).
5. Due to 3. and 4., the speaker ids (tier ids) of the resulting files likely require manual tweaking.
6. There might be some Finland Swedish mixed into the audio.
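One way to make the hashing of tier ids less sensitive to encoding and whitespace is to normalize the id before hashing it. The following is a sketch under assumptions, not the approach currently used here: `normalize_tier_id` and `speaker_hash` are hypothetical helpers, the choice of SHA-1 truncated to 12 characters is arbitrary, and the names in the example are placeholders. It does not address the prefix issue, which still needs manual handling.

```python
import hashlib
import unicodedata


def normalize_tier_id(tier_id: str) -> str:
    """Return a canonical form of a tier id (hypothetical helper)."""
    # NFC makes precomposed and decomposed forms of "ä"/"ö" compare equal.
    text = unicodedata.normalize("NFC", tier_id)
    # Collapse runs of whitespace and trim both ends.
    text = " ".join(text.split())
    # Ignore case differences.
    return text.casefold()


def speaker_hash(tier_id: str, length: int = 12) -> str:
    """Hash the normalized tier id; the digest length is an arbitrary choice."""
    digest = hashlib.sha1(normalize_tier_id(tier_id).encode("utf-8")).hexdigest()
    return digest[:length]


# Same placeholder name, once with decomposed umlauts and stray spaces,
# once precomposed and lowercased.
a = speaker_hash("Matti  Meika\u0308la\u0308inen ")
b = speaker_hash("matti meik\u00e4l\u00e4inen")
print(a == b)
```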
- Install the `scripts/eaf-word2sentence.py` dependencies:

      $ python -m venv word2sentence
      $ source word2sentence/bin/activate
      $ pip install pympi-ling
- Build `elan2split`, which requires Boost.Filesystem and Xerces-C++. TODO: this could be replaced with a simple iteration step in `scripts/eaf-word2sentence.py`.

      $ git clone https://github.com/vjoki/ELAN2split
      $ cd ELAN2split/
      $ mkdir build/
      $ cd build/
      $ cmake ../
      $ make
- Unpack the dataset.
- Iterate through the `.eaf` files:

      $ source word2sentence/bin/activate
      $ for eaf in 2016-kevat/2016-*/*.eaf; do
          python scripts/eaf-word2sentence.py --file_path "$eaf"
          elan2split --name -o ./eduskunta/ "$eaf"
        done
- Organize the files into directories per speaker with `scripts/organize.sh`.
- Optionally take a subset of the dataset using `scripts/sample_dataset.py`.
- Manually fix any issues with tier ids.
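The TODO in the build step suggests that the splitting could be done in Python instead of with `elan2split`. As a rough idea of what such an iteration step might involve, the sketch below cuts one time range out of a WAV file using only the standard library `wave` module. It does not reproduce the behavior of `elan2split`: `cut_segment` is a hypothetical helper, the file names are placeholders, and the input here is synthetic silence generated only so that the example runs on its own.

```python
import os
import tempfile
import wave


def cut_segment(src_path: str, dst_path: str, start_ms: int, end_ms: int) -> int:
    """Copy the frames in [start_ms, end_ms) from src_path into dst_path.

    Returns the number of frames written. Hypothetical helper.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Convert millisecond timestamps to frame offsets, clamped to the file.
        start_frame = min(start_ms * rate // 1000, total)
        end_frame = min(end_ms * rate // 1000, total)
        n_frames = max(end_frame - start_frame, 0)
        src.setpos(start_frame)
        frames = src.readframes(n_frames)
        with wave.open(dst_path, "wb") as dst:
            # Keep the audio format of the source file.
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)
    return n_frames


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "session.wav")
    dst_path = os.path.join(tmp, "utterance.wav")
    # One second of 16 kHz mono 16-bit silence as a stand-in for a session.
    with wave.open(src_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(16000)
        w.writeframes(b"\x00\x00" * 16000)
    written = cut_segment(src_path, dst_path, 250, 750)
    with wave.open(dst_path, "rb") as r:
        out_frames = r.getnframes()

print(written, out_frames)
```

In a real pipeline the start and end times would come from the grouped annotations of each tier, and the output name would carry the speaker id.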