Preparing Plenary Sessions of the Parliament of Finland dataset for use in speaker verification

The Plenary Sessions of the Parliament of Finland dataset is a sizable corpus of transcribed Finnish audio. The transcriptions consist of word-aligned annotations in EAF files. To convert them into a form more convenient for speaker verification tasks, the annotations need to be grouped into longer utterances, and those utterances split out of the larger WAV files.
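As an illustration of the grouping step, the following is a minimal sketch and not the logic used by scripts/eaf-word2sentence.py. It assumes each word annotation is a `(start_ms, end_ms, text)` tuple, which is the shape pympi's `Eaf.get_annotation_data_for_tier` is expected to return for time-aligned tiers. The function name `group_words` and the 500 ms gap threshold are hypothetical choices.

```python
from typing import List, Tuple

Word = Tuple[int, int, str]  # (start_ms, end_ms, text)


def group_words(words: List[Word], max_gap_ms: int = 500) -> List[Word]:
    """Merge consecutive word annotations into longer segments.

    A new segment starts whenever the pause between two words exceeds
    max_gap_ms. The default threshold is an illustrative assumption.
    """
    groups: List[Word] = []
    for start, end, text in sorted(words):
        if groups and start - groups[-1][1] <= max_gap_ms:
            # Close enough to the previous word: extend the current segment.
            prev_start, prev_end, prev_text = groups[-1]
            groups[-1] = (prev_start, max(prev_end, end), prev_text + " " + text)
        else:
            # Long pause (or first word): start a new segment.
            groups.append((start, end, text))
    return groups
```

A gap-based rule like this is simple but ignores sentence boundaries and speaker changes, so the threshold would need tuning against real sessions.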

Caveats of the described process

1. Samples have not been verified for overlapping speech or misaligned timestamps.
2. The word grouping logic is quite rough and likely has a fair bit of room for improvement.
3. Hashing of speaker ids (tier ids) is currently quite fragile: differences in encoding and whitespace change the result.
4. Some speakers have multiple tier ids due to additional prefixes (ministerial portfolio).
5. Due to 3 and 4, the speaker ids (tier ids) of the resulting files likely require manual tweaking.
6. There might be some Finland Swedish mixed into the audio.
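The caveat about fragile tier id hashing could be reduced by normalizing ids before hashing. The sketch below is one possible approach, not what the scripts currently do: the function names, the normalization rules (Unicode NFC, whitespace collapsing, case folding), and the use of a truncated SHA-1 digest are all assumptions. It does not address the ministerial prefixes, which would still need manual handling.

```python
import hashlib
import unicodedata


def normalize_tier_id(tier_id: str) -> str:
    """Reduce spelling variation before hashing (illustrative rules)."""
    # Compose characters such as 'a' + combining diaeresis into a single 'ä'.
    text = unicodedata.normalize("NFC", tier_id)
    # Trim the ends and collapse runs of whitespace into single spaces.
    text = " ".join(text.split())
    return text.casefold()


def speaker_hash(tier_id: str, length: int = 12) -> str:
    """Return a short hex digest of the normalized tier id."""
    digest = hashlib.sha1(normalize_tier_id(tier_id).encode("utf-8"))
    return digest.hexdigest()[:length]
```

With this, a made-up id written as `"Maija  Mäkelä "` and as `"maija mäkelä"` with decomposed diacritics map to the same hash.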

1. Preparations

Install the dependencies of scripts/eaf-word2sentence.py:

$ python -m venv word2sentence
$ source word2sentence/bin/activate
$ pip install pympi-ling

Build elan2split, which requires Boost.Filesystem and Xerces-C++ (TODO: this could be replaced with a simple iteration step in scripts/eaf-word2sentence.py):

$ git clone https://github.com/vjoki/ELAN2split
$ cd ELAN2split/
$ mkdir build/
$ cd build/
$ cmake ../
$ make
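The TODO above suggests that elan2split could be replaced by an iteration step in Python. As a rough sketch of what cutting a single utterance might look like, the following uses only the standard-library `wave` module. The function name `cut_segment` and its millisecond-based interface are hypothetical, and it assumes uncompressed PCM WAV input.

```python
import wave


def cut_segment(src_path: str, dst_path: str, start_ms: int, end_ms: int) -> None:
    """Copy the audio between start_ms and end_ms into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Convert millisecond offsets to frame indices, clamped to the file.
        first = min(start_ms * rate // 1000, total)
        last = min(end_ms * rate // 1000, total)
        src.setpos(first)
        frames = src.readframes(max(last - first, 0))
    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width and rate from the source;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
```

Calling this once per grouped annotation, with the annotation's start and end times, would yield one WAV file per utterance.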

2. Converting the per-word EAF annotations into longer groups of words

Unpack the dataset.
Iterate through the .eaf files:

$ source word2sentence/bin/activate
$ for eaf in 2016-kevat/2016-*/*.eaf; do
    python scripts/eaf-word2sentence.py --file_path "$eaf"
    elan2split --name -o ./eduskunta/ "$eaf"
  done

Organize the files into directories per speaker with scripts/organize.sh.
Optionally take a subset of the dataset using scripts/sample_dataset.py.
Manually fix any issues with tier ids.
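For the optional subset step, a reproducible speaker-level sample can be sketched as follows. This is not the implementation of scripts/sample_dataset.py; it assumes a hypothetical layout with one directory per speaker under a common root, and the function name and parameters are made up for illustration.

```python
import random
from pathlib import Path
from typing import List


def sample_speakers(root: str, n_speakers: int, seed: int = 0) -> List[Path]:
    """Pick up to n_speakers speaker directories, reproducibly."""
    # Sort first so the result depends only on the seed, not on listing order.
    speakers = sorted(p for p in Path(root).iterdir() if p.is_dir())
    rng = random.Random(seed)
    return sorted(rng.sample(speakers, min(n_speakers, len(speakers))))
```

Sampling whole speakers rather than individual files keeps every selected speaker's utterances together, which is usually what a speaker verification split needs.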
