diar-az

Diarization A to Z - Kaldi to Gecko to Kaldi/corpus and back

There are two goals when automating the diarization process:

  1. Adding to an existing diarization corpus
  2. Using the corpus to train diarization models through Kaldi

CSV file creation was added to Gecko to allow for both anonymous and named speakers. You need to use this version of Gecko, or a fork of it, to create the diarization corpus type specified here.
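
For orientation, rows in this kind of csv pair a recording with a numbered speaker and either a real name or an anonymous label. The rows below are made-up examples of the <audio-filename>,<spk-num>,<speaker label> pattern described under Tasks (the exact columns Gecko exports are an assumption):

    4882718R8,1,Jane Doe
    4882718R8,2,unknown-07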

It is assumed that the corpus data will exist within the data directory specified at the top of the create_corpus.sh bash script.

Background

The corpus has the following format:

                       README.txt
                       reco2spk_num2spk_label.csv
                       rttm/
                       json/
                       segments/
                       wav/
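
For orientation, rttm/ holds standard NIST RTTM lines and segments/ holds Kaldi-style segments files (utterance-id, recording-id, start time, end time). Below is first an rttm line, then a segments line; the recording-id, times and speaker label are made up:

    SPEAKER 4882718R8 1 13.50 4.20 <NA> <NA> spk02 <NA> <NA>

    4882718R8-000001 4882718R8 13.50 17.70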

The Gecko archive should have the following format:

                       corrected_rttm/
                       json/
                       srt/
                       csv/

Installation

Before using the scripts in diar-az you must install the Levenshtein package: pip install python-Levenshtein. If you're using conda environments: conda install python-Levenshtein.
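
To verify the install worked (the name pair is arbitrary), this should print 1:

    python -c 'import Levenshtein; print(Levenshtein.distance("Jon", "John"))'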

Running

In order to run all the steps, do the following: bash create_corpus.sh <gecko archive> <audio directory>

bash create_corpus.sh gecko_files.zip /data/ruv_unprocessed/audio/

If you have multiple versions of an srt, rttm, or json file and you know which one you want to keep or exclude, you can do it with move_dups.sh

bash move_dups.sh filename-without-ext good-or-bad-dir

bash move_dups.sh 4882718R8_* good

To see the filenames:

find data/ -maxdepth 3 -iname '*4882718R8*'

  • Use sort -k3,3 -t, filename.csv to sort by the name column and look for mistakes in longer names
  • Use sort -k3,3 -u -t, filename.csv if you only want one occurrence of each name
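
Sorting by the name column puts near-duplicate spellings next to each other, e.g. (made-up rows):

    4882718R8,3,Jóhanna Jónsdóttir
    4882719R8,2,Jóhanna Jónsdótir

The second name is one edit away from the first, so it is easy to spot and correct.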

Notes

Everything except the speaker-name validation should be done by calling just one script. That script can call other scripts, but the user should only have to call one. So there will possibly be two scripts in total.

You do not need to concern yourself with the wav folder for this project. Assume you'll be working in the directory above the corpus.

I've created bash and python files using gawk, sed, sort -u and, I believe, sox. Create the appropriate folders.

Do not commit any files or information that is specific to this corpus, e.g. names, the corpus README.

Tasks

  • 1. Add audio filenames to rttm files as the second field. See the template file in kaldi-speaker-diarization/master/templates.md for an example. DO NOT put angle brackets around the recording-id/audio filenames. A minimal gawk sketch follows this list.
  • 2. Remove bracketed events ([foreign], [noise], [music]) from rttm files and srt segments. For rttm files that means removing the line, or removing the bracketed portion of a speaker-id such as [foreign]+15. For srt segments that means removing only the segments which don't contain any speech.
  • 3. Rename the rttm/json/srt files themselves to just the audio filename.
  • 4. Also include the command to call create_segments_and_text.py. It might be difficult due to where the resulting files are created; if so, the python file will need to be generalized. Do this and create a pull request.
  • 5. Generate a text file with the updated corpus numbers for the corpus readme. If you know how, also auto-replace the values in the readme.
  • 6. Create a csv file like the one in the corpus: <audio-filename>,<spk-num>,<speaker label>. This involves pairing up all the written names across files and then creating new speaker labels for the speakers. This needs to be done for unknowns too, but they also need to be renamed to the next available numbered unknown.
  • 7. Also create <audio-filename>,<spk-num>,<speaker name>,<speaker label>
  • 8. Allow for 1-3 spelling mistakes in the names, which will then be manually validated and corrected.
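
A minimal gawk sketch of task 1, assuming the rttm files live in rttm/ and there is one file per recording (the recording-id and paths are placeholders; the real field layout is in kaldi-speaker-diarization/master/templates.md):

    # overwrite field 2 of every RTTM line with the recording-id (no angle brackets)
    recid=4882718R8
    gawk -v id="$recid" 'BEGIN { OFS=" " } { $2 = id; print }' rttm/"$recid".rttm > tmp.rttm && mv tmp.rttm rttm/"$recid".rttm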

Possible tasks if the above are done

If you have a Kaldi setup, then run local/make_ruvdi.sh, fix_data_dir and utils/validate_data_dir.sh.
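
A minimal sketch of that sequence, assuming the prepared data directory ends up in data/ruvdi (the path and any arguments to local/make_ruvdi.sh are assumptions; fix_data_dir.sh and validate_data_dir.sh are the standard Kaldi utilities):

    # prepare the data dir, sort/repair it, then check it without feature/text requirements
    local/make_ruvdi.sh
    utils/fix_data_dir.sh data/ruvdi
    utils/validate_data_dir.sh --no-feats --no-text data/ruvdi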

  • 1. Split each week's files into 70/15/15%, with the 70% portion holding the extra audio files (see the sketch after this list).
  • 2. Run the kaldi recipe and split_rttm (I'll need to supply this file). Add them to the callhome_rttm directory.
  • 3. Run the kaldi recipe (kaldi-speaker-diarization/v4) to evaluate the new DER% with the increased data.
  • 4. Create a script which creates new segments based on 2-6 speaker turns which looks like the current corpus but with those new audio files.
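
A minimal sketch of the 70/15/15 split in item 1, assuming one plain-text file list per week (the list names are hypothetical). Integer-division leftovers go to the 70% portion, matching the extra files ending up there:

    # shuffle the week's list, carve off 15% + 15%, and give the remainder to the 70% portion
    shuf week01_files.txt > shuffled.txt
    n=$(wc -l < shuffled.txt)
    dev_n=$(( n * 15 / 100 ))
    eval_n=$(( n * 15 / 100 ))
    train_n=$(( n - dev_n - eval_n ))
    head -n "$train_n" shuffled.txt > train.txt
    tail -n +"$(( train_n + 1 ))" shuffled.txt | head -n "$dev_n" > dev.txt
    tail -n +"$(( train_n + dev_n + 1 ))" shuffled.txt > eval.txt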

TODO

  • in rttm files, identify spk_ids like 001 Jane Doe, 1.[noise], + and crosstalk
  • preserve existing speaker labels
  • check for rttm files with unspecified channel numbers

License

This project is licensed under Apache 2.0.

Acknowledgements

This project was funded by the Icelandic Directorate of Labour's student summer job program in 2020.