This project applies state-of-the-art sentence alignment tools to subtitle extraction and alignment, achieving a substantial improvement in subtitle alignment quality. Leveraging sentence embeddings, dynamic programming, cosine similarity, and partitioning, we attained F1 scores exceeding 93% and estimate an overall improvement of 31% relative to other subtitle alignment techniques.
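
To illustrate how cosine similarity and dynamic programming can combine in an aligner, here is a minimal sketch. It is a simplified one-to-one aligner with skips, not the algorithm this project actually runs. The function names, the toy vectors, and the `skip_penalty` value are all assumptions made for the example. A real pipeline would use sentence embeddings from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs, skip_penalty=0.3):
    """Monotonic 1-1 alignment (with skips) that maximizes total similarity.

    skip_penalty is a hypothetical cost for leaving a line unaligned.
    Returns a list of (source_index, target_index) pairs.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    # score[i][j]: best total for the first i source and first j target lines.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                sim = cosine(src_vecs[i - 1], tgt_vecs[j - 1])
                options.append((score[i - 1][j - 1] + sim, "match"))
            if i > 0:
                options.append((score[i - 1][j] - skip_penalty, "skip_src"))
            if j > 0:
                options.append((score[i][j - 1] - skip_penalty, "skip_tgt"))
            score[i][j], back[i][j] = max(options)

    # Walk the back-pointers from the end to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy 2-d "embeddings": the middle source line has no good counterpart.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[1.0, 0.1], [1.0, 1.0]]
print(align(src, tgt))
```

This sketch fills the full `n x m` table, so it is quadratic in the number of lines. Partitioning the subtitles into smaller chunks, as mentioned above, is one way to keep that cost manageable.
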
There are gold alignments for 5 titles in the `gold` directory. The alignments can be found within each subdirectory with names like `eng-spa-gold.txt` and `eng-ger-gold.txt`. The subtitles themselves are in the sub-subdirectories `eng`, `spa`, `ger`, etc.
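
Gold alignments make it possible to score a system's output with precision, recall, and F1. The sketch below assumes strict matching, where a predicted alignment counts only if it is identical to a gold alignment. It also assumes each alignment is already parsed into a pair of index tuples. The on-disk format of the gold files is not covered here, and both the `alignment_f1` name and this in-memory representation are hypothetical.

```python
def alignment_f1(predicted, gold):
    """Return (precision, recall, F1) under strict alignment matching.

    Each alignment is a hypothetical (source_indices, target_indices) pair,
    e.g. ((1, 2), (1,)) for source lines 1-2 aligned to target line 1.
    """
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0, 0.0, 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


predicted = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))]
gold = [((0,), (0,)), ((1, 2), (1,)), ((3,), (3,)), ((4,), (4,))]
precision, recall, f1 = alignment_f1(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Other scoring variants exist, such as giving partial credit for overlapping alignments, and they would produce different numbers from the strict version shown here.
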
An annotation tool is included, implemented in Python with curses. After you run `scripts/run_vecalign.py` on the title you want to annotate, the tool loads the alignments generated by that script into a Vim-like editor where you can approve or edit them. The tool supports the following operations:
Key | Action |
---|---|
d | Delete current alignment. |
e | Edit current alignment. Will open the current alignment in Vim. |
u | Union (merge) current subtitle with the following subtitle. |
s | Split alignment into two. This duplicates the current alignment so that you can edit both it and the duplicate that follows. Ideal for splitting alignments when multiple sentences have been merged together. |
w | Write (save) all alignments including those that have not yet been reviewed. |
n | Move to Next alignment. |
p | Move to Previous alignment. |
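
The delete, union, and split operations can be pictured as simple list edits. The sketch below is not the tool's actual code. It assumes each alignment is a `(source_lines, target_lines)` pair held in a Python list, and it reads "union" as merging the current alignment with the one that follows. The function names and the data layout are hypothetical.

```python
def delete(alignments, i):
    """Remove the alignment at position i."""
    return alignments[:i] + alignments[i + 1:]


def union(alignments, i):
    """Merge alignment i with the one that follows it (no-op on the last)."""
    if i + 1 >= len(alignments):
        return list(alignments)
    (src_a, tgt_a), (src_b, tgt_b) = alignments[i], alignments[i + 1]
    merged = (src_a + src_b, tgt_a + tgt_b)
    return alignments[:i] + [merged] + alignments[i + 2:]


def split(alignments, i):
    """Duplicate alignment i so each copy can be edited down separately."""
    src, tgt = alignments[i]
    duplicate = (list(src), list(tgt))  # copy so edits do not alias
    return alignments[:i + 1] + [duplicate] + alignments[i + 1:]


alignments = [
    (["Hi."], ["Hola."]),
    (["How are"], ["Como"]),
    (["you?"], ["estas?"]),
]
print(union(alignments, 1))
print(len(split(alignments, 0)))
```

Each function returns a new list instead of mutating its input, which would make an undo feature straightforward to add in a sketch like this.
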
