Subtitle fuzzying using keyword extraction? #1

setreadygo · 2019-10-22T20:20:50Z

Thanks a lot for the tool! substudy had some major pain points that ruined 80% of stuff I tried to extract (duplicate timestamps in .ass files extracted by ffmpeg). The thing I spend the most time on now is rewriting filenames.

Currently I'm renaming each subtitle file to filename.mkv.jp.srt by hand and then running:

ls -1 *.{mkv,avi,mp4} | parallel -j4 'ffmpeg -i {} {}.en.ass'
ls -1 *.{mp4,avi,mkv} | parallel -j16 'bunkai extract cards -m {} {}.jp.* {}.en.*'

It would be great to have some automation here. I think this could be done with NLP keyword extraction. Example libraries are retext-keywords, or RAKE for Go. I doubt these keyword extractors are good at picking out episode numbers, though, so they may need some tweaking. I've tried this kind of thing in the past with regex heuristics, but they get crazy and buggy because of edge cases. Maybe that's fine though.
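To make the idea concrete, here is a rough sketch of the simpler heuristic route, not RAKE or retext-keywords: compare the word tokens of each filename and require a shared number. The function names `split_name` and `best_subtitle` and the scoring weights are made up for illustration and are not part of bunkai.

```python
import re

def split_name(name):
    """Split a filename into lowercase word tokens and integer tokens."""
    stem = name.rsplit(".", 1)[0]                      # drop the last extension
    stem = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", stem)  # ignore [group] and (tag) parts
    parts = re.findall(r"[A-Za-z]+|\d+", stem)
    words = {p.lower() for p in parts if p.isalpha()}
    numbers = {int(p) for p in parts if p.isdigit()}   # "03" and "3" compare equal
    return words, numbers

def best_subtitle(video, subtitles):
    """Return the subtitle filename that best matches the video, or None."""
    v_words, v_nums = split_name(video)
    best, best_score = None, 0.0
    for sub in subtitles:
        s_words, s_nums = split_name(sub)
        if v_nums and s_nums and not (v_nums & s_nums):
            continue  # both names carry numbers but none agree: probably another episode
        union = v_words | s_words
        score = len(v_words & s_words) / len(union) if union else 0.0
        if v_nums & s_nums:
            score += 1.0  # assumed weight: a shared number outranks word overlap
        if score > best_score:
            best, best_score = sub, score
    return best

subs = ["My.Show.S01E02.ja.srt", "My.Show.S01E03.ja.srt", "Other Show 03.srt"]
print(best_subtitle("[Group] My Show - 03 (1080p).mkv", subs))
```

This has the edge cases mentioned above: a season number can be mistaken for an episode number (episode 01 would also accept S01E02), so it is only a starting point.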

Also, some pre-parsing is needed because brackets in filenames break the sound field in Anki. Escaping them in Anki doesn't do anything, and removing them makes Anki fall back to a fuzzy search, which can take 5+ seconds to find the file when there is a lot of media. The stuff inside brackets is almost always extraneous data that can be ignored. Regex to remove bracketed parts: / *\[[^\]]*\] */g

Regex to remove everything within brackets, parens, and quotes:
/[ _\.\-]*(\[[^\]]*\]|\([^\)]*\)|'[^']*'|"[^"]*")[ _\-]*/g
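For illustration, the regex above could be applied roughly like this. This is a sketch in Python, not bunkai code; `sanitize` is a hypothetical helper. It assumes the double-quote alternative is meant to include the closing quote, that each match is replaced by a single space rather than deleted (so neighbouring words don't get glued together), and that the extension should be left alone.

```python
import os
import re

# Same alternatives as the regex above: [..], (..), '..' and "..",
# plus the separators around them.
TAGS = re.compile(
    r"[ _.\-]*"
    r"""(\[[^\]]*\]|\([^)]*\)|'[^']*'|"[^"]*")"""
    r"[ _\-]*"
)

def sanitize(filename):
    """Strip bracketed, parenthesized, and quoted parts from a filename."""
    stem, ext = os.path.splitext(filename)   # keep ".mkv" etc. out of the match
    cleaned = TAGS.sub(" ", stem)            # assumption: replace with one space
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    if not cleaned:                          # the name was nothing but tags: keep it
        cleaned = stem
    return cleaned + ext

print(sanitize("[Group] My Show - 01 (1080p) [ABCD1234].mkv"))
```

Names without any tags pass through unchanged. A title that contains a pair of apostrophes would lose the text between them, so the quote alternatives may be worth dropping depending on the collection.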

ustuehler · 2019-11-08T14:41:36Z

Sanitizing the output filenames is a good idea indeed. I would take it into consideration when I've got some time to spare on the tool again, unless you could try a patch before that. 😊
