Fuzzy subtitle matching using keyword extraction? #1

Open
setreadygo opened this issue Oct 22, 2019 · 1 comment

setreadygo commented Oct 22, 2019

Thanks a lot for the tool! substudy had some major pain points that ruined about 80% of what I tried to extract (duplicate timestamps in .ass files extracted by ffmpeg). What I spend the most time on now is renaming files.

Currently I'm renaming each subtitle file to filename.mkv.jp.srt and then running:

ls -1 *.{mkv,avi,mp4} | parallel -j4 'ffmpeg -i {} {}.en.ass'
ls -1 *.{mp4,avi,mkv} | parallel -j16 'bunkai extract cards -m {} {}.jp.* {}.en.*'

It would be great to have some automation here. I think this could be done with NLP keyword extraction. Example libraries are retext-keywords or, for Go, RAKE. I doubt these keyword extractors are good at picking out episode numbers, though, so they may need some tweaking. I've tried this kind of thing in the past with regex heuristics, but they get complicated and buggy because of edge cases. Maybe that's fine, though.
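To make the idea concrete, here is a minimal sketch of pairing a video with its subtitle file by episode number and shared title words. It is not retext-keywords or RAKE; it swaps in a plain token-overlap (Jaccard) score, and every name in it (`episode_number`, `keywords`, `best_match`, the `IGNORED` list, the episode patterns) is hypothetical and not part of bunkai.

```python
import re
from typing import Iterable, Optional

# Bracketed or parenthesised release tags such as "[Group]" or "(1080p)".
TAGS = re.compile(r"\[[^\]]*\]|\([^)]*\)")

# Assumed episode heuristics, tried in order; real release names need more.
EPISODE_PATTERNS = [
    re.compile(r"[Ss]\d{1,2}[Ee](\d{1,3})"),                        # S01E05
    re.compile(r"(?<![A-Za-z])(?:[Ee][Pp]?|#)\s*(\d{1,3})(?!\d)"),  # Ep05, E05, #05
    re.compile(r"(?<![\dA-Za-z])(\d{1,3})(?![\dA-Za-z])"),          # bare " 05 "
]

# Tokens that carry no title information (extensions, language codes).
IGNORED = {"mkv", "avi", "mp4", "srt", "ass", "en", "jp"}


def episode_number(name: str) -> Optional[int]:
    """Return the first episode number found outside bracketed tags."""
    stripped = TAGS.sub(" ", name)
    for pattern in EPISODE_PATTERNS:
        match = pattern.search(stripped)
        if match:
            return int(match.group(1))
    return None


def keywords(name: str) -> set:
    """Lowercased alphabetic tokens of the name, minus tags and noise."""
    words = re.findall(r"[a-z]+", TAGS.sub(" ", name).lower())
    return {w for w in words if len(w) > 1 and w not in IGNORED}


def best_match(video: str, subtitles: Iterable[str]) -> Optional[str]:
    """Pick the subtitle with the same episode number and most shared words."""
    episode = episode_number(video)
    wanted = keywords(video)
    best, best_score = None, 0.0
    for sub in subtitles:
        if episode_number(sub) != episode:
            continue
        found = keywords(sub)
        union = wanted | found
        score = len(wanted & found) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = sub, score
    return best


subs = ["Some.Show.S01E04.srt", "Some.Show.S01E05.srt", "Other Show 05.srt"]
print(best_match("[Group] Some Show - 05 (1080p).mkv", subs))
```

A title that itself contains a number (say, a sequel) would be misread as an episode by the bare-number pattern, which is the kind of edge case mentioned above.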

Also, some pre-parsing is needed because brackets in filenames break the sound field in Anki. Escaping them in Anki doesn't do anything, and removing them causes Anki to do a fuzzy search, which can take 5+ seconds to find the file when there is a large amount of media. The stuff inside brackets is almost always extraneous data that can be ignored. Regex to remove bracketed groups: / *\[[^\]]*\] */g

Regex to remove everything within brackets, parens, and quotes:
/[ _\.\-]*(\[[^\]]*\]|\([^\)]*\)|'[^']*'|"[^"]*")[ _\-]*/g
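As a rough illustration of how that pattern could be applied before a filename reaches Anki, here is a Python sketch. The helper name `sanitize_filename` is hypothetical, the double-quote branch is written with a closing quote (an assumption about the intent), and replacing each match with a single space rather than nothing is also an assumption, made so that neighbouring words do not run together.

```python
import re

# Python translation of the pattern above. The trailing class leaves "."
# alone, so the dot before the file extension survives.
BRACKETED = re.compile(
    r"""[ _.\-]*(\[[^\]]*\]|\([^)]*\)|'[^']*'|"[^"]*")[ _\-]*"""
)


def sanitize_filename(name: str) -> str:
    """Drop bracketed, parenthesised, and quoted groups from a filename."""
    cleaned = BRACKETED.sub(" ", name)              # assumed replacement: one space
    cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse leftover whitespace
    return re.sub(r" \.", ".", cleaned)             # no space before the extension


print(sanitize_filename("[Group] Show - 01 (1080p) [ABCD1234].mkv"))
# prints: Show - 01.mkv
```

Because the leading separator class includes `-`, a dash that sits directly before a tag is removed along with it, so titles that rely on a dash there would need a narrower pattern.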

@ustuehler
Owner

Sanitizing the output filenames is indeed a good idea. I'll take it into consideration when I've got some time to spare for the tool again, unless you'd like to try a patch before then. 😊
