Thanks a lot for the tool! substudy had some major pain points that ruined 80% of stuff I tried to extract (duplicate timestamps in .ass files extracted by ffmpeg). The thing I spend the most time on now is rewriting filenames.
Currently I'm rewriting each file as filename.mkv.jp.srt and then running:
It would be great to have some automation here. I think this could be done with NLP keyword extraction; example libraries are retext-keywords or, for Go, RAKE. I doubt these keyword extractors are good at picking out episode numbers, though, so they may need some tweaking. I've tried this kind of thing in the past using regex heuristics, but they get crazy and buggy because of edge cases. Maybe that's fine, though.
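To make the regex-heuristic idea concrete, here is a minimal sketch in Python of guessing an episode number from a filename. The pattern list and the `guess_episode` name are hypothetical illustrations, not part of this tool or of any keyword-extraction library, and they will miss plenty of real-world naming schemes.

```python
import re
from typing import Optional

# Hypothetical patterns, tried in order from most to least specific.
EPISODE_PATTERNS = [
    re.compile(r"[Ss](\d{1,2})[Ee](\d{1,3})"),            # S01E02
    re.compile(r"[Ee]p(?:isode)?[ ._]?(\d{1,3})\b"),      # Ep02, Episode 12
    re.compile(r"(?:^|[ ._\-])(\d{1,3})(?=[ ._\-]|$)"),   # bare number, e.g. " - 07."
]

def guess_episode(name: str) -> Optional[int]:
    """Return the first episode-like number found in name, or None."""
    for pattern in EPISODE_PATTERNS:
        match = pattern.search(name)
        if match:
            # The episode number is always the last capture group.
            return int(match.groups()[-1])
    return None
```

The bare-number pattern is the fragile one: it only accepts one to three digits delimited by separators, so a token like `1080p` is skipped, but a year or a season number written on its own would still be picked up.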
Also, some pre-parsing is needed because brackets in filenames break the sound field in Anki. Escaping them in Anki doesn't do anything, and removing them causes Anki to do a fuzzy search, which can take 5+ seconds to find the file when there is a large amount of media. The stuff inside brackets is almost always extraneous data that can be ignored.
Regex to remove everything within brackets:
/ *\[[^\]]*\] */g
Regex to remove everything within brackets, parens, and quotes:
/[ _\.\-]*(\[[^\]]*\]|\([^\)]*\)|'[^']*'|"[^"]*")[ _\-]*/g
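As a rough illustration of how the pre-parsing could look, the sketch below ports the pattern to Python, with the double-quoted alternative closed by a matching quote. The `sanitize_filename` name is hypothetical. Replacing each match with a single space and then collapsing whitespace is an assumption made here so that neighbouring words do not get glued together; it is not something the tool currently does.

```python
import os
import re

# Python port of the pattern above: bracketed, parenthesized, or quoted
# text plus any separators directly around it.
BRACKETED = re.compile(
    r"""[ _.\-]*(\[[^\]]*\]|\([^)]*\)|'[^']*'|"[^"]*")[ _\-]*"""
)

def sanitize_filename(filename: str) -> str:
    """Strip bracketed/quoted segments from the name, keeping the extension."""
    stem, ext = os.path.splitext(filename)
    # Replace each match with a space so adjacent words stay separated.
    cleaned = BRACKETED.sub(" ", stem)
    # Collapse whitespace runs and trim both ends.
    cleaned = " ".join(cleaned.split())
    return cleaned + ext

print(sanitize_filename("[Group] Show Name - 01 (1080p) [ABCD1234].mkv"))
```

One caveat with the single-quote alternative: a title containing two apostrophes (for example two contractions) would lose everything between them, so that branch may be worth dropping if such titles are common.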
Sanitizing the output filenames is indeed a good idea. I'll take it into consideration when I have some time to spare on the tool again, unless you'd like to try a patch before then. 😊