Description
I needed to generate subtitles for some of my videos, so I tried the pyTranscriber project. The experience was not great. The main issues are:
- Excessive memory usage, often exceeding 2GB.
- Unstable and prone to crashing. This is the most frustrating part: after running for an hour it crashes, and all the conversion progress is lost.
- Too slow (often taking 2-3 hours per video).
Therefore, I created goTranscriber. It is not meant to replace pyTranscriber; initially it was just for my own use. After developing it, I have a few suggestions regarding performance:
For speed:
- I used Go because I like programming in Go. For pyTranscriber, replacing the multiprocessing module with asynchronous processing should significantly improve both stability and speed.
- Avoid FLAC and use the PCM S16LE format at 16 kHz instead. Google's Speech-to-Text API also supports PCM at 16 kHz. Extracting PCM is much faster than encoding FLAC, and in my tests FLAC did not improve results much anyway.
- Avoid loading the entire audio file into memory at once. Instead, load lazily: read data only when submitting it to the API or during recognition. This can significantly reduce memory usage.
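To illustrate the asynchronous and lazy-loading suggestions, here is a minimal Python sketch. It is not code from pyTranscriber or goTranscriber. The names `iter_pcm_chunks`, `transcribe_lazily`, and the `recognize` coroutine (a stand-in for whatever actually calls the speech API) are hypothetical, and the input is assumed to be a WAV file.

```python
import asyncio
import wave
from typing import Awaitable, Callable, Iterator, List


def iter_pcm_chunks(path: str, chunk_seconds: float = 10.0) -> Iterator[bytes]:
    """Yield raw PCM chunks one at a time instead of reading the whole file."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = int(wav.getframerate() * chunk_seconds)
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data


async def transcribe_lazily(
    path: str,
    recognize: Callable[[bytes], Awaitable[str]],  # hypothetical API wrapper
    chunk_seconds: float = 10.0,
    max_in_flight: int = 4,
) -> List[str]:
    """Submit chunks concurrently while keeping only a few in memory."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def submit(chunk: bytes) -> str:
        try:
            return await recognize(chunk)
        finally:
            semaphore.release()

    tasks = []
    for chunk in iter_pcm_chunks(path, chunk_seconds):
        # Wait for a free slot before starting another request, so at most
        # max_in_flight chunks are being recognized (plus the one just read).
        await semaphore.acquire()
        tasks.append(asyncio.create_task(submit(chunk)))
    # gather returns the results in submission order.
    return await asyncio.gather(*tasks)
```

Calling `asyncio.run(transcribe_lazily("audio.wav", recognize))` would return one result per chunk, in order. The file read itself is still blocking here; for short chunks that is probably acceptable, but it is a simplification.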
After replacing FLAC with the PCM S16LE format, processing a 2-hour video typically takes only 15-30 minutes.
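For reference, extracting 16 kHz mono PCM S16LE with ffmpeg might look like the sketch below. The helper names are made up for illustration, and it assumes `ffmpeg` is on the PATH; it does not claim to be how either project invokes ffmpeg. The size helper also shows why lazy loading matters: two hours of this format is about 230 MB of raw samples.

```python
import subprocess
from typing import List

SAMPLE_RATE = 16000   # Hz
BYTES_PER_SAMPLE = 2  # S16LE = signed 16-bit little-endian
CHANNELS = 1          # mono


def build_extract_command(video_path: str, wav_path: str) -> List[str]:
    """Build an ffmpeg command that writes 16 kHz mono PCM S16LE as WAV."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", video_path,
        "-vn",                   # drop the video stream
        "-ac", str(CHANNELS),
        "-ar", str(SAMPLE_RATE),
        "-acodec", "pcm_s16le",
        "-f", "wav",
        wav_path,
    ]


def extract_audio(video_path: str, wav_path: str) -> None:
    """Run the extraction; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_extract_command(video_path, wav_path), check=True)


def pcm_size_bytes(duration_seconds: float) -> int:
    """Size of the raw PCM data for a given duration in this format."""
    return int(duration_seconds * SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS)
```

With these constants, `pcm_size_bytes(2 * 60 * 60)` gives 230,400,000 bytes for a 2-hour video.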
For the accuracy of speech region detection:
Currently, pyTranscriber uses a simple RMS calculation method to determine sound intensity and identify speech regions.
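As a generic illustration of RMS thresholding (not pyTranscriber's actual code), the sketch below computes the RMS of 16-bit PCM frames and marks the ones above a threshold. The function names and the threshold value are assumptions, and each frame is assumed to contain a whole number of 16-bit samples.

```python
import array
import math
import sys
from typing import List


def frame_rms(frame: bytes) -> float:
    """Root mean square of a frame of signed 16-bit little-endian samples."""
    samples = array.array("h")
    samples.frombytes(frame)
    if sys.byteorder == "big":
        samples.byteswap()  # the data is little-endian
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def loud_frames(pcm: bytes, frame_bytes: int, threshold: float) -> List[bool]:
    """Mark each complete frame whose RMS exceeds the threshold."""
    return [
        frame_rms(pcm[start:start + frame_bytes]) > threshold
        for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes)
    ]
```

This measures loudness only, so music or background noise above the threshold would be marked the same way as speech.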
However, during development, I found that recognizing speech regions is actually quite complex.
A better approach is to use WebRTC VAD. Although I haven't directly compared the two methods, WebRTC VAD takes more factors into account and could theoretically improve accuracy to some extent.
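A rough sketch of how WebRTC VAD could be used to find speech regions, assuming the third-party `webrtcvad` Python package and 16 kHz mono 16-bit PCM input. The helper names and the 30 ms frame length are my own choices for illustration, and the classifier is passed in as a function so the grouping logic can be tried without the package installed.

```python
from typing import Callable, Iterator, List, Tuple

SAMPLE_RATE = 16000  # Hz; WebRTC VAD accepts 8000, 16000, 32000 or 48000
FRAME_MS = 30        # WebRTC VAD accepts 10, 20 or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono samples


def split_frames(pcm: bytes) -> Iterator[bytes]:
    """Yield complete fixed-size frames; a trailing partial frame is dropped."""
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[start:start + FRAME_BYTES]


def speech_regions(
    pcm: bytes, is_speech: Callable[[bytes], bool]
) -> List[Tuple[float, float]]:
    """Group consecutive speech frames into (start, end) times in seconds."""
    regions = []
    region_start = None
    index = 0
    for index, frame in enumerate(split_frames(pcm)):
        t = index * FRAME_MS / 1000
        if is_speech(frame):
            if region_start is None:
                region_start = t
        elif region_start is not None:
            regions.append((region_start, t))
            region_start = None
    if region_start is not None:
        # Audio ended while still inside a speech region.
        regions.append((region_start, (index + 1) * FRAME_MS / 1000))
    return regions


def webrtc_classifier(aggressiveness: int = 2) -> Callable[[bytes], bool]:
    """Build a frame classifier backed by webrtcvad (modes 0 to 3)."""
    import webrtcvad  # third-party: pip install webrtcvad

    vad = webrtcvad.Vad(aggressiveness)
    return lambda frame: vad.is_speech(frame, SAMPLE_RATE)
```

Usage would be `speech_regions(pcm, webrtc_classifier())`. In practice some smoothing is probably needed as well (for example, merging regions separated by very short gaps), which this sketch leaves out.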