I wanted to see how good (or not) Whisper is, both in terms of AIQ and ease of use. Whisper is OpenAI's newly released, open-sourced ASR (automatic speech recognition) implementation. ✌️
I decided to use Sam's TWIML AI Podcast as the test bed. 👌
There are a few steps to get this going:
- You need to install all the dependencies
- If using a GPU, make sure it is properly configured for your OS
- You need to install whisper
- We download all the episodes from YouTube (where the podcast is published) and save them as mp3 files (see the first sketch after this list)
- We use whisper to run through each of these episodes and transcribe them (see the second sketch after this list), saving three files for each episode:
  - Text file - This contains the STT (speech-to-text) transcription
  - VTT file - This is a WebVTT (Web Video Text Tracks) file, also known as WebSRT; a time-indexed format used for synchronized video caption playback
  - SRT file - This is a SubRip Subtitle file; essentially subtitle entries, each with start and end timestamps and a sequential subtitle number
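To illustrate the download step, here is a minimal sketch using yt-dlp with ffmpeg to pull the audio and convert it to mp3. The channel URL and output folder are placeholders, and this is not necessarily the exact approach used in the checked-in code.

```python
# Minimal sketch of the download step, assuming yt-dlp and ffmpeg are installed
# (pip install yt-dlp). The URL and output folder below are placeholders.
import yt_dlp

PLAYLIST_URL = "https://www.youtube.com/@twimlai"  # hypothetical; point at the actual TWIML channel/playlist
OUTPUT_DIR = "twiml-episodes"

ydl_opts = {
    "format": "bestaudio/best",            # grab the best available audio stream
    "outtmpl": f"{OUTPUT_DIR}/%(title)s.%(ext)s",
    "ignoreerrors": True,                  # skip episodes that fail instead of aborting the run
    "postprocessors": [{
        "key": "FFmpegExtractAudio",       # convert the downloaded audio to mp3
        "preferredcodec": "mp3",
    }],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([PLAYLIST_URL])
```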
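And a sketch of the transcription step: load the base model, transcribe one mp3, and write out the .txt, .vtt and .srt files from the returned segments. The file names and the timestamp helper are mine for illustration; whisper also ships its own writer utilities that the real code may use instead.

```python
# Sketch of the transcription step, assuming openai-whisper is installed
# (pip install openai-whisper). Output formatting is hand-rolled here for clarity.
import whisper

def fmt_ts(seconds: float, sep: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm (sep is ',' for SRT, '.' for VTT)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{ms:03d}"

model = whisper.load_model("base")            # the transcripts in this repo used the base model
result = model.transcribe("episode-001.mp3")  # placeholder file name

# Plain text transcript
with open("episode-001.txt", "w", encoding="utf-8") as f:
    f.write(result["text"].strip() + "\n")

# WebVTT captions
with open("episode-001.vtt", "w", encoding="utf-8") as f:
    f.write("WEBVTT\n\n")
    for seg in result["segments"]:
        f.write(f"{fmt_ts(seg['start'], '.')} --> {fmt_ts(seg['end'], '.')}\n")
        f.write(seg["text"].strip() + "\n\n")

# SubRip subtitles
with open("episode-001.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{fmt_ts(seg['start'], ',')} --> {fmt_ts(seg['end'], ',')}\n")
        f.write(seg["text"].strip() + "\n\n")
```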
If you just want the transcribed files: at the time of writing there were 547 published episodes, all of which I have transcribed. These were done using the base model from Whisper and can be found in 📁 twiml-episodes-whisper-transcribed.
- You can download all of the files as one zip file too -- 🗃️ twiml-episodes-whispered-transcribed.zip
A few things are needed to get whisper deployed and running locally. The first is a local GPU that supports CUDA. At a high level, the OS doesn't matter as long as CUDA support is there; in my case I ran this on WSL2 with Ubuntu.
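As a quick sanity check that the GPU is actually visible from inside WSL2/Ubuntu before kicking off long transcription runs, something like this (PyTorch is installed as a whisper dependency) should report a CUDA device:

```python
# Quick sanity check that CUDA is visible to PyTorch (whisper's backend)
# from inside WSL2/Ubuntu before starting long transcription runs.
import torch

if torch.cuda.is_available():
    print("CUDA OK:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device found - whisper will fall back to (much slower) CPU inference")
```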
✍️ The code is already checked in and hopefully a lot of it is self-explanatory; if you need details, check out the blog post: https://blog.desigeek.com/post/2023/02/openai-whisper-overview/