This repository contains a Python script, match_transcript.py, that automatically:
- Extracts audio from video files,
- Transcribes both video-extracted audio and standalone audio files using whisper.cpp,
- Compares transcriptions to find overlapping spoken content (even if audio is longer than video),
- Copies matched video and audio files into organized folders.
This approach is useful when you have a folder of video recordings and a separate folder of audio recordings and need to figure out which audio files (possibly very long) overlap with which video files.
- Automatic Audio Extraction: Uses FFmpeg to create a `.wav` (16 kHz, mono) from each video file.
- Speech-to-Text with Whisper: Calls `whisper.cpp` to transcribe `.wav` files, skipping re-transcription if a `.txt` file already exists.
- Partial Overlap Support: Incorporates a fuzzy matching approach (or optional chunk-based approach) so a long audio file can match multiple short video clips if content overlaps.
- Many-to-One and One-to-Many: A single audio file can match multiple video files; the same audio file is copied into each match folder.
- Cleanup: Removes intermediate `_16k_mono.wav` files after transcription to avoid clutter.
- Customizable Similarity: By default, uses a 0.6 threshold for "good enough" spoken overlap.
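The extraction step above can be sketched in Python. The helper below (`build_ffmpeg_cmd` is a hypothetical name, not a function from the script) just assembles the FFmpeg arguments for a 16 kHz mono `.wav`:

```python
import subprocess  # used in the usage example below
from pathlib import Path

def build_ffmpeg_cmd(video_path, out_dir):
    """Assemble the FFmpeg arguments for extracting 16 kHz mono audio."""
    wav_path = Path(out_dir) / (Path(video_path).stem + ".wav")
    return [
        "ffmpeg", "-y",         # -y: overwrite an existing output file
        "-i", str(video_path),  # input video
        "-vn",                  # drop the video stream
        "-ar", "16000",         # 16 kHz sample rate, which whisper.cpp expects
        "-ac", "1",             # mono
        str(wav_path),
    ]

# With FFmpeg installed, running it is a one-liner:
# subprocess.run(build_ffmpeg_cmd("video1.mp4", "extracted_audio"), check=True)
```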
- Python 3.7+
- FFmpeg
  - On macOS:
    ```
    brew install ffmpeg
    ```
  - On Linux (Ubuntu/Debian):
    ```
    sudo apt-get update
    sudo apt-get install ffmpeg
    ```
- whisper.cpp
  - The script assumes you have `main` compiled and available in `whisper.cpp`. Example:
    ```
    git clone https://github.com/ggerganov/whisper.cpp
    cd whisper.cpp
    make
    ```
  - Place your model (e.g., `ggml-small.bin`) in `whisper.cpp/models/` or update the script's `model` path.
- Optional (for partial matching): `pip install rapidfuzz` (or `pip install fuzzywuzzy`) if you'd like to use a partial ratio for more robust substring-based matching (see Advanced Usage).
Below is an example setup:
```
my_repo/
├─ match_transcript.py
├─ README.md
├─ whisper.cpp/              # (cloned & compiled whisper)
└─ 02_Media/
   └─ 01_Video/
      └─ 01_Raw/
         └─ match_test/
            ├─ video1.mp4
            ├─ video2.mov
            ├─ audio/
            │  ├─ audio1.wav
            │  └─ audio2.WAV
            ├─ extracted_audio/   # auto-generated for video-extracted wav
            └─ output/            # matched results placed here
```
In the script, you can update these paths (e.g., `video_folder`, `audio_folder`, `output_folder`, etc.) to match your directory structure.
- Clone & Setup:

  ```
  git clone https://github.com/<your-username>/my_repo.git
  cd my_repo
  ```

  Ensure `match_transcript.py` is in the same folder as this `README.md`, and that `whisper.cpp` is compiled.

- Install Dependencies (optional, for chunk-based matching or fuzzy partial matching):

  ```
  pip install rapidfuzz
  ```

- Edit Paths: open `match_transcript.py` and update

  ```
  video_folder = "/path/to/video/folder"
  audio_folder = "/path/to/audio/folder"
  output_folder = "/path/to/output/folder"
  extracted_audio_folder = "/path/to/extracted_audio"
  whisper_dir = "/path/to/whisper.cpp"
  ```

  so they reflect your actual directories.

- Run the Script:

  ```
  python match_transcript.py
  ```
  The script:
  - Converts `.mp4` or `.mov` files to `.wav` in `extracted_audio_folder`.
  - Converts all `.wav` files (both video-extracted and standalone) to 16 kHz mono `_16k_mono.wav` files.
  - Transcribes them with `whisper.cpp`.
  - Compares transcriptions against the similarity threshold (0.6) to produce a list of matches.
  - Copies each match (video + audio) to a subfolder in `output_folder`.
Files:

```
match_test/video1.mp4
match_test/audio/audio1.wav
```

Run:

```
python match_transcript.py
```

Output:
- Creates a `video1.wav` in `extracted_audio_folder`.
- Transcribes `video1.wav` and `audio1.wav`.
- If overlap is detected, places copies of `video1.mp4` and `audio1.wav` in `match_test/output/video1`.
Files:

```
video1.mp4
video2.mov
audio/long_interview.wav
```

If the script finds partial overlap for both videos in `long_interview.wav`, it will create:
```
output/
├─ video1/
│  ├─ video1.mp4
│  └─ long_interview.wav
└─ video2/
   ├─ video2.mov
   └─ long_interview.wav
```
If your audio files are significantly longer than the videos, you can improve partial matching by using RapidFuzz's `partial_ratio`:

```python
from rapidfuzz import fuzz

def compare_transcriptions(video_transcriptions, audio_transcriptions):
    matches = []
    for video_file, video_text in video_transcriptions.items():
        found_any = False
        for audio_file, audio_text in audio_transcriptions.items():
            # partial_ratio scores the best-matching substring (0-100),
            # so a short clip can match inside a long recording
            similarity = fuzz.partial_ratio(video_text, audio_text) / 100.0
            if similarity > 0.6:
                matches.append((video_file, audio_file, similarity))
                found_any = True
        if not found_any:
            print(f"No match found for {video_file}.")
    return matches
```
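If you'd rather avoid the extra dependency, a rough stand-in for `partial_ratio` can be built from the standard library's `difflib`. This sliding-window sketch is an approximation, not an exact equivalent of RapidFuzz's algorithm:

```python
from difflib import SequenceMatcher

def partial_similarity(short_text, long_text):
    """Best ratio of short_text against same-length windows of long_text."""
    if len(short_text) >= len(long_text):
        return SequenceMatcher(None, short_text, long_text).ratio()
    n = len(short_text)
    step = max(1, n // 2)  # half-window stride keeps the scan cheap
    best = 0.0
    for start in range(0, len(long_text) - n + 1, step):
        window = long_text[start:start + n]
        best = max(best, SequenceMatcher(None, short_text, window).ratio())
    return best
```

Dropping it in place of the `fuzz.partial_ratio(...) / 100.0` call keeps the rest of `compare_transcriptions` unchanged, at the cost of slower and slightly coarser matching on long transcripts.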