This repository contains a Python script, match_transcript.py, that automatically:
- Extracts audio from video files,
- Transcribes both video-extracted audio and standalone audio files using whisper.cpp,
- Compares transcriptions to find overlapping spoken content (even if audio is longer than video),
- Copies matched video and audio files into organized folders.
This approach is useful when you have a folder of video recordings and a separate folder of audio recordings and need to figure out which audio files (possibly very long) overlap with which video files.
- Automatic Audio Extraction: Uses FFmpeg to create a `.wav` (16 kHz, mono) from each video file.
- Speech-to-Text with Whisper: Calls `whisper.cpp` to transcribe `.wav` files, skipping re-transcription if a `.txt` file already exists.
- Partial Overlap Support: Incorporates a fuzzy matching approach (or optional chunk-based approach) so a long audio file can match multiple short video clips if content overlaps.
- Many-to-One and One-to-Many: A single audio file can match multiple video files; the same audio file is copied into each match folder.
- Cleanup: Removes intermediate `_16k_mono.wav` files after transcription to avoid clutter.
- Customizable Similarity: By default, uses a 0.6 threshold for "good enough" spoken overlap.
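The extraction step above can be sketched in Python. The helper below (`build_ffmpeg_cmd` is a hypothetical name, not a function from the script) just assembles the FFmpeg arguments for a 16 kHz mono `.wav`:

```python
import subprocess  # used in the usage example below
from pathlib import Path

def build_ffmpeg_cmd(video_path, out_dir):
    """Assemble the FFmpeg arguments for extracting 16 kHz mono audio."""
    wav_path = Path(out_dir) / (Path(video_path).stem + ".wav")
    return [
        "ffmpeg", "-y",         # -y: overwrite an existing output file
        "-i", str(video_path),  # input video
        "-vn",                  # drop the video stream
        "-ar", "16000",         # 16 kHz sample rate, which whisper.cpp expects
        "-ac", "1",             # mono
        str(wav_path),
    ]

# With FFmpeg installed, running it is a one-liner:
# subprocess.run(build_ffmpeg_cmd("video1.mp4", "extracted_audio"), check=True)
```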
- Python 3.7+
- FFmpeg
  - On macOS:
    ```
    brew install ffmpeg
    ```
  - On Linux (Ubuntu/Debian):
    ```
    sudo apt-get update
    sudo apt-get install ffmpeg
    ```
- whisper.cpp
  - The script assumes you have `main` compiled and available in `whisper.cpp`. Example:
    ```
    git clone https://github.com/ggerganov/whisper.cpp
    cd whisper.cpp
    make
    ```
  - Place your model (e.g., `ggml-small.bin`) in `whisper.cpp/models/` or update the script's `model` path.
- Optional (for partial matching): `pip install rapidfuzz` (or `pip install fuzzywuzzy`) if you'd like to use a partial ratio for more robust substring-based matching (see Advanced Usage).
Below is an example setup:
```
my_repo/
├─ match_transcript.py
├─ README.md
├─ whisper.cpp/              # (cloned & compiled whisper)
└─ 02_Media/
   └─ 01_Video/
      └─ 01_Raw/
         └─ match_test/
            ├─ video1.mp4
            ├─ video2.mov
            ├─ audio/
            │  ├─ audio1.wav
            │  └─ audio2.WAV
            ├─ extracted_audio/   # auto-generated for video-extracted wav
            └─ output/            # matched results placed here
```
In the script, you can update these paths (e.g., `video_folder`, `audio_folder`, `output_folder`, etc.) to match your directory structure.
- Clone & Setup:

  ```
  git clone https://github.com/<your-username>/my_repo.git
  cd my_repo
  ```

  Ensure `match_transcript.py` is in the same folder as this `README.md`, and that `whisper.cpp` is compiled.

- Install Dependencies (optional, for chunk-based matching or fuzzy partial matching):

  ```
  pip install rapidfuzz
  ```

- Edit Paths: open `match_transcript.py` and update

  ```
  video_folder = "/path/to/video/folder"
  audio_folder = "/path/to/audio/folder"
  output_folder = "/path/to/output/folder"
  extracted_audio_folder = "/path/to/extracted_audio"
  whisper_dir = "/path/to/whisper.cpp"
  ```

  so they reflect your actual directories.

- Run the Script:

  ```
  python match_transcript.py
  ```
  The script:
  - Converts `.mp4` or `.mov` files to `.wav` in `extracted_audio_folder`.
  - Converts all `.wav` files (both video-extracted and standalone) to 16 kHz mono `_16k_mono.wav` files.
  - Transcribes them with `whisper.cpp`.
  - Compares transcriptions against the similarity threshold (0.6) to produce a list of matches.
  - Copies each match (video + audio) to a subfolder in `output_folder`.
Files:

```
match_test/video1.mp4
match_test/audio/audio1.wav
```

Run:

```
python match_transcript.py
```

Output:
- Creates a `video1.wav` in `extracted_audio_folder`.
- Transcribes `video1.wav` and `audio1.wav`.
- If overlap is detected, places copies of `video1.mp4` and `audio1.wav` in `match_test/output/video1`.
Files:

```
video1.mp4
video2.mov
audio/long_interview.wav
```

If the script finds partial overlap for both videos in `long_interview.wav`, it will create:
```
output/
├─ video1/
│  ├─ video1.mp4
│  └─ long_interview.wav
└─ video2/
   ├─ video2.mov
   └─ long_interview.wav
```
If your audio files are significantly longer than the videos, you can improve partial matching by using RapidFuzz's `partial_ratio`:

```python
from rapidfuzz import fuzz

def compare_transcriptions(video_transcriptions, audio_transcriptions):
    matches = []
    for video_file, video_text in video_transcriptions.items():
        found_any = False
        for audio_file, audio_text in audio_transcriptions.items():
            # partial_ratio scores the best-matching substring (0-100),
            # so a short clip can match inside a long recording
            similarity = fuzz.partial_ratio(video_text, audio_text) / 100.0
            if similarity > 0.6:
                matches.append((video_file, audio_file, similarity))
                found_any = True
        if not found_any:
            print(f"No match found for {video_file}.")
    return matches
```
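If you'd rather avoid the extra dependency, a rough stand-in for `partial_ratio` can be built from the standard library's `difflib`. This sliding-window sketch is an approximation, not an exact equivalent of RapidFuzz's algorithm:

```python
from difflib import SequenceMatcher

def partial_similarity(short_text, long_text):
    """Best ratio of short_text against same-length windows of long_text."""
    if len(short_text) >= len(long_text):
        return SequenceMatcher(None, short_text, long_text).ratio()
    n = len(short_text)
    step = max(1, n // 2)  # half-window stride keeps the scan cheap
    best = 0.0
    for start in range(0, len(long_text) - n + 1, step):
        window = long_text[start:start + n]
        best = max(best, SequenceMatcher(None, short_text, window).ratio())
    return best
```

Dropping it in place of the `fuzz.partial_ratio(...) / 100.0` call keeps the rest of `compare_transcriptions` unchanged, at the cost of slower and slightly coarser matching on long transcripts.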