Audio-Video Matching with Whisper & FFmpeg

This repository contains a Python script, match_transcript.py, that automatically:

  1. Extracts audio from video files,
  2. Transcribes both video-extracted audio and standalone audio files using whisper.cpp,
  3. Compares transcriptions to find overlapping spoken content (even if audio is longer than video),
  4. Copies matched video and audio files into organized folders.

This approach is useful when you have a folder of video recordings and a separate folder of audio recordings and need to figure out which audio files (possibly very long) overlap with which video files.


Features

  • Automatic Audio Extraction: Uses FFmpeg to create a .wav (16 kHz, mono) from each video file (see the sketch after this list).
  • Speech-to-Text with Whisper: Calls whisper.cpp to transcribe .wav files, skipping re-transcription if a .txt file already exists.
  • Partial Overlap Support: Incorporates a fuzzy matching approach (or optional chunk-based approach) so a long audio file can match multiple short video clips if content overlaps.
  • Many-to-One and One-to-Many: A single audio file can match multiple video files, copying the same audio file into multiple match folders.
  • Cleanup: Removes intermediate _16k_mono.wav files after transcription to avoid clutter.
  • Customizable Similarity: By default, uses a 0.6 threshold for “good enough” spoken overlap.
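
For reference, the extraction step comes down to a single FFmpeg call per video. The sketch below is illustrative only: the function name is not the script's actual API, and the script does the work in two passes (extract a .wav, then downmix to 16 kHz mono), which a single ffmpeg invocation can also do:

import subprocess
from pathlib import Path

def extract_16k_mono(video_path: str, out_dir: str) -> Path:
    """Extract a 16 kHz mono WAV from a video file with FFmpeg."""
    out_path = Path(out_dir) / (Path(video_path).stem + ".wav")
    subprocess.run(
        [
            "ffmpeg", "-y",        # overwrite any existing output
            "-i", video_path,      # input video
            "-vn",                 # drop the video stream
            "-ar", "16000",        # resample to 16 kHz (what whisper.cpp expects)
            "-ac", "1",            # downmix to mono
            str(out_path),
        ],
        check=True,
    )
    return out_path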

Requirements

  1. Python 3.7+
  2. FFmpeg
    • On macOS:
      brew install ffmpeg
    • On Linux (Ubuntu/Debian):
      sudo apt-get update
      sudo apt-get install ffmpeg
  3. whisper.cpp
    • The script assumes the main binary is compiled and available inside the whisper.cpp directory (an example call is sketched after this list).
    • Example:
      git clone https://github.com/ggerganov/whisper.cpp
      cd whisper.cpp
      make
    • Place your model (e.g., ggml-small.bin) in whisper.cpp/models/ or update the script’s model path.
  4. Optional (for partial matching):
    • pip install rapidfuzz (or pip install fuzzywuzzy) if you’d like to use partial-ratio scoring for more robust substring-based matching (see Partial Ratio Matching below).
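
Once main is built, the script can shell out to it for each prepared WAV. A minimal sketch of that call, assuming whisper.cpp's standard -m, -f, -otxt, and -of flags (the function name is illustrative, and the exact command the script builds may differ):

import subprocess
from pathlib import Path

def transcribe(wav_path: str, whisper_dir: str, model: str = "models/ggml-small.bin") -> str:
    """Run whisper.cpp's main binary and return the path of the resulting .txt transcript."""
    txt_base = str(Path(wav_path).with_suffix(""))  # whisper.cpp appends .txt itself
    subprocess.run(
        [
            str(Path(whisper_dir) / "main"),
            "-m", str(Path(whisper_dir) / model),   # ggml model file
            "-f", wav_path,                         # 16 kHz mono input
            "-otxt",                                # write a plain-text transcript
            "-of", txt_base,                        # output path without extension
        ],
        check=True,
    )
    return txt_base + ".txt"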

Folder Structure & Script Configuration

Below is an example setup:

my_repo/
├─ match_transcript.py
├─ README.md
├─ whisper.cpp/         # (cloned & compiled whisper)
├─ 02_Media/
│   └─ 01_Video/
│       └─ 01_Raw/
│           └─ match_test/
│               ├─ video1.mp4
│               ├─ video2.mov
│               ├─ audio/
│               │   ├─ audio1.wav
│               │   └─ audio2.WAV
│               ├─ extracted_audio/   # auto-generated for video-extracted wav
│               └─ output/            # matched results placed here

In the script, you can update these paths (e.g., video_folder, audio_folder, output_folder, etc.) to match your directory structure.


Usage

  1. Clone & Setup:

    git clone https://github.com/<your-username>/my_repo.git
    cd my_repo

    Ensure match_transcript.py is in the same folder as this README.md, and that whisper.cpp is compiled.

  2. Install Dependencies (optional; needed only for fuzzy partial-ratio or chunk-based matching):

    pip install rapidfuzz
  3. Edit Paths:
    Open match_transcript.py and update:

    video_folder = "/path/to/video/folder"
    audio_folder = "/path/to/audio/folder"
    output_folder = "/path/to/output/folder"
    extracted_audio_folder = "/path/to/extracted_audio"
    whisper_dir = "/path/to/whisper.cpp"

    so they reflect your actual directories.

  4. Run the Script:

    python match_transcript.py
    • The script:
      1. Converts .mp4 or .mov files to .wav in extracted_audio_folder.
      2. Converts all .wav (both from videos and standalone) to 16 kHz mono _16k_mono.wav.
      3. Transcribes them with whisper.cpp.
      4. Compares transcriptions and keeps pairs whose similarity exceeds the 0.6 threshold, producing a list of matches.
      5. Copies each match (video + audio) to a subfolder in output_folder (a rough sketch of steps 4 and 5 follows).
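
Steps 4 and 5 can be pictured roughly as follows. This is a sketch, not the script's exact code: it uses Python's difflib.SequenceMatcher as a stand-in for the default similarity measure and assumes the transcription dicts are keyed by file name:

import shutil
from difflib import SequenceMatcher
from pathlib import Path

def copy_matches(video_transcriptions, audio_transcriptions,
                 video_folder, audio_folder, output_folder, threshold=0.6):
    """Pair each video with any audio whose transcript is similar enough, then copy both."""
    for video_file, video_text in video_transcriptions.items():
        for audio_file, audio_text in audio_transcriptions.items():
            similarity = SequenceMatcher(None, video_text, audio_text).ratio()
            if similarity > threshold:
                match_dir = Path(output_folder) / Path(video_file).stem
                match_dir.mkdir(parents=True, exist_ok=True)  # one folder per matched video
                shutil.copy2(Path(video_folder) / video_file, match_dir)
                shutil.copy2(Path(audio_folder) / audio_file, match_dir)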

Examples

Simple Example

Files:

  • match_test/video1.mp4
  • match_test/audio/audio1.wav

Run:

python match_transcript.py

Output:

  • Creates a video1.wav in extracted_audio_folder.
  • Transcribes video1.wav and audio1.wav.
  • If overlap is detected, places copies of video1.mp4 and audio1.wav in match_test/output/video1.

Multiple Videos & Single Long Audio

Files:

  • video1.mp4
  • video2.mov
  • audio/long_interview.wav

If the script finds partial overlap for both videos in long_interview.wav, it will create:

output/
├─ video1/
│   ├─ video1.mp4
│   └─ long_interview.wav
├─ video2/
│   ├─ video2.mov
│   └─ long_interview.wav

Partial Ratio Matching (Optional)

If your audio files are significantly longer than your videos, you can improve partial matching by using RapidFuzz’s partial_ratio:

from rapidfuzz import fuzz

def compare_transcriptions(video_transcriptions, audio_transcriptions):
    matches = []
    for video_file, video_text in video_transcriptions.items():
        found_any = False
        for audio_file, audio_text in audio_transcriptions.items():
            # partial_ratio for substring-based matching
            similarity = fuzz.partial_ratio(video_text, audio_text) / 100.0
            if similarity > 0.6:
                matches.append((video_file, audio_file, similarity))
                found_any = True
        if not found_any:
            print(f"No match found for {video_file}.")
    return matches
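
The design choice here: partial_ratio scores the best-aligned substring, so a short video transcript buried anywhere inside a long audio transcript can still score near 1.0, whereas a plain ratio would be dragged down by all the unmatched audio content. If you swap this function into match_transcript.py, the 0.6 threshold is a reasonable starting point; tune it against a few pairs you already know match.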
