Discord Audio Transcript Deduplication Pipeline

This repository contains a multi-stage pipeline for processing, transcribing, and deduplicating Discord voice session recordings into clean text transcripts. It includes tools for audio capture, filtering, transcription, clustering-based deduplication, and final text output.

📚 Overview

The pipeline operates in the following phases:

- Phase 0 – Discord Audio Capture: Captures user audio streams as individual `.wav` files and generates session logs.
- Phase 1 – Audio Validation and Filtering: Rejects audio that is silent or falls outside the duration constraints, and rescues bursty utterances that VAD would otherwise falsely reject.
- Phase 2 – Whisper Transcription: Transcribes accepted audio files to text using a CTranslate2-based Whisper model.
- Phase 3 – Deduplication by Clustering: Clusters transcriptions and deduplicates them based on similarity, canonical form, and scoring.
- Output: A cleaned `.txt` transcript preserving character, flow, and session integrity.
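
To make the Phase 1 idea concrete, here is a minimal sketch of a silence and duration filter. It is not the code in `dedupe_audit.py`; the function names, the RMS-based silence test, and every threshold are assumptions chosen for illustration, and it only handles 16-bit PCM `.wav` input.

```python
import array
import math
import sys
import wave


def wav_stats(path):
    """Return (duration_seconds, rms_amplitude) for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wf.getframerate()
        n_frames = wf.getnframes()
        raw = wf.readframes(n_frames)
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    duration = n_frames / rate if rate else 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0
    return duration, rms


def classify(path, min_seconds=0.3, max_seconds=30.0, silence_rms=200.0):
    """Label a clip as accepted or rejected. All thresholds are hypothetical."""
    duration, rms = wav_stats(path)
    if duration < min_seconds:
        return "reject:too_short"
    if duration > max_seconds:
        return "reject:too_long"
    if rms < silence_rms:
        return "reject:silence"
    return "accept"
```

In a sketch like this, clips labelled `reject:too_short` would be the natural candidates to hand to a burst-rescue step rather than discard outright.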

🛠 Scripts

| Script | Purpose |
| --- | --- |
| `index.ts` | Captures Discord voice as per-user `.wav` files |
| `dedupe_audit.py` | Filters raw audio: silence, noise, duplicates, duration |
| `burst_scope.py` | Rescues short sharp utterances from false VAD rejection |
| `transcribe_accepted.py` | Transcribes accepted `.wav` files into enriched JSONL |
| `dedupe_transcript.py` | Deduplicates transcribed JSONL using clustering |
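
The clustering-based deduplication step can be pictured with the sketch below. It is an illustration only and not the algorithm in `dedupe_transcript.py`: it assumes a greedy single-pass clustering over a normalized ("canonical") form of each line, uses `difflib.SequenceMatcher` from the standard library as the similarity measure, and scores candidates simply by length. The function names and the `0.85` threshold are hypothetical.

```python
import re
from difflib import SequenceMatcher


def canonical(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def cluster(lines, threshold=0.85):
    """Greedily group lines whose canonical forms are near-identical.

    Returns a list of clusters, each a list of indices into `lines`,
    ordered by the first line that opened the cluster.
    """
    clusters = []
    for i, line in enumerate(lines):
        key = canonical(line)
        for members in clusters:
            representative = canonical(lines[members[0]])
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def dedupe(lines, threshold=0.85):
    """Keep the longest line from each cluster, in cluster order."""
    return [
        max((lines[i] for i in members), key=len)
        for members in cluster(lines, threshold)
    ]


# Made-up example lines, not real session data.
lines = [
    "Hello there, everyone!",
    "hello there everyone",
    "Roll for initiative.",
    "Hello there, everyone!!",
]
print(dedupe(lines))
```

Comparing against only the first member of each cluster keeps the sketch short; a real pipeline would likely need a more careful representative and a scoring rule that takes transcription confidence or timing into account.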

🚀 Quick Start

1. Clone the repo and install the required Python and Node.js dependencies.
2. Configure `.env` with your Discord bot credentials.
3. Run each phase in sequence:
   - `index.ts` to capture audio.
   - `dedupe_audit.py` to filter audio.
   - `transcribe_accepted.py` to transcribe.
   - `dedupe_transcript.py` to deduplicate.
4. Review the final transcript output.

⚡ Key Notes

Built specifically for GPU-accelerated transcription with faster-whisper.
Designed and tested on a GeForce RTX 5090 with a custom-built CTranslate2 backend.
Provided "as is", with no guarantees; it's up to you to configure and compile any needed dependencies.
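
For orientation, GPU transcription with faster-whisper might look roughly like the sketch below. This is not the contents of `transcribe_accepted.py`: the record fields, the function names, and the `large-v3` / `float16` settings are assumptions for illustration, and the real script's "enriched" JSONL may carry different or additional fields.

```python
import json


def segment_to_record(wav_path, segment):
    """Flatten one transcribed segment into a JSON-serialisable dict.

    The field names form a hypothetical schema, not the repository's format.
    """
    return {
        "file": wav_path,
        "start": round(segment.start, 3),
        "end": round(segment.end, 3),
        "text": segment.text.strip(),
    }


def transcribe_to_jsonl(wav_paths, out_path, model_size="large-v3"):
    """Transcribe each file and write one JSON object per segment."""
    # Imported lazily so the helper above works without the GPU stack installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    with open(out_path, "w", encoding="utf-8") as out:
        for wav_path in wav_paths:
            segments, _info = model.transcribe(wav_path)
            for segment in segments:
                record = segment_to_record(wav_path, segment)
                out.write(json.dumps(record) + "\n")
```

Calling `transcribe_to_jsonl(["clip.wav"], "out.jsonl")` requires a working CUDA-enabled CTranslate2 build, which, as noted above, you may need to compile yourself for your hardware.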

📦 Installation

Python Dependencies

pip install -r requirements.txt

(further dependencies may be required)

Node.js Dependencies

npm install

(further dependencies may be required)
