
audio-transcriber is a lean yet powerful command-line tool that simplifies speech recognition (Speech-to-Text). It enables seamless transcription of audio files directly from the command line, supporting both the OpenAI Whisper API and local Ollama models.


Audio Transcriber 🎙️

CI | Python 3.8+ | License: MIT | Code style: ruff

Professional audio transcription tool using OpenAI-compatible Speech-to-Text APIs with intelligent segmentation and merging.

✨ Features

  • 🖥️ GUI & CLI Interfaces - User-friendly graphical interface + powerful command-line tool
  • 🔄 Intelligent Segmentation - Automatically splits large audio files into processable chunks
  • ⚡ Parallel Processing - Concurrent transcription of multiple segments for faster results
  • 🎙️ Speaker Diarization - Identify and label different speakers in conversations
  • 📝 AI Summarization - Generate concise summaries of transcriptions
  • 📄 Multi-Format Export - Export to DOCX, Markdown, and LaTeX with metadata
  • 🎯 Smart Merging - Overlap detection and removal for seamless final transcripts
  • 🌍 Multi-Format Support - MP3, WAV, FLAC, M4A, OGG, AAC, WMA, MP4
  • 📋 Multiple Output Formats - Text, JSON, SRT, VTT subtitles, Verbose JSON, Diarized JSON
  • 🔌 OpenAI-Compatible - Works with OpenAI, Ollama, Groq, LocalAI, Azure, Together.ai
  • πŸ” Resume Capability - Automatically skips already processed files
  • 📊 Live Progress Tracking - Real-time ETA, throughput, and cost tracking
  • 🌐 Language Detection - Automatic language detection from audio
  • 💰 Cost Estimation - Live cost calculation during processing
  • 📁 Organized Output - Separate folders for transcriptions, segments, summaries, and exports

🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • FFmpeg (required for audio processing)

Install FFmpeg

Ubuntu/Debian:

sudo apt-get update && sudo apt-get install ffmpeg

macOS:

brew install ffmpeg

Windows:

choco install ffmpeg
# Or download from https://ffmpeg.org
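
Verify that FFmpeg is on your PATH afterwards:

ffmpeg -version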

Install Audio Transcriber

# Clone the repository
git clone https://github.com/lucmuss/audio-transcriber.git
cd audio-transcriber

# Create virtual environment (recommended)
uv venv

# Install package
uv sync

# Or install with development dependencies
uv sync --extra dev
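
# Activate the environment so the CLI entry points are on your PATH
# (uv creates it in .venv by default; alternatively prefix commands with `uv run`)
source .venv/bin/activate   # Windows: .venv\Scripts\activate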

⚡ Quick Start

CLI (Command Line)

# Set your API key
export AUDIO_TRANSCRIBE_API_KEY="sk-..."

# Transcribe a single file
audio-transcriber --input podcast.mp3

# Transcribe all files in a directory
audio-transcriber --input ./audio_files

Output will be saved to ./transcriptions/ by default.
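
For an input like podcast.mp3, the resulting layout looks roughly like this (file names are illustrative; exact segment naming may differ):

./transcriptions/podcast.txt    # merged final transcript
./segments/                     # temporary audio chunks (kept by default)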

GUI (Graphical Interface)

# Start the GUI
audio-transcriber-gui

The GUI provides:

  • 📁 Simple File Selection - Browse buttons for files and folders
  • 🔌 API Configuration - Visual input for all API settings
  • ⚙️ All Options - Segment length, concurrency, language, etc.
  • 📊 Live Progress - Real-time log output during processing
  • 🎯 Tooltips & Help - Provider examples and tips directly in the GUI

See GUI_GUIDE.md for detailed instructions.

📚 Usage Examples

Basic Transcription

# Single file with OpenAI
audio-transcriber --input lecture.mp3

🌐 Use with Local Ollama (Free & Private)

# Start Ollama first
ollama serve

# Transcribe using local model
audio-transcriber \
  --input podcast.mp3 \
  --base-url http://localhost:11434/v1 \
  --api-key ollama \
  --model whisper
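
If transcription fails right away, first check that Ollama's OpenAI-compatible endpoint is reachable (it normally lists the available models):

curl http://localhost:11434/v1/models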

πŸ“ Generate Subtitles

# SRT format
audio-transcriber --input video.mp4 --response-format srt

# VTT format for web
audio-transcriber --input video.mp4 --response-format vtt

🎯 Custom Segmentation

# Longer segments for better context
audio-transcriber \
  --input long_podcast.mp3 \
  --segment-length 900 \
  --overlap 15 \
  --concurrency 6
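
As a quick sanity check on chunk counts (ignoring overlap): a 60-minute file splits into about 4 segments at 900 s versus 12 at the default 300 s, i.e. a third of the API calls:

echo $(( (3600 + 900 - 1) / 900 ))  # ceil(3600/900) = 4 segments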

🌍 Language-Specific Transcription

# German language
audio-transcriber --input german_audio.mp3 --language de

# Auto-detect language
audio-transcriber --input mixed_audio.mp3 --detect-language

🎨 With Context Prompt

# Improve accuracy with context
audio-transcriber \
  --input tech_talk.mp3 \
  --prompt "This is a technical discussion about Kubernetes, Docker, and microservices. Speaker: John Smith"

βš™οΈ Configuration

Command-Line Options

Required Arguments

Option | Description | Default
-i, --input | Path to audio file or directory | Required

API Configuration

Option | Description | Default
--api-key | API key | From AUDIO_TRANSCRIBE_API_KEY
--base-url | API base URL | https://api.openai.com/v1
--model | Model name | gpt-4o-mini-transcribe

Output Configuration

Option | Description | Default
-o, --output-dir | Output directory for transcriptions | ./transcriptions
--segments-dir | Directory for temporary segments | ./segments
-f, --response-format | Output format (text/json/srt/vtt/verbose_json) | text

Segmentation Parameters

Option | Description | Default
--segment-length | Segment length in seconds | 300 (5 min)
--overlap | Overlap between segments in seconds | 3

Transcription Parameters

Option | Description | Default
--language | ISO-639-1 language code (e.g., 'en', 'de') | Auto-detect
--detect-language | Auto-detect language from first segment | true
--no-detect-language | Disable language auto-detection | -
--temperature | Model temperature (0.0-1.0) | 0.0
--prompt | Context prompt for better accuracy | None

Performance Parameters

Option | Description | Default
-c, --concurrency | Number of parallel transcriptions | 8

Diarization Parameters (Speaker Recognition)

Option | Description | Default
--enable-diarization | Enable speaker diarization | false
--num-speakers | Expected number of speakers | Auto-detect
--known-speaker-names | List of known speaker names | None
--known-speaker-references | Paths to reference audio files | None

Summarization Parameters

Option | Description | Default
--summarize | Generate a summary of transcription | false
--summary-dir | Output directory for summaries | ./summaries
--summary-model | Model for summarization | gpt-4.1-mini
--summary-prompt | Custom prompt for summary generation | See code

Export Parameters

Option | Description | Default
--export | Export to formats (docx, md, latex) | None
--export-dir | Output directory for exports | ./exports
--export-title | Title for exported documents | Filename
--export-author | Author name for exported documents | None

Behavior Options

Option | Description | Default
--no-keep-segments | Delete temporary segment files after processing | -
--skip-existing | Skip files if output already exists | false
--analyze-duration | Analyze audio duration before processing (slower, better ETA) | false
--dry-run | Simulate processing without API calls | false
-v, --verbose | Enable verbose logging | false

Note: By default, segments are kept and files are re-processed even if outputs exist.

Environment Variables

export AUDIO_TRANSCRIBE_API_KEY="sk-..."
export AUDIO_TRANSCRIBE_BASE_URL="https://api.openai.com/v1"
export AUDIO_TRANSCRIBE_MODEL="gpt-4o-mini-transcribe"
export AUDIO_TRANSCRIBE_OUTPUT_DIR="./transcriptions"
export AUDIO_TRANSCRIBE_SEGMENT_LENGTH="300"
export AUDIO_TRANSCRIBE_OVERLAP="3"
export AUDIO_TRANSCRIBE_CONCURRENCY="8"
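
A convenient pattern is to keep one small profile per provider and source it before a run; the file name is just an example, and the values mirror the Ollama setup shown earlier:

cat > ollama.env <<'EOF'
export AUDIO_TRANSCRIBE_API_KEY="ollama"
export AUDIO_TRANSCRIBE_BASE_URL="http://localhost:11434/v1"
export AUDIO_TRANSCRIBE_MODEL="whisper"
EOF
source ollama.env
audio-transcriber --input podcast.mp3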

📄 Output Formats

Text (Default)

This is the transcribed audio content. It's clean and readable.

JSON

{
  "text": "Full transcription text",
  "segments": [...],
  "language": "en"
}
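
The JSON output is easy to post-process; for example, extracting just the transcript text with jq (the output path is illustrative):

jq -r '.text' ./transcriptions/podcast.json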

SRT Subtitles

1
00:00:00,000 --> 00:00:05,200
First subtitle line

2
00:00:05,200 --> 00:00:10,500
Second subtitle line

VTT Subtitles

WEBVTT

00:00:00.000 --> 00:00:05.200
First subtitle line

00:00:05.200 --> 00:00:10.500
Second subtitle line
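
Both subtitle formats work with standard tooling; for example, burning the SRT into a video with FFmpeg (file names are illustrative; requires an FFmpeg build with libass):

ffmpeg -i video.mp4 -vf subtitles=video.srt -c:a copy video_subtitled.mp4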

🔧 Advanced Features

Batch Processing

# Process entire directories
audio-transcriber --input ./100_podcasts --concurrency 8

Resume Failed Jobs

# Skip files whose output already exists (re-processing is the default; see Behavior Options)
audio-transcriber --input ./audio_files --skip-existing
# Interrupt with Ctrl+C, then rerun with the same flag
audio-transcriber --input ./audio_files --skip-existing  # Resumes from where it left off

Dry Run Mode

# Test configuration without API calls
audio-transcriber --input large_file.mp3 --dry-run

Speaker Diarization (Who Said What)

# Enable speaker diarization
audio-transcriber \
  --input meeting.mp3 \
  --enable-diarization

# With expected number of speakers
audio-transcriber \
  --input podcast.mp3 \
  --enable-diarization \
  --num-speakers 2

# With known speaker names and reference audio
audio-transcriber \
  --input interview.mp3 \
  --enable-diarization \
  --known-speaker-names "Alice Smith" "Bob Johnson" \
  --known-speaker-references alice_voice.wav bob_voice.wav

AI Summarization

# Generate summary of transcription
audio-transcriber \
  --input lecture.mp3 \
  --summarize

# Custom summary model and prompt
audio-transcriber \
  --input podcast.mp3 \
  --summarize \
  --summary-model gpt-4o \
  --summary-prompt "Summarize the key points and action items"

Document Export

# Export to Word document
audio-transcriber \
  --input meeting.mp3 \
  --export docx

# Export to multiple formats with metadata
audio-transcriber \
  --input interview.mp3 \
  --export docx md latex \
  --export-title "Company Interview 2026" \
  --export-author "John Doe"

Integration with Other Services

Groq (Fast):

audio-transcriber \
  --api-key "gsk_..." \
  --base-url "https://api.groq.com/openai/v1" \
  --model "whisper-large-v3" \
  --input podcast.mp3

Together.ai:

audio-transcriber \
  --api-key "..." \
  --base-url "https://api.together.xyz/v1" \
  --model "whisper" \
  --input podcast.mp3
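
Other OpenAI-compatible servers such as LocalAI follow the same pattern; the port and model name below are illustrative and depend on your deployment:

audio-transcriber \
  --api-key "not-needed" \
  --base-url "http://localhost:8080/v1" \
  --model "whisper-1" \
  --input podcast.mp3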

💻 Development

Setup Development Environment

# Clone and install
git clone https://github.com/lucmuss/audio-transcriber.git
cd audio-transcriber

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Setup project with uv
just setup

# Install pre-commit hooks
uv run pre-commit install

Development Workflow

# Start development environment (runs docker/entrypoint.sh)
just dev

# Format and fix code
just format

# Check code quality (lint + format check)
just lint

# Run tests
just test

# Run complete quality check (lint + typecheck + test)
just check

# Clean artifacts
just clean

Manual Commands (Alternative)

# Run all tests
uv run pytest

# With coverage
uv run pytest --cov=audio_transcriber --cov-report=html

# Specific test file
uv run pytest tests/test_utils.py

# Type check
uv run mypy src

📊 Performance & Costs

OpenAI Whisper Pricing

  • Cost: $0.0001 per minute (as of Jan 2026)
  • Example: 60-minute podcast β‰ˆ $0.006

Performance Tips

  1. Increase Concurrency (if API limits allow):

    --concurrency 8
  2. Adjust Segment Length (larger = fewer API calls):

    --segment-length 900  # 15 minutes
  3. Use Local Models (free & unlimited):

    # Ollama, LocalAI - no API costs; throughput depends on your local hardware
  4. Batch Processing (process multiple files efficiently; see the combined example below):

    audio-transcriber --input ./folder_with_100_files
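
Combining these tips for a large local batch run (values are illustrative; all flags are documented in the Configuration section above):

audio-transcriber \
  --input ./folder_with_100_files \
  --segment-length 900 \
  --concurrency 8 \
  --skip-existing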

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Quick Contribution Steps

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests
  5. Run quality checks
  6. Commit (git commit -m 'feat: add amazing feature')
  7. Push (git push origin feature/amazing-feature)
  8. Create a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

Built with Python and FFmpeg, on top of OpenAI-compatible Speech-to-Text APIs.

📞 Support


Made with ❤️ for the open-source community
