# Audio Transcriber

Professional audio transcription tool using OpenAI-compatible Speech-to-Text APIs with intelligent segmentation and merging.

## Features
- **GUI & CLI Interfaces** - User-friendly graphical interface + powerful command-line tool
- **Intelligent Segmentation** - Automatically splits large audio files into processable chunks
- **Parallel Processing** - Concurrent transcription of multiple segments for faster results
- **Speaker Diarization** - Identify and label different speakers in conversations
- **AI Summarization** - Generate concise summaries of transcriptions
- **Multi-Format Export** - Export to DOCX, Markdown, and LaTeX with metadata
- **Smart Merging** - Overlap detection and removal for seamless final transcripts
- **Multi-Format Support** - MP3, WAV, FLAC, M4A, OGG, AAC, WMA, MP4
- **Multiple Output Formats** - Text, JSON, SRT, VTT subtitles, Verbose JSON, Diarized JSON
- **OpenAI-Compatible** - Works with OpenAI, Ollama, Groq, LocalAI, Azure, Together.ai
- **Resume Capability** - Automatically skips already processed files
- **Live Progress Tracking** - Real-time ETA, throughput, and cost tracking
- **Language Detection** - Automatic language detection from audio
- **Cost Estimation** - Live cost calculation during processing
- **Organized Output** - Separate folders for transcriptions, segments, summaries, and exports
## Table of Contents

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Usage Examples](#usage-examples)
- [Configuration](#configuration)
- [Output Formats](#output-formats)
- [Advanced Features](#advanced-features)
- [Development](#development)
- [Contributing](#contributing)
- [License](#license)
## Installation

### Prerequisites

- Python 3.8 or higher
- FFmpeg (required for audio processing)
**Ubuntu/Debian:**

```bash
sudo apt-get update && sudo apt-get install ffmpeg
```

**macOS:**

```bash
brew install ffmpeg
```

**Windows:**

```bash
choco install ffmpeg
# Or download from https://ffmpeg.org
```
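You can confirm that FFmpeg is on your PATH before proceeding (standard FFmpeg flag, prints version and build details):

```bash
ffmpeg -version
```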
### Install from Source

```bash
# Clone the repository
git clone https://github.com/lucmuss/audio-transcriber.git
cd audio-transcriber
```
```bash
# Create virtual environment (recommended)
uv venv

# Install package
uv sync

# Or install with development dependencies
uv sync --extra dev
```
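If the install succeeded, the CLI entry point should be available inside the environment (assuming the standard `--help` flag of the argument parser):

```bash
uv run audio-transcriber --help
```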
## Quick Start

### Command Line

```bash
# Set your API key
export AUDIO_TRANSCRIBE_API_KEY="sk-..."

# Transcribe a single file
audio-transcriber --input podcast.mp3

# Transcribe all files in a directory
audio-transcriber --input ./audio_files
```

Output will be saved to `./transcriptions/` by default.
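The defaults can be overridden per run, for example by combining the `-o` and `-f` flags documented under [Configuration](#configuration) (hypothetical file name):

```bash
# Write SRT subtitles to a custom folder
audio-transcriber --input podcast.mp3 -o ./my_transcripts -f srt
```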
### Graphical Interface

```bash
# Start the GUI
audio-transcriber-gui
```

The GUI provides:

- **Simple file selection** - Browse buttons for files and folders
- **API configuration** - Visual input for all API settings
- **All options** - Segment length, concurrency, language, etc.
- **Live progress** - Real-time log output during processing
- **Tooltips & help** - Provider examples and tips directly in the GUI

See GUI_GUIDE.md for detailed instructions.
## Usage Examples

### Basic Transcription

```bash
# Single file with OpenAI
audio-transcriber --input lecture.mp3
```

### Local Model with Ollama

```bash
# Start Ollama first
ollama serve

# Transcribe using local model
audio-transcriber \
  --input podcast.mp3 \
  --base-url http://localhost:11434/v1 \
  --api-key ollama \
  --model whisper
```
### Subtitle Formats

```bash
# SRT format
audio-transcriber --input video.mp4 --response-format srt

# VTT format for web
audio-transcriber --input video.mp4 --response-format vtt
```
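As a follow-on step outside this tool, an SRT file produced above can be burned into the video with plain FFmpeg (requires an FFmpeg build with libass; file names are illustrative):

```bash
# Hard-code the subtitles into the video stream
ffmpeg -i video.mp4 -vf "subtitles=video.srt" video_subtitled.mp4
```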
### Segmentation Tuning

```bash
# Longer segments for better context
audio-transcriber \
  --input long_podcast.mp3 \
  --segment-length 900 \
  --overlap 15 \
  --concurrency 6
```
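As a sanity check on these numbers, assuming each segment advances by segment length minus overlap (900 − 15 = 885 s of new audio):

```bash
# Approximate segment count for a 60-minute (3600 s) file
echo $(( (3600 + 885 - 1) / 885 ))   # ceil(3600 / 885) = 5 segments
```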
### Language Options

```bash
# German language
audio-transcriber --input german_audio.mp3 --language de

# Auto-detect language
audio-transcriber --input mixed_audio.mp3 --detect-language
```
### Context Prompts

```bash
# Improve accuracy with context
audio-transcriber \
  --input tech_talk.mp3 \
  --prompt "This is a technical discussion about Kubernetes, Docker, and microservices. Speaker: John Smith"
```

## Configuration

### Input Options

| Option | Description | Default |
|---|---|---|
| `-i, --input` | Path to audio file or directory | Required |
### API Options

| Option | Description | Default |
|---|---|---|
| `--api-key` | API key | From `AUDIO_TRANSCRIBE_API_KEY` |
| `--base-url` | API base URL | `https://api.openai.com/v1` |
| `--model` | Model name | `gpt-4o-mini-transcribe` |

### Output Options

| Option | Description | Default |
|---|---|---|
| `-o, --output-dir` | Output directory for transcriptions | `./transcriptions` |
| `--segments-dir` | Directory for temporary segments | `./segments` |
| `-f, --response-format` | Output format (text/json/srt/vtt/verbose_json) | `text` |

### Segmentation Options

| Option | Description | Default |
|---|---|---|
| `--segment-length` | Segment length in seconds | `300` (5 min) |
| `--overlap` | Overlap between segments in seconds | `3` |

### Language Options

| Option | Description | Default |
|---|---|---|
| `--language` | ISO-639-1 language code (e.g., 'en', 'de') | Auto-detect |
| `--detect-language` | Auto-detect language from first segment | `true` |
| `--no-detect-language` | Disable language auto-detection | - |
| `--temperature` | Model temperature (0.0-1.0) | `0.0` |
| `--prompt` | Context prompt for better accuracy | None |

### Performance Options

| Option | Description | Default |
|---|---|---|
| `-c, --concurrency` | Number of parallel transcriptions | `8` |

### Diarization Options

| Option | Description | Default |
|---|---|---|
| `--enable-diarization` | Enable speaker diarization | `false` |
| `--num-speakers` | Expected number of speakers | Auto-detect |
| `--known-speaker-names` | List of known speaker names | None |
| `--known-speaker-references` | Paths to reference audio files | None |

### Summarization Options

| Option | Description | Default |
|---|---|---|
| `--summarize` | Generate a summary of the transcription | `false` |
| `--summary-dir` | Output directory for summaries | `./summaries` |
| `--summary-model` | Model for summarization | `gpt-4.1-mini` |
| `--summary-prompt` | Custom prompt for summary generation | See code |

### Export Options

| Option | Description | Default |
|---|---|---|
| `--export` | Export to formats (docx, md, latex) | None |
| `--export-dir` | Output directory for exports | `./exports` |
| `--export-title` | Title for exported documents | Filename |
| `--export-author` | Author name for exported documents | None |

### Processing Options

| Option | Description | Default |
|---|---|---|
| `--no-keep-segments` | Delete temporary segment files after processing | - |
| `--skip-existing` | Skip files if output already exists | `false` |
| `--analyze-duration` | Analyze audio duration before processing (slower, better ETA) | `false` |
| `--dry-run` | Simulate processing without API calls | `false` |
| `-v, --verbose` | Enable verbose logging | `false` |
Note: By default, segments are kept and files are re-processed even if outputs exist.
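Both defaults can be flipped explicitly when re-running over a large directory:

```bash
# Skip finished files and clean up segments as you go
audio-transcriber --input ./audio_files --skip-existing --no-keep-segments
```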
### Environment Variables

```bash
export AUDIO_TRANSCRIBE_API_KEY="sk-..."
export AUDIO_TRANSCRIBE_BASE_URL="https://api.openai.com/v1"
export AUDIO_TRANSCRIBE_MODEL="gpt-4o-mini-transcribe"
export AUDIO_TRANSCRIBE_OUTPUT_DIR="./transcriptions"
export AUDIO_TRANSCRIBE_SEGMENT_LENGTH="300"
export AUDIO_TRANSCRIBE_OVERLAP="3"
export AUDIO_TRANSCRIBE_CONCURRENCY="8"
```

## Output Formats

### Text (default)

```
This is the transcribed audio content. It's clean and readable.
```
### JSON

```json
{
  "text": "Full transcription text",
  "segments": [...],
  "language": "en"
}
```

### SRT

```
1
00:00:00,000 --> 00:00:05,200
First subtitle line

2
00:00:05,200 --> 00:00:10,500
Second subtitle line
```

### VTT

```
WEBVTT

00:00:00.000 --> 00:00:05.200
First subtitle line

00:00:05.200 --> 00:00:10.500
Second subtitle line
```
## Advanced Features

### Batch Processing

```bash
# Process entire directories
audio-transcriber --input ./100_podcasts --concurrency 8
```

### Resume Capability

```bash
# Automatically skips already processed files
audio-transcriber --input ./audio_files

# Interrupt with Ctrl+C
audio-transcriber --input ./audio_files  # Resumes from where it left off
```

### Dry Run

```bash
# Test configuration without API calls
audio-transcriber --input large_file.mp3 --dry-run
```
### Speaker Diarization

```bash
# Enable speaker diarization
audio-transcriber \
  --input meeting.mp3 \
  --enable-diarization

# With expected number of speakers
audio-transcriber \
  --input podcast.mp3 \
  --enable-diarization \
  --num-speakers 2

# With known speaker names and reference audio
audio-transcriber \
  --input interview.mp3 \
  --enable-diarization \
  --known-speaker-names "Alice Smith" "Bob Johnson" \
  --known-speaker-references alice_voice.wav bob_voice.wav
```
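Diarization composes with the other flags; a hypothetical run labeling three speakers and exporting straight to Word (all flags as documented above, file name illustrative):

```bash
audio-transcriber \
  --input panel.mp3 \
  --enable-diarization \
  --num-speakers 3 \
  --export docx \
  --export-title "Panel Discussion"
```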
### AI Summarization

```bash
# Generate summary of transcription
audio-transcriber \
  --input lecture.mp3 \
  --summarize

# Custom summary model and prompt
audio-transcriber \
  --input podcast.mp3 \
  --summarize \
  --summary-model gpt-4o \
  --summary-prompt "Summarize the key points and action items"
```
### Document Export

```bash
# Export to Word document
audio-transcriber \
  --input meeting.mp3 \
  --export docx

# Export to multiple formats with metadata
audio-transcriber \
  --input interview.mp3 \
  --export docx md latex \
  --export-title "Company Interview 2026" \
  --export-author "John Doe"
```

### Alternative Providers

**Groq (Fast):**
```bash
audio-transcriber \
  --api-key "gsk_..." \
  --base-url "https://api.groq.com/openai/v1" \
  --model "whisper-large-v3" \
  --input podcast.mp3
```

**Together.ai:**

```bash
audio-transcriber \
  --api-key "..." \
  --base-url "https://api.together.xyz/v1" \
  --model "whisper" \
  --input podcast.mp3
```
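LocalAI (listed among the compatible providers above) should work the same way; the endpoint and model name below are assumptions based on LocalAI's usual defaults, so adjust them to your deployment:

```bash
audio-transcriber \
  --api-key "local" \
  --base-url "http://localhost:8080/v1" \
  --model "whisper-1" \
  --input podcast.mp3
```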
## Development

### Setup

```bash
# Clone and install
git clone https://github.com/lucmuss/audio-transcriber.git
cd audio-transcriber

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Setup project with uv
just setup

# Install pre-commit hooks
uv run pre-commit install
```
### Common Tasks

```bash
# Start development environment (runs docker/entrypoint.sh)
just dev

# Format and fix code
just format

# Check code quality (lint + format check)
just lint

# Run tests
just test

# Run complete quality check (lint + typecheck + test)
just check

# Clean artifacts
just clean
```
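`just --list` (a built-in `just` flag) shows every available recipe if you forget one:

```bash
just --list
```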
### Testing

```bash
# Run all tests
uv run pytest

# With coverage
uv run pytest --cov=audio_transcriber --cov-report=html

# Specific test file
uv run pytest tests/test_utils.py

# Type check
uv run mypy src
```

## Costs

- Cost: $0.0001 per minute (as of Jan 2026)
- Example: 60-minute podcast ≈ $0.006 (60 min × $0.0001/min)
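The same estimate for any duration is plain arithmetic at the rate above:

```bash
# Cost in dollars for N minutes at $0.0001/min
minutes=90
awk -v m="$minutes" 'BEGIN { printf "$%.4f\n", m * 0.0001 }'   # $0.0090
```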
### Optimization Tips

- **Increase Concurrency** (if API limits allow): `--concurrency 8`
- **Adjust Segment Length** (larger segments mean fewer API calls): `--segment-length 900` (15 minutes)
- **Use Local Models** (free & unlimited): Ollama, LocalAI - no costs, faster on local hardware
- **Batch Processing** (process multiple files efficiently): `audio-transcriber --input ./folder_with_100_files`
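Combining those tips into one invocation (documented flags only):

```bash
audio-transcriber \
  --input ./folder_with_100_files \
  --segment-length 900 \
  --concurrency 8 \
  --skip-existing
```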
## Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests
5. Run quality checks
6. Commit (`git commit -m 'feat: add amazing feature'`)
7. Push (`git push origin feature/amazing-feature`)
8. Create a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

Built with:

- OpenAI Python Client - API client
- pydub - Audio processing
- tqdm - Progress bars

## Support

- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: README

Made with ❤️ for the open-source community