Transcriber is a local audio transcription tool powered by WhisperX. It transcribes recordings with word-level timestamps and optional speaker diarization. Everything runs locally — no audio ever leaves your machine.
It ships as three complementary interfaces built on the same ML core: a React web application served by a FastAPI inference server, a command-line tool for batch processing, and an importable Python client library designed for GUI apps and remote-machine workflows.
- WhisperX-powered ASR — automatic language detection, supports `tiny` through `large-v3` model sizes
- Speaker diarization — pyannote identifies who said what; word-level forced alignment runs automatically; custom re-segmentation ensures every segment belongs to exactly one speaker; graceful fallback on failure
- React web UI — drag-and-drop upload, browser microphone recording, real-time SSE progress, three transcript views (segments, script, plain text), copy/download menus, dark mode, PWA-ready
- Multiple server interfaces — REST multipart and raw-bytes endpoints, SSE streaming variants, WebSocket
- Markdown & JSON output — speaker-grouped transcript ready to paste into an LLM for summarization
- Lightweight client library — non-blocking microphone recorder + REST/SSE/WebSocket sender; no ML dependency; designed for PySide6 / Tkinter / script workflows
- CPU or CUDA — auto-detected compute type; float16/int8/float32 precision
- Fully offline after first run — pre-download all models with `trans-cache`
- AWS SageMaker–ready — container satisfies the BYOC contract out of the box (`/ping`, `/invocations`, port 8080); supports real-time and asynchronous endpoints
Run the inference server once, open a browser, and you're done. The built React UI is served automatically.
```bash
uv sync --extra server
uv run trans-server   # GPU + large-v3 on 0.0.0.0:8080
# open http://localhost:8080
```

Drag-and-drop an audio file (or record straight from the browser microphone), tick Diarize if you want speaker labels, and hit Transcribe. Results stream in live via SSE.
```bash
uv sync --extra cli
uv run trans-cli meeting.mp3 --diarize -f md
```

Deploy the server on a GPU machine (or SageMaker) and use the lightweight client library from any laptop — no ML stack required on the client side.
```bash
# On the GPU machine:
uv sync --extra server ; uv run trans-server

# On the remote machine:
uv sync --extra client
```

```python
from transcriber.client import start_recording, stop_recording, transcribe_sse

start_recording()
input("Recording… press Enter to stop")
wav = stop_recording()
result = transcribe_sse(wav, url="http://gpu-machine:8080", diarize=True)
print(result["transcript"])
```

Build the Docker image once, push to ECR, and deploy as a real-time or asynchronous endpoint. See SAGEMAKER.md for the full guide.
The code is split into six sub-packages with a strict layering rule: ML model calls stay in `pipeline/`, data movement stays in `io/`, shared types and config stay in `core/`, user-facing interfaces stay in `cli/` and `server/`, and the lightweight network client lives in `client/`. This means replacing WhisperX with a different backend only touches `pipeline/`.
| Package | What it contains | Key responsibility |
|---|---|---|
| `core/` | Config dataclass, Pydantic result/segment models | Single source of truth for settings and data shapes shared across all other packages |
| `io/` | Audio loader, transcript writer | Format-level I/O with no model dependency — decodes audio, writes Markdown/JSON |
| `pipeline/` | Transcription orchestrator, alignment, diarization | All ML inference; owns the WhisperX and pyannote calls |
| `cli/` | Argument parser, Rich display, entry point, cache manager | User interface only; delegates everything to `pipeline/` and `io/` |
| `server/` | FastAPI app, routes, inference worker | HTTP/WebSocket server; single-GPU queue-based inference |
| `client/` | Microphone recorder, REST/WS sender | Start/stop recording API designed for GUI apps and scripts; no ML dependency |
`TranscriptionPipeline` (in `pipeline/`) sequences the stages below. The Whisper model is loaded once on the first call and reused, so processing multiple files in one session is cheap.
- Audio decoding (`io/`) — any supported container is decoded into a 16 kHz mono float32 array. This is the exact format WhisperX requires, so no conversion happens inside the model code.
- ASR transcription (`pipeline/`) — WhisperX batch inference. Produces text segments with coarse sentence-level timestamps.
- Speaker diarization (`pipeline/`, optional) — when enabled, first runs a forced-alignment pass (wav2vec2-based) to refine coarse timestamps to individual words, then pyannote detects speaker-turn boundaries and WhisperX assigns a speaker label to each word. A custom re-segmentation pass splits any segment that crosses a speaker boundary, so every output segment belongs to exactly one speaker. WhisperX's default majority-vote assignment does not do this split. If diarization fails for any reason, the pipeline falls back to the undiarized transcript rather than crashing.
- Output rendering (`io/`) — the typed result is serialised to Markdown or JSON. Speaker turns are grouped into labelled blocks with timestamps, ready to paste into an LLM.
Install uv — it manages Python itself, so no separate Python installation is needed:
```bash
# Linux / macOS (bash)
curl -Ls https://astral.sh/uv/install.sh | sh

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

`uv sync` will automatically download and use Python 3.12 as declared in pyproject.toml.
WhisperX and PyAV require FFmpeg libraries at runtime.
Windows:
- Download the full build (not "essentials") from gyan.dev/ffmpeg/builds — get `ffmpeg-release-full.7z`
- Extract to a permanent location, e.g. `C:\ffmpeg`
- Add the `bin` folder to your system `PATH`:

```powershell
# Run as Administrator
[Environment]::SetEnvironmentVariable(
    "Path",
    [Environment]::GetEnvironmentVariable("Path", "Machine") + ";C:\ffmpeg\bin",
    "Machine"
)
```

- Verify: `ffmpeg -version`

Linux / macOS:

```bash
sudo apt install ffmpeg   # Ubuntu/Debian
brew install ffmpeg       # macOS
```

For GPU-accelerated transcription:
- NVIDIA GPU with compute capability >= 7.0 (for float16)
- NVIDIA CUDA Toolkit
- Latest NVIDIA GPU drivers
Verify:

```bash
nvidia-smi
```

CPU mode works but is significantly slower (~10x).
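The "auto" compute type mentioned in the features list can be resolved with a check like the following — an illustrative sketch of the usual convention (float16 on CUDA, int8 on CPU), not necessarily Transcriber's exact logic:

```python
def resolve_compute_type(device: str, compute_type: str = "auto") -> str:
    """Pick a precision when the user asked for 'auto'."""
    if compute_type != "auto":
        return compute_type  # explicit choice wins
    # float16 needs a CUDA GPU; int8 is the usual CPU fallback
    return "float16" if device == "cuda" else "int8"


def detect_device() -> str:
    """Prefer CUDA when a GPU is visible; torch is only present
    when the ML extras are installed, hence the deferred import."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"
```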
```bash
git clone https://github.com/galuszkm/transcriber.git
cd transcriber

# CLI transcription (includes ML engine)
uv sync --extra cli

# HTTP/WebSocket server (includes ML engine)
uv sync --extra server

# Microphone recording client (lightweight, no ML)
uv sync --extra client

# Everything (recommended)
uv sync --all-extras

# Dev tools + git hooks
uv run just install
```

All configuration is managed through a `.env` file in the project root. The app loads it automatically via pydantic-settings. CLI arguments always override `.env` values.
```bash
# --- Shared settings (used by all commands) ---

# HuggingFace token (required only for speaker diarization)
HF_TOKEN=hf_your_token_here

# Whisper model size (default: large-v3)
MODEL=large-v3

# Compute device: cpu or cuda
DEVICE=cuda

# Enable speaker diarization (default: false)
DIARIZE=false

# Comma-separated language codes for cache pre-download (default: all)
# Only used by trans-cache; inference always auto-detects.
LANGUAGE=en,pl,de

# Override default cache directory (default: <cwd>/.cache)
CACHE_DIR=D:\models\cache
```

All models and data are stored under a single cache root (default `.cache/`):
```
.cache/
├── huggingface/          # HuggingFace model weights
│   ├── hub/              # Whisper CTranslate2 + wav2vec2 alignment models
│   └── datasets/
├── torch/                # PyTorch hub models (torchaudio alignment models)
│   └── hub/checkpoints/
├── nltk/                 # NLTK tokenizer data
│   └── tokenizers/punkt_tab/
└── whisperx/             # Diarization cache (pyannote)
```
Diarization models are free and MIT-licensed, but gated (accept terms once):
- Create a free account at huggingface.co
- Create a Read token at huggingface.co/settings/tokens
- Visit pyannote/speaker-diarization-community-1 and click "Agree and access repository"
- Add the token to `.env`
Models are downloaded once on first use and cached locally. Fully offline after that.
Pre-download all models for fully offline operation. Reads defaults from .env (MODEL, LANGUAGE, DEVICE, DIARIZE) — CLI flags override. When no languages are specified, all languages with alignment models are cached.
```bash
# Download models using .env defaults (all languages)
uv run trans-cache

# Check what's already cached (no downloads)
uv run trans-cache --check

# Download specific models and languages
uv run trans-cache --models base large-v3 --languages en pl de zh

# Cache everything: all models, all languages, diarization
uv run trans-cache --all

# Include diarization models (requires HF_TOKEN in .env)
uv run trans-cache --diarize

# Custom cache directory
uv run trans-cache --cache-dir D:\models\cache
```

| Flag | Description | Default (from .env) |
|---|---|---|
| `--models SIZE [...]` | Whisper model sizes | MODEL or large-v3 |
| `--languages LANG [...]` | ISO language codes for alignment | LANGUAGE or all available |
| `--device cpu\|cuda` | Device for model loading | DEVICE or cpu |
| `--diarize` | Also download diarization models | DIARIZE or false |
| `--all` | Cache all models, all languages, diarization | — |
| `--cache-dir PATH` | Override cache directory | CACHE_DIR or .cache |
| `--check` | Report cache status, download nothing | — |
Sample cache report:

```
✓ NLTK punkt_tab
✓ Whisper large-v3 (Systran/faster-whisper-large-v3)
✓ align en (WAV2VEC2_ASR_BASE_960H, torchaudio)
✓ align pl (jonatasgrosman/wav2vec2-large-xlsr-53-polish)
✓ align de (VOXPOPULI_ASR_BASE_10K_DE, torchaudio)
✗ align zh (jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn)
```
```bash
# GPU with large model (default)
uv run trans-cli recording.wav

# CPU with smaller model
uv run trans-cli recording.mp3 -m base -d cpu

# With speaker detection
uv run trans-cli meeting.mp3 --diarize

# JSON output + custom output path
uv run trans-cli meeting.mp3 -f json -o ./transcripts/meeting
```

| Flag | Description | Default |
|---|---|---|
| `-m, --model` | Model size: tiny, base, small, medium, large-v2, large-v3 | large-v3 |
| `-d, --device` | Compute device (cpu, cuda) | cuda |
| `-c, --compute-type` | Precision: float16, int8, float32, auto | auto |
| `-f, --format` | Output format: md, json | md |
| `-o, --output` | Output path without extension | input filename |
| `-b, --batch-size` | Inference batch size | 16 |
| `--diarize` | Enable speaker diarization | off |
| `--hf-token` | HuggingFace token (overrides HF_TOKEN env var) | from .env |
| `--cache-dir` | Cache directory | CACHE_DIR or .cache |
.wav, .mp3, .flac, .ogg, .m4a, .wma, .aac, .mp4, .webm
| Model | Parameters | VRAM | Relative speed | Best for |
|---|---|---|---|---|
| `tiny` | 39M | ~1 GB | fastest | Quick tests |
| `base` | 74M | ~1 GB | fast | Development, English-only |
| `small` | 244M | ~2 GB | moderate | Good balance |
| `medium` | 769M | ~5 GB | slow | Better multilingual |
| `large-v2` | 1550M | ~10 GB | slowest | High accuracy |
| `large-v3` | 1550M | ~10 GB | slowest | Best accuracy, especially non-English |
HTTP/WebSocket server for integrating transcription into other applications. Runs a single-GPU inference worker with a FIFO queue.
```bash
# Start server (defaults: cuda, large-v3, 0.0.0.0:8080)
uv run trans-server

# Custom port and model
uv run trans-server --port 9876 -m base

# CPU mode
uv run trans-server -d cpu --port 9876
```

| Flag | Description | Default |
|---|---|---|
| `--host` | Bind address | 0.0.0.0 |
| `--port` | Port number | 8080 |
| `--prefix PATH` | URL prefix for routes behind a reverse proxy | "" |
| `-m, --model` | Whisper model size | large-v3 |
| `-d, --device` | Compute device | cuda |
| `-c, --compute-type` | Precision | auto |
| `-b, --batch-size` | Batch size | 16 |
| `--hf-token` | HuggingFace token | from .env |
| `--cache-dir` | Cache directory | .cache |
`GET /health`
Returns model status, device, and queue size.
Multiple input methods, all returning the same `TranscribeResponse`:

| Endpoint | Content-Type | Input |
|---|---|---|
| `POST /transcribe` | `multipart/form-data` | `file` field (audio file) or `audio_base64` field |
| `POST /transcribe/json` | `application/json` | `{"audio_base64": "...", "diarize": false}` |
| `POST /transcribe/raw` | `application/octet-stream` | Raw audio bytes |

All endpoints accept a `?diarize=true` query parameter (or a `diarize` field in the JSON body).
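A minimal client call against the multipart endpoint might look like this sketch (assuming the third-party `requests` package and a server on localhost; the helper names are illustrative):

```python
def build_transcribe_request(url: str, diarize: bool) -> dict:
    """Assemble the URL and query parameters for POST /transcribe."""
    return {
        "url": f"{url.rstrip('/')}/transcribe",
        "params": {"diarize": str(diarize).lower()},
    }


def transcribe_file(path: str, url: str = "http://localhost:8080",
                    diarize: bool = False) -> dict:
    """POST an audio file to /transcribe and return the parsed response."""
    import requests  # deferred so this module imports without requests installed
    req = build_transcribe_request(url, diarize)
    with open(path, "rb") as f:
        resp = requests.post(
            req["url"],
            params=req["params"],
            files={"file": f},
            timeout=600,  # transcription of long audio can take a while
        )
    resp.raise_for_status()
    return resp.json()
```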
Response:
```json
{
  "transcript": "Full text...",
  "segments": [
    {"start": 0.5, "end": 2.1, "text": "Hello everyone.", "speaker": "SPEAKER_00"}
  ],
  "language": "en",
  "duration": 45.2,
  "timings": {"model_load": 0.01, "transcribe": 3.2}
}
```

Same input methods with a /stream suffix — returns Server-Sent Events with progress updates:
| Endpoint | Input |
|---|---|
| `POST /transcribe/stream` | Multipart or base64 form |
| `POST /transcribe/json/stream` | JSON body |
| `POST /transcribe/raw/stream` | Raw bytes |
SSE events:

```
event: progress
data: {"stage": "transcribe", "message": "Transcribing speech"}

event: complete
data: {"transcript": "...", "segments": [...], ...}
```
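Consuming the stream by hand comes down to pairing `event:` lines with their `data:` payloads. A minimal parser sketch (handles only the two event types shown above, not the full SSE spec — no multi-line data, comments, or retry fields):

```python
import json


def parse_sse(lines):
    """Yield (event, data_dict) pairs from an iterable of SSE lines."""
    event = None
    for line in lines:
        line = line.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:") and event:
            yield event, json.loads(line[len("data:"):].strip())
            event = None  # reset until the next event: line
```

In practice you would feed it something like `resp.iter_lines(decode_unicode=True)` from a streaming HTTP response, stopping once a `complete` event arrives.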
`WS /ws/transcribe?diarize=true`
Send binary audio frame → receive JSON response. Connection stays open for multiple exchanges.
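A round-trip over the WebSocket endpoint might look like this sketch (assuming the third-party `websockets` package; the `ws_url` helper is illustrative):

```python
import asyncio
import json


def ws_url(base: str, diarize: bool = False) -> str:
    """Derive the WebSocket endpoint from an http(s) base URL."""
    scheme = "wss" if base.startswith("https") else "ws"
    host = base.split("://", 1)[1].rstrip("/")
    return f"{scheme}://{host}/ws/transcribe?diarize={str(diarize).lower()}"


async def transcribe_over_ws(wav_bytes: bytes, url: str) -> dict:
    import websockets  # deferred so this module imports without websockets installed
    # max_size=None lifts the default frame-size cap for large audio files
    async with websockets.connect(url, max_size=None) as ws:
        await ws.send(wav_bytes)   # one binary frame per request
        reply = await ws.recv()    # JSON response as text
        return json.loads(reply)

# result = asyncio.run(transcribe_over_ws(wav, ws_url("http://localhost:8080", diarize=True)))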
The server automatically serves a built React application when the server/static/ directory is present. Build it once before starting the server:
```bash
# Windows
.\ui\build.bat

# Linux / macOS
./ui/build.sh
```

Then start the server normally (`uv run trans-server`). Open http://localhost:8080 in a browser.
| Feature | Details |
|---|---|
| File upload | Drag-and-drop or click-to-browse; accepts all supported audio formats |
| Microphone recording | Uses the browser's MediaRecorder API; shows elapsed timer + REC badge; auto-stops at 5 minutes |
| Audio preview | Inline HTML5 player shown once a file is selected |
| Options | Diarize checkbox, Auto-copy toggle |
| SSE progress | Live progress messages stream in during transcription (load → transcribe → align → diarize) |
| Transcript views | Three tabs: Segments (timestamped table with colour-coded speakers), Script (movie-script grouped by speaker), Plain text |
| Copy / Download | Menus for plain text, Markdown, segments JSON, full JSON; .txt, .md, .json downloads |
| Dark mode | Light/dark toggle persisted to localStorage; follows AWS Amplify theme |
| PWA | Service-worker registered; can be installed as a standalone app |
The app resolves API base URLs from `window.__SERVER_CONFIG__` injected into index.html at serve time, so it works correctly behind any reverse proxy or non-root prefix.
A lightweight, importable library for recording audio and sending it to the server. No ML dependencies — install with the client extra:
```bash
uv sync --extra client
```

The recording API is non-blocking and designed for GUI apps: `start_recording` returns immediately while audio is captured in a background thread. Call `stop_recording` whenever the user is done, regardless of how much time has passed.
For sending audio to the server, prefer transcribe_sse over transcribe_rest for long recordings — it streams progress events back while the pipeline runs, so the HTTP connection never idles long enough to time out.
```python
from transcriber.client import (
    RecordingError,
    is_recording,
    list_devices,
    start_recording,
    stop_recording,
    transcribe_rest,
)

# List available input devices
print(list_devices())

def on_record_button_clicked():
    if is_recording():
        wav = stop_recording()         # returns WAV bytes immediately
        result = transcribe_rest(wav)  # blocks until server responds
        print(result["transcript"])
    else:
        start_recording(device=2)      # returns immediately, records in background
```

```python
import time

from transcriber.client import start_recording, stop_recording, transcribe_sse

start_recording(max_duration=60.0)  # safety cap; stops automatically at 60 s
time.sleep(30)                      # record for 30 seconds
wav = stop_recording()
result = transcribe_sse(
    wav,
    url="http://localhost:8080",
    diarize=True,
    on_progress=lambda stage, msg: print(f"[{stage}] {msg}"),
)
print(result["transcript"])
```

| Symbol | Description |
|---|---|
| `start_recording(*, device, sample_rate, channels, max_duration)` | Begin capturing; returns immediately |
| `stop_recording()` | Stop and return in-memory WAV bytes |
| `is_recording()` | True if a session is active |
| `list_devices()` | Human-readable list of available audio devices |
| `transcribe_rest(wav_bytes, *, url, diarize, timeout)` | POST to /transcribe; returns parsed result dict |
| `transcribe_sse(wav_bytes, *, url, diarize, timeout, on_progress)` | Stream via SSE; calls `on_progress(stage, message)` for each progress event; returns parsed result dict — prefer over REST for large files |
| `transcribe_ws(wav_bytes, *, url, diarize)` | Send over WebSocket; returns parsed result dict |
| `RecordingError` | Raised on invalid operations (already recording, not recording) |
The server is ready to deploy on AWS SageMaker as a Bring Your Own Container (BYOC) endpoint — port 8080 and 0.0.0.0 binding are the defaults. A Dockerfile and entrypoint script are included at the repository root.
See SAGEMAKER.md for the full deployment guide covering:
- Building and pushing the Docker image to ECR
- Real-time endpoints (≤ 60 s, ≤ 25 MB)
- Asynchronous endpoints (up to 1 hour, up to 1 GB payload)
- boto3 examples for model creation, endpoint deployment, and invocation
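Invoking a deployed real-time endpoint follows the standard SageMaker runtime pattern. This is a hedged sketch — the endpoint name is a placeholder, and it assumes the container's `/invocations` handler accepts raw audio bytes the way the `/transcribe/raw` REST endpoint does; see SAGEMAKER.md for the authoritative examples.

```python
import json


def invoke_endpoint(audio_bytes: bytes,
                    endpoint_name: str = "transcriber-endpoint") -> dict:
    """Send raw audio to a real-time SageMaker endpoint and parse the JSON reply."""
    import boto3  # deferred so this module imports without boto3 installed
    runtime = boto3.client("sagemaker-runtime")
    resp = runtime.invoke_endpoint(
        EndpointName=endpoint_name,          # placeholder name
        ContentType="application/octet-stream",
        Body=audio_bytes,
    )
    return json.loads(resp["Body"].read())
```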
Run `uv run just` to list all available commands.
Two commands cover everything before a commit:
```bash
uv run just check   # format + lint + type check + security + pre-commit hooks
uv run just test    # run the test suite with coverage
```

| Command | Description |
|---|---|
| `uv run just install` | Install all deps + git hooks |
| `uv run just check` | Run all checks (format, lint, types, security, hooks, audit) |
| `uv run just test` | Run tests with coverage |
| `uv run just format` | Auto-format code (imports + source) |
| `uv run just clean` | Remove build/cache/coverage artifacts |
| `uv run just commit-files` | Create a conventional commit interactively |