ssaTranscribe is an open-source, self-hosted pipeline that delivers audio-to-text transcription, speaker diarization, voice matching, multilingual vocabulary support, and keyword detection in end-to-end transcriptions of episodic media.
Developed to support the expansion of the Sesame Street Archive (SSA), ssaTranscribe's primary goal is to transcribe the 4,397 original Sesame Street episodes aired between 1969 and 2018, enabling a multimodal approach to scaling the SSA from ~5,000 to 35,000 labeled images across face, number, place, and word object categories.
By surfacing rich dialogue from the world's longest-running children's educational television series, ssaTranscribe identifies visually imaginative, stylistically diverse scenes that reflect children's experiences of wonder, play, and embodied learning, strategically curating the magical content behind the SSA.
This directory tree outlines the top-level folders and essential files included in ssaTranscribe:
```
ssaTranscribe/
├── data/                                # Input and output data directories
│   ├── raw/                             # Original unprocessed audio files
│   └── processed/                       # Audio clips segmented and ready for ASR
├── models/                              # Pretrained model weights
│   └── asr_model.pt                     # Whisper or other ASR model file
├── scripts/                             # Core logic for each pipeline step
│   ├── transcriptionwithspeakersv1.py   # Main script for processing audio to text
│   ├── diarize.py                       # Performs speaker diarization
│   └── postprocess.py                   # Formats and cleans transcripts
├── configs/                             # YAML configuration files
│   └── default.yaml                     # Default settings for pipeline runs
├── README.md                            # Project overview and usage
├── requirements.txt                     # Python package dependencies
└── run_pipeline.sh                      # End-to-end pipeline entry point
```
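For orientation, the sketch below shows what the ASR step performed by transcriptionwithspeakersv1.py might look like, assuming the openai-whisper package and a Whisper-compatible checkpoint at models/asr_model.pt. The file paths and the helper name transcribe_clip are illustrative assumptions, not the project's actual API.

```python
# Illustrative sketch only: assumes the openai-whisper package (pip install openai-whisper)
# and that models/asr_model.pt is a Whisper checkpoint. Paths and the helper name
# transcribe_clip are hypothetical, not part of the repository.
import json
from pathlib import Path

import whisper  # requires ffmpeg on PATH for audio decoding


def transcribe_clip(audio_path: Path, out_dir: Path, model_path: str = "models/asr_model.pt") -> dict:
    """Transcribe one processed clip and write plain-text and segment-level JSON output."""
    model = whisper.load_model(model_path)      # accepts an official model name or a checkpoint path
    result = model.transcribe(str(audio_path))  # returns {"text": ..., "segments": [...], ...}

    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{audio_path.stem}.txt").write_text(result["text"], encoding="utf-8")
    (out_dir / f"{audio_path.stem}.json").write_text(
        json.dumps(result["segments"], indent=2), encoding="utf-8"
    )
    return result


if __name__ == "__main__":
    transcribe_clip(Path("data/processed/episode_0001.wav"), Path("outputs/transcripts"))
```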
This table maps the flow of data through the ssaTranscribe pipeline, from original Sesame Street episode to readable, searchable output:
| Stage | Step | Input Format | Process (File / Script) | Output Format | Output Location |
|---|---|---|---|---|---|
| 1 | Audio Ingestion | .wav, .mp3 | Raw input in data/raw/ | (same) | data/raw/ |
| 2 | Segmentation / Prep | .wav, .mp3 | scripts/segment.py (if used) | .wav (cleaned/trimmed) | data/processed/ |
| 3 | Transcription (ASR) | .wav | scripts/transcribe.py using models/asr_model.pt | .txt, .json | outputs/transcripts/ |
| 4 | Diarization (Optional) | .wav | scripts/diarize.py | .json speaker tags | outputs/transcripts/ |
| 5 | Postprocessing | .json, .txt | scripts/postprocess.py | .srt, cleaned .json or .txt | outputs/transcripts/ |
| 6 | Configuration | .yaml | configs/default.yaml | (used throughout) | n/a |
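The repository's diarize.py is not reproduced here, but the sketch below shows one common way stage 4's .json speaker tags could be generated. It assumes pyannote.audio and its pyannote/speaker-diarization-3.1 pipeline (which requires a Hugging Face access token); these library choices, the episode filename, and the output path are assumptions for illustration, not a description of the actual script.

```python
# Hypothetical sketch of stage 4 (diarization), assuming pyannote.audio rather than
# the repository's actual diarize.py. Filenames and output paths are illustrative.
import json
from pathlib import Path

from pyannote.audio import Pipeline

# Requires accepting the model's terms on Hugging Face and supplying an access token.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN"
)

audio_path = Path("data/processed/episode_0001.wav")
diarization = pipeline(str(audio_path))

# Convert speaker turns into the kind of JSON speaker tags stage 4 produces.
speaker_tags = [
    {"speaker": speaker, "start": round(turn.start, 2), "end": round(turn.end, 2)}
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]

out_dir = Path("outputs/transcripts")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / f"{audio_path.stem}.speakers.json").write_text(
    json.dumps(speaker_tags, indent=2), encoding="utf-8"
)
```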
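Similarly, stage 5's conversion from segment-level JSON to .srt could look like the following. This is a sketch assuming Whisper-style segments with start, end, and text fields; the helper names are hypothetical rather than taken from postprocess.py.

```python
# Hypothetical sketch of stage 5 (postprocessing): segment JSON -> SubRip (.srt).
# Assumes Whisper-style segment dicts with "start", "end", and "text" keys.
import json
from pathlib import Path


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,345."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(json_path: Path, srt_path: Path) -> None:
    """Write numbered SRT cues from a list of {"start", "end", "text"} segments."""
    segments = json.loads(json_path.read_text(encoding="utf-8"))
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines += [
            str(i),
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}",
            seg["text"].strip(),
            "",
        ]
    srt_path.write_text("\n".join(lines), encoding="utf-8")


if __name__ == "__main__":
    segments_to_srt(
        Path("outputs/transcripts/episode_0001.json"),
        Path("outputs/transcripts/episode_0001.srt"),
    )
```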
Required GPU Type:
- .
Minimum CPU and RAM Specifications:
- .
Disk Space Requirements:
- .
Estimated Runtime Per Episode (With/Without GPU):
- .
Python Version:
- .
Environment Setup Steps:
- .
Required Python Packages:
- .
Pretrained Model Download Instructions:
- .
System-Level Dependencies:
- .
Main Configuration File Location:
- .
User-Editable Parameters:
- .
Required Environment Variables:
- .
Command-Line Arguments and Flags:
- .
Full Pipeline Execution Command:
- .
Running Individual Pipeline Components:
- .
Expected Input and Output File Types:
- .
Default Output Directory and File Locations:
- .