TikTok-Style Video Captioning

A Node.js/TypeScript pipeline for processing vertical videos (1080x1920) with:

OpenAI Whisper transcription (word-level timestamps)
Filler word and pause detection/removal
GPT-4o emphasis word detection
Remotion-powered TikTok-style animated captions (word-by-word pop-in, green highlights)

Requirements

System Dependencies

Node.js 18+ and npm
FFmpeg - required for audio extraction and video cutting
- macOS: brew install ffmpeg
- Ubuntu/Debian: sudo apt install ffmpeg
- Windows: Download from ffmpeg.org or use choco install ffmpeg

API Keys

OpenAI API key with access to:
- Whisper API - for audio transcription with word-level timestamps
- GPT-4o - for emphasis word detection

Automatic Dependencies

Chromium - Remotion automatically downloads a compatible Chromium version on first render (no manual install needed)

Installation

npm install
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

Usage

Quick Start

Place a video in the input/ folder (e.g., input/test.mp4)
Run the first stage (transcription + filler analysis):
```
npm run pipeline
```
Review and edit the JSON files in data/ to adjust cuts and emphasis words
Run the second stage (emphasis detection + video cutting + rendering):
```
npm run continue
```
Find your captioned video in output/ (e.g., output/test_captioned.mp4)

Individual Scripts

Run each step separately for more control:

npm run transcribe        # Step 1: Whisper transcription
npm run analyze-fillers   # Step 2: Detect fillers and pauses
npm run detect-emphasis   # Step 3: GPT-4o emphasis detection
npm run generate-timing   # Step 4: Adjust timestamps for cuts
npm run cut-video         # Step 5: FFmpeg segment removal
npm run render            # Step 6: Remotion rendering

Preview

Preview the Remotion composition before rendering:

npm run preview

Project Structure

├── package.json              # Dependencies & npm scripts
├── tsconfig.json             # TypeScript config
├── remotion.config.ts        # Remotion config
├── .env                      # OPENAI_API_KEY (create from .env.example)
├── input/                    # Place input videos here
├── output/                   # Final captioned videos
├── data/                     # Editable JSON files (review step)
│   ├── 01_transcription.json
│   ├── 02_filler_analysis.json
│   ├── 03_cuts.json
│   ├── 04_emphasis.json
│   └── 05_caption_timing.json
├── temp/                     # Intermediate files
└── src/
    ├── types.ts              # TypeScript interfaces
    ├── config.ts             # Pipeline configuration
    ├── scripts/
    │   ├── 01-transcribe.ts      # Whisper API transcription
    │   ├── 02-analyze-fillers.ts # Detect fillers/pauses
    │   ├── 03-detect-emphasis.ts # GPT-4o emphasis detection
    │   ├── 04-generate-timing.ts # Adjust timestamps for cuts
    │   ├── 05-cut-video.ts       # FFmpeg segment removal
    │   └── 06-render.ts          # Remotion rendering
    └── remotion/
        ├── index.ts
        ├── Root.tsx              # Composition definition
        ├── Video.tsx             # Video + captions overlay
        ├── Captions.tsx          # Page-based caption display
        ├── Word.tsx              # Spring pop-in animation
        └── styles.ts             # TikTok-style fonts/colors

Workflow Details

Step 1: Transcription

Extracts audio from the video and sends it to OpenAI Whisper API with word-level timestamps.

Step 2: Filler Analysis

Detects filler words (um, uh, etc.) and long pauses. Generates suggested cuts that you can review and edit in data/03_cuts.json.

Step 3: Emphasis Detection

Uses GPT-4o to identify impactful words (15-25% of total) that should be highlighted in the captions.

Step 4: Generate Timing

Adjusts word timestamps based on the cuts and paginates words into caption pages (4 words per page by default).

Step 5: Cut Video

Uses FFmpeg to remove the marked segments (fillers and long pauses) from the video.

Step 6: Render

Uses Remotion to render the final video with animated captions overlaid.

Configuration

Edit src/config.ts to customize:

Filler word patterns: Words detected as fillers
Pause thresholds: When to detect/auto-remove pauses (default: detect at 500ms, auto-remove at 1000ms)
Caption styling: Colors, fonts, positioning
Words per page: Number of words shown at once (default: 4)

Caption Styling

Font: Montserrat 72px bold, uppercase
Normal words: White (#FFFFFF)
Emphasis words: TikTok green (#00FF88) with glow effect
Animation: Spring pop-in effect per word
Position: Bottom 20% of screen

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
src		src
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json
remotion.config.ts		remotion.config.ts
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TikTok-Style Video Captioning

Requirements

System Dependencies

API Keys

Automatic Dependencies

Installation

Usage

Quick Start

Individual Scripts

Preview

Project Structure

Workflow Details

Step 1: Transcription

Step 2: Filler Analysis

Step 3: Emphasis Detection

Step 4: Generate Timing

Step 5: Cut Video

Step 6: Render

Configuration

Caption Styling

License

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

ozten/CaptionsPlease

Folders and files

Latest commit

History

Repository files navigation

TikTok-Style Video Captioning

Requirements

System Dependencies

API Keys

Automatic Dependencies

Installation

Usage

Quick Start

Individual Scripts

Preview

Project Structure

Workflow Details

Step 1: Transcription

Step 2: Filler Analysis

Step 3: Emphasis Detection

Step 4: Generate Timing

Step 5: Cut Video

Step 6: Render

Configuration

Caption Styling

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages