Breaks on vtt file downloaded from YouTube #439

zellyn · 2025-02-07T13:54:23Z

Version: whatever uvx fetches by default right now, presumably 1.1.1.

I'm having trouble parsing a vtt file downloaded from YouTube, using a URL that yt-dlp gave me.

The full file can be fetched with:

curl -o en.vtt 'https://www.youtube.com/api/timedtext?v=qRBLpBxkldc&ei=A3alZ8TiCrrFy_sPsqXh2Qw&caps=asr&opi=112496729&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1738922099&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=CDC65F7C2C9C910453D8D6CB48688281B95B3FB4.B99D5AA37DE09A33776B24C1EA4B43D88909E285&key=yt8&kind=asr&lang=en&fmt=vtt'

This snippet of just the first few lines is enough to provoke the problem:

WEBVTT
Kind: captions
Language: en

00:00:00.799 --> 00:00:02.869 align:start position:0%

hi<00:00:01.040><c> everyone</c><00:00:01.920><c> today</c><00:00:02.240><c> we're</c><00:00:02.399><c> going</c><00:00:02.639><c> to</c><00:00:02.720><c> be</c>

00:00:02.869 --> 00:00:02.879 align:start position:0%
hi everyone today we're going to be


00:00:02.879 --> 00:00:04.070 align:start position:0%
hi everyone today we're going to be
modeling<00:00:03.360><c> a</c><00:00:03.520><c> basic</c>

00:00:04.070 --> 00:00:04.080 align:start position:0%
modeling a basic

Here's my snippet of code:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "ttconv",
# ]
# ///

import ttconv.vtt.reader as vtt_reader

def parse_vtt(filename):
    with open(filename, encoding="utf-8") as f:
        m = vtt_reader.to_model(f)

parse_vtt('en-minimal.vtt')

And here's the error I get:

Traceback (most recent call last):
  File "/Users/zellyn/gh/jogwheel/./parse_subtitles.py", line 15, in <module>
    parse_vtt('en-minimal.vtt')
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/zellyn/gh/jogwheel/./parse_subtitles.py", line 13, in parse_vtt
    m = vtt_reader.to_model(f)
  File "/Users/zellyn/.cache/uv/archive-v0/gFjA1NVZI1iqd7-xVXdBR/lib/python3.13/site-packages/ttconv/vtt/reader.py", line 527, in to_model
    subtitle_text = subtitle_text.strip('\r\n').replace(r"\n\r", "\n")
                    ^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'subtitle_text' where it is not associated with a value

Note: if I edit .../vtt/reader.py and add a line with subtitle_text = "" after the line current_p = None, it succeeds in parsing this snippet. However, if I run it on the full vtt file, it starts disliking the timestamps:

Invalid timestamp tag 00:07:09.759
Invalid timestamp tag 00:07:09.919
Invalid timestamp tag 00:07:10.319
# etc

[Edit]

Hmmm. That URL appears to have a time limit. I'll attach the full vtt file downloaded with:

curl -o en.vtt 'https://www.youtube.com/api/timedtext?v=qRBLpBxkldc&ei=yBCmZ8mzJubZy_sP0Pm1oQ8&caps=asr&opi=112496729&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1738961720&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=3BCD9784BD0F9388254C0812A3E25DAB7D780766.CF516C7D2D2623AD8318EE013DF636A3AF9E947C&key=yt8&kind=asr&lang=en&fmt=vtt'

and renamed to make GitHub happy: en.vtt.txt


palemieux · 2025-02-07T17:48:09Z

If I read the VTT specification correctly, the extra line between 00:00:00.799 --> 00:00:02.869 align:start position:0% and hi is prohibited.

zellyn · 2025-02-07T17:52:05Z

To some extent, though, the tacit spec is somewhere between the official spec and whatever YouTube does… 😔

palemieux · 2025-02-07T18:02:13Z

FYI. I was pointed to the following validator:

https://w3c.github.io/webvtt.js/parser.html

palemieux · 2025-02-07T19:15:57Z

I think allowing an extra line between the cue timing line and the first line of cue contents significantly complicates the parser, which would need to do some look ahead to differentiate between cue contents and cue identifier.

Ideally someone from YT could weigh in.
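To illustrate the look-ahead concern, here is a minimal sketch of the check a tolerant parser would need once it has skipped a stray blank line after a timing line. The helper name `classify_after_blank` is hypothetical, the `-->` test is a simplification of the real timing grammar, and none of this is ttconv code.

```python
def classify_after_blank(lines, i):
    """Classify lines[i], the first non-empty line found after a stray blank line.

    Returns "timing", "identifier", or "contents".
    Hypothetical helper for illustration only; not part of ttconv.
    """
    # Simplification: treat any line containing "-->" as a cue timing line.
    if "-->" in lines[i]:
        return "timing"
    # Look ahead one line: a cue identifier is immediately followed by a timing line.
    if i + 1 < len(lines) and "-->" in lines[i + 1]:
        return "identifier"
    # Otherwise assume the line is cue contents belonging to the previous timing line.
    return "contents"
```

Without the tolerance for a blank line, no such peek is needed: a blank line simply ends the cue.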

palemieux · 2025-02-07T20:42:38Z

@zellyn See #440. The proposed fix avoids crashing and burning, and instead ignores the cue.

palemieux · 2025-02-08T23:07:54Z

P.S.: I have asked for input on both the W3C TT WG reflector and the CCSUBS reflector. I plan to merge the PR sometime late next week unless I hear otherwise.

palemieux · 2025-02-09T20:07:51Z

@zellyn It looks like en.vtt.txt includes a single space character after the timing line, but the inline snippet does not. Can you confirm that the YT download includes that single space?

zellyn · 2025-02-10T14:47:05Z

Oh, good catch! Yep, it appears to be there:

URL=$(yt-dlp --skip-download --dump-json 'https://www.youtube.com/watch?v=qRBLpBxkldc' | jq -r '.automatic_captions.en[] | select(.ext=="vtt") | .url')

curl -fsSL -o en.vtt "$URL"

xxd en.vtt | head -10
00000000: 5745 4256 5454 0a4b 696e 643a 2063 6170  WEBVTT.Kind: cap
00000010: 7469 6f6e 730a 4c61 6e67 7561 6765 3a20  tions.Language:
00000020: 656e 0a0a 3030 3a30 303a 3030 2e37 3939  en..00:00:00.799
00000030: 202d 2d3e 2030 303a 3030 3a30 322e 3836   --> 00:00:02.86
00000040: 3920 616c 6967 6e3a 7374 6172 7420 706f  9 align:start po
00000050: 7369 7469 6f6e 3a30 250a 200a 6869 3c30  sition:0%. .hi<0
00000060: 303a 3030 3a30 312e 3034 303e 3c63 3e20  0:00:01.040><c>
00000070: 6576 6572 796f 6e65 3c2f 633e 3c30 303a  everyone</c><00:
00000080: 3030 3a30 312e 3932 303e 3c63 3e20 746f  00:01.920><c> to
00000090: 6461 793c 2f63 3e3c 3030 3a30 303a 3032  day</c><00:00:02

You can see 0a 200a on line 00000050: newline, space, newline.
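The same check can be scripted instead of reading a hex dump. The sketch below reports whitespace-only lines that directly follow a timing line; the function name and the loose timing regex are assumptions made for this example, not part of ttconv or the WebVTT grammar.

```python
import re

# Loose pattern for a cue timing line; an assumption for this sketch, not the full grammar.
TIMING_RE = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")

def whitespace_only_after_timing(text):
    """Return 1-based numbers of whitespace-only lines that directly follow a timing line."""
    hits = []
    lines = text.split("\n")
    for i in range(1, len(lines)):
        line = lines[i].rstrip("\r")
        # Non-empty, but nothing left once whitespace is stripped: e.g. a lone space.
        if TIMING_RE.match(lines[i - 1]) and line != "" and line.strip() == "":
            hits.append(i + 1)
    return hits

# First cue of the download, with the lone space that the inline snippet lost.
sample = (
    "WEBVTT\nKind: captions\nLanguage: en\n\n"
    "00:00:00.799 --> 00:00:02.869 align:start position:0%\n \n"
    "hi<00:00:01.040><c> everyone</c>\n"
)
print(whitespace_only_after_timing(sample))  # [6]
```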

palemieux · 2025-02-10T18:29:14Z

I have revised the PR to fix the detection of empty lines. Empty lines are those that contain no characters (other than \n\r).
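As a rough illustration of that definition (not the code from #440, and with a hypothetical function name), a line counts as empty only if nothing remains after removing line terminators, so the lone space in the YouTube file is not an empty line:

```python
def is_empty_line(line):
    """True only if the line holds no characters besides \\r and \\n.

    Hypothetical helper for illustration; not the ttconv implementation.
    """
    return line.strip("\r\n") == ""

print(is_empty_line("\n"))    # True
print(is_empty_line("\r\n"))  # True
print(is_empty_line(" \n"))   # False: a lone space is content
```

A looser test such as `line.strip() == ""` would also treat the lone space as an empty line.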


palemieux mentioned this issue Feb 7, 2025

VTT reader: fix empty line detection #440

Merged

palemieux self-assigned this Feb 7, 2025

palemieux added the bug and c-vtt-reader labels Feb 7, 2025

palemieux closed this as completed in #440 Feb 12, 2025

palemieux added a commit that referenced this issue Feb 12, 2025

VTT reader: fix empty line detection

3562f7b
