Breaks on vtt file downloaded from YouTube #439

Closed
zellyn opened this issue Feb 7, 2025 · 9 comments · Fixed by #440
@zellyn

zellyn commented Feb 7, 2025

Version: whatever uvx fetches by default right now, presumably 1.1.1.

I'm having trouble parsing a vtt file downloaded from YouTube, using a URL that yt-dlp gave me.

The full file can be fetched with:

curl -o en.vtt 'https://www.youtube.com/api/timedtext?v=qRBLpBxkldc&ei=A3alZ8TiCrrFy_sPsqXh2Qw&caps=asr&opi=112496729&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1738922099&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=CDC65F7C2C9C910453D8D6CB48688281B95B3FB4.B99D5AA37DE09A33776B24C1EA4B43D88909E285&key=yt8&kind=asr&lang=en&fmt=vtt'

This snippet of just the first few lines is enough to provoke the problem:

WEBVTT
Kind: captions
Language: en

00:00:00.799 --> 00:00:02.869 align:start position:0%

hi<00:00:01.040><c> everyone</c><00:00:01.920><c> today</c><00:00:02.240><c> we're</c><00:00:02.399><c> going</c><00:00:02.639><c> to</c><00:00:02.720><c> be</c>

00:00:02.869 --> 00:00:02.879 align:start position:0%
hi everyone today we're going to be


00:00:02.879 --> 00:00:04.070 align:start position:0%
hi everyone today we're going to be
modeling<00:00:03.360><c> a</c><00:00:03.520><c> basic</c>

00:00:04.070 --> 00:00:04.080 align:start position:0%
modeling a basic
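
For readers unfamiliar with YouTube's auto-generated captions: the cue text above interleaves WebVTT timestamp tags (`<00:00:01.040>`) and class spans (`<c>...</c>`) to carry per-word timing. The following is a minimal Python sketch, independent of ttconv, of how such a line could be reduced to plain text and a list of tag times. The names `plain_text` and `tag_times_ms` are hypothetical, and the regular expressions are a simplification rather than a full WebVTT cue text parser.

```python
import re

# Simplified patterns; a real WebVTT parser also handles escapes and nesting.
TIMESTAMP_TAG = re.compile(r"<(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})>")
ANY_TAG = re.compile(r"<[^>]*>")


def plain_text(cue_line: str) -> str:
    """Drop every <...> tag and keep only the visible text."""
    return ANY_TAG.sub("", cue_line)


def tag_times_ms(cue_line: str) -> list[int]:
    """Return each inline timestamp tag as integer milliseconds."""
    times = []
    for hours, minutes, seconds, millis in TIMESTAMP_TAG.findall(cue_line):
        # The hours component is optional; findall yields "" when it is absent.
        total = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
        times.append(total * 1000 + int(millis))
    return times


line = "hi<00:00:01.040><c> everyone</c><00:00:01.920><c> today</c>"
print(plain_text(line))
print(tag_times_ms(line))
```

Working in integer milliseconds avoids floating-point rounding when comparing tag times against cue boundaries.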

Here's my snippet of code:

#!bin/uv run
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "ttconv",
# ]
# ///

import ttconv.vtt.reader as vtt_reader

def parse_vtt(filename):
    with open(filename, encoding="utf-8") as f:
        m = vtt_reader.to_model(f)

parse_vtt('en-minimal.vtt')

And here's the error I get:

Traceback (most recent call last):
  File "/Users/zellyn/gh/jogwheel/./parse_subtitles.py", line 15, in <module>
    parse_vtt('en-minimal.vtt')
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/zellyn/gh/jogwheel/./parse_subtitles.py", line 13, in parse_vtt
    m = vtt_reader.to_model(f)
  File "/Users/zellyn/.cache/uv/archive-v0/gFjA1NVZI1iqd7-xVXdBR/lib/python3.13/site-packages/ttconv/vtt/reader.py", line 527, in to_model
    subtitle_text = subtitle_text.strip('\r\n').replace(r"\n\r", "\n")
                    ^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'subtitle_text' where it is not associated with a value

Note: if I edit .../vtt/reader.py and add a line with subtitle_text = "" after the line current_p = None, it succeeds in parsing this snippet. However, if I run it on the full vtt file, it starts disliking the timestamps:

Invalid timestamp tag 00:07:09.759
Invalid timestamp tag 00:07:09.919
Invalid timestamp tag 00:07:10.319
# etc
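
The `UnboundLocalError` in the traceback is the usual Python symptom of a local variable that is assigned only on some code paths. The sketch below is a deliberately simplified, hypothetical cue reader (it is not ttconv's actual code, and `read_first_cue_text` is an invented name). It shows how a blank line directly after the timing line could produce this error, and how initializing the variable, as in the workaround described above, avoids it.

```python
def read_first_cue_text(lines, initialize=False):
    """Hypothetical, simplified reader; NOT the ttconv implementation."""
    if initialize:
        subtitle_text = ""  # the workaround: bind the name up front
    seen_timing = False
    for line in lines:
        if "-->" in line:
            seen_timing = True
        elif seen_timing and line.strip("\r\n") == "":
            break  # blank line ends the cue, possibly before any text was read
        elif seen_timing:
            subtitle_text = line  # only bound when a text line is seen
    return subtitle_text.strip("\r\n")


cue = ["00:00:00.799 --> 00:00:02.869\n", "\n", "hi everyone\n"]

try:
    read_first_cue_text(cue)
except UnboundLocalError as exc:
    print("failed:", type(exc).__name__)

print(repr(read_first_cue_text(cue, initialize=True)))
```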

[Edit]

Hmmm. That URL appears to have a time limit. I'll attach the full vtt file downloaded with:

curl -o en.vtt 'https://www.youtube.com/api/timedtext?v=qRBLpBxkldc&ei=yBCmZ8mzJubZy_sP0Pm1oQ8&caps=asr&opi=112496729&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1738961720&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=3BCD9784BD0F9388254C0812A3E25DAB7D780766.CF516C7D2D2623AD8318EE013DF636A3AF9E947C&key=yt8&kind=asr&lang=en&fmt=vtt'

and renamed to make GitHub happy: en.vtt.txt

@palemieux
Contributor

If I read the VTT specification correctly, the extra line between 00:00:00.799 --> 00:00:02.869 align:start position:0% and hi is prohibited.

@zellyn
Author

zellyn commented Feb 7, 2025

To some extent, though, the tacit spec is somewhere between the official spec and whatever YouTube does… 😔

@palemieux
Contributor

FYI. I was pointed to the following validator:

https://w3c.github.io/webvtt.js/parser.html

@palemieux
Contributor

I think allowing an extra line between the cue timing line and the first line of cue contents significantly complicates the parser, which would need to do some look ahead to differentiate between cue contents and cue identifier.

Ideally someone from YT could weigh in.
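
To illustrate the look-ahead concern: in WebVTT, a line without `-->` that starts a block is a cue identifier only if the next line is a timing line. If blank lines were also tolerated inside a cue, the parser could no longer decide from the current line alone and would have to peek ahead. A minimal sketch, using the hypothetical helper `is_cue_identifier` and a simplified timing check:

```python
def is_cue_identifier(lines: list[str], index: int) -> bool:
    """Peek one line ahead: an identifier is directly followed by a timing line.

    Hypothetical helper for illustration; the "-->" test stands in for
    full timing-line validation.
    """
    if "-->" in lines[index]:
        return False  # this line is itself a timing line
    following = index + 1
    return following < len(lines) and "-->" in lines[following]


lines = [
    "00:00:00.799 --> 00:00:02.869",
    "",                               # the disputed extra line
    "hi everyone",                    # cue text, not an identifier
    "",
    "intro",                          # a genuine cue identifier
    "00:00:02.869 --> 00:00:02.879",
    "hi everyone today",
]
print(is_cue_identifier(lines, 2))
print(is_cue_identifier(lines, 4))
```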

@palemieux
Contributor

@zellyn See #440. The proposed fix avoids crashing and burning, and instead ignores the cue.

palemieux self-assigned this Feb 7, 2025
palemieux added the bug and c-vtt-reader labels Feb 7, 2025
@palemieux
Contributor

palemieux commented Feb 8, 2025

P.S.: I have asked for input on both the W3C TT WG reflector and the CCSUBS reflector. I plan to merge the PR sometime late next week unless I hear otherwise.

@palemieux
Contributor

palemieux commented Feb 9, 2025

@zellyn It looks like en.vtt.txt includes a single space character after the timing line, but the inline snippet does not. Can you confirm that the YT download includes that single space?

@zellyn
Author

zellyn commented Feb 10, 2025

Oh, good catch! Yep, it appears to be there:

URL=$(yt-dlp --skip-download --dump-json 'https://www.youtube.com/watch?v=qRBLpBxkldc' | jq -r '.automatic_captions.en[] | select(.ext=="vtt") | .url')

curl -fsSL -o en.vtt $URL

xxd en.vtt | head -10
00000000: 5745 4256 5454 0a4b 696e 643a 2063 6170  WEBVTT.Kind: cap
00000010: 7469 6f6e 730a 4c61 6e67 7561 6765 3a20  tions.Language:
00000020: 656e 0a0a 3030 3a30 303a 3030 2e37 3939  en..00:00:00.799
00000030: 202d 2d3e 2030 303a 3030 3a30 322e 3836   --> 00:00:02.86
00000040: 3920 616c 6967 6e3a 7374 6172 7420 706f  9 align:start po
00000050: 7369 7469 6f6e 3a30 250a 200a 6869 3c30  sition:0%. .hi<0
00000060: 303a 3030 3a30 312e 3034 303e 3c63 3e20  0:00:01.040><c>
00000070: 6576 6572 796f 6e65 3c2f 633e 3c30 303a  everyone</c><00:
00000080: 3030 3a30 312e 3932 303e 3c63 3e20 746f  00:01.920><c> to
00000090: 6461 793c 2f63 3e3c 3030 3a30 303a 3032  day</c><00:00:02

You can see the bytes 0a 20 0a (grouped by xxd as 250a 200a) on the line at offset 00000050: newline, space, newline.
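
The same check can be scripted. This is a minimal Python sketch, assuming the file uses `\n` line endings as in the hex dump above; `whitespace_only_lines` is a hypothetical helper that reports lines containing only spaces or tabs, together with their byte offsets.

```python
def whitespace_only_lines(data: bytes) -> list[tuple[int, int]]:
    """Return (line_number, byte_offset) for lines made only of spaces/tabs."""
    found = []
    offset = 0
    for number, line in enumerate(data.split(b"\n"), start=1):
        if line and line.strip(b" \t") == b"":
            found.append((number, offset))
        offset += len(line) + 1  # +1 for the "\n" that split() removed
    return found


sample = (
    b"WEBVTT\nKind: captions\nLanguage: en\n\n"
    b"00:00:00.799 --> 00:00:02.869 align:start position:0%\n"
    b" \n"
    b"hi<00:00:01.040><c> everyone</c>\n"
)
for number, offset in whitespace_only_lines(sample):
    print(f"line {number} at offset 0x{offset:02x}")
```

For the opening lines shown in the hex dump, this reports line 6 at offset 0x5a, which is the 20 byte sitting between the two 0a bytes.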

@palemieux
Contributor

I have revised the PR to fix the detection of empty lines. Empty lines are those that contain no characters (other than \n\r).
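
As a sketch of what that definition implies (an illustration only, not the code in the PR; `is_empty_line` and `split_blocks` are hypothetical names): a line holding a single space is not empty, so it does not terminate the cue block, and the text after it stays attached to the same cue.

```python
from itertools import groupby


def is_empty_line(line: str) -> bool:
    """A line is empty only if nothing remains once line terminators are removed."""
    return line.strip("\r\n") == ""


def split_blocks(text: str) -> list[list[str]]:
    """Group consecutive non-empty lines into blocks separated by empty lines."""
    lines = text.splitlines()
    return [list(group) for empty, group in groupby(lines, key=is_empty_line) if not empty]


text = (
    "WEBVTT\n\n"
    "00:00:00.799 --> 00:00:02.869\n"
    " \n"                      # single space: not an empty line
    "hi everyone\n\n"
    "00:00:02.869 --> 00:00:02.879\n"
    "hi everyone today\n"
)
for block in split_blocks(text):
    print(block)
```

By contrast, a predicate based on `line.strip() == ""` would treat the single-space line as a separator and split the first cue before its text.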
