<!DOCTYPE html>
<html>
<head>
<title>Requirements for Media Timed Events</title>
<meta charset="utf-8">
<script src="https://www.w3.org/Tools/respec/respec-w3c" class="remove" defer></script>
<script class="remove">
var respecConfig = {
specStatus: "IG-NOTE",
edDraftURI: "https://w3c.github.io/me-media-timed-events/",
shortName: "media-timed-events",
editors: [
{
name: "Chris Needham",
mailto: "chris.needham@bbc.co.uk",
company: "British Broadcasting Corporation",
companyURL: "https://www.bbc.co.uk"
},
],
formerEditors: [
{
name: "Giridhar Mandyam",
company: "Qualcomm",
note: "until December 2018"
}
],
group: "me",
charterDisclosureURI: "https://www.w3.org/2019/06/me-ig-charter.html#patentpolicy",
github: {
repoURL: "https://github.com/w3c/me-media-timed-events/",
branch: "master"
},
localBiblio: {
"ID3-EMSG": {
title: "Carriage of ID3 Timed Metadata in the Common Media Application Format (CMAF)",
href: "https://aomediacodec.github.io/id3-emsg/",
authors: [
"Krasimir Kolarov",
"John Simmons"
],
publisher: "AOM",
date: "12 March 2020"
},
"DASH-EVENTING": {
title: "DASH Eventing and HTML5",
href: "https://www.w3.org/2011/webtv/wiki/images/a/a5/DASH_Eventing_and_HTML5.pdf",
authors: [
"Giridhar Mandyam"
],
date: "February 2018"
},
"DASHIF-EVENTS": {
title: "DASH Player’s Application Events and Timed Metadata Processing Models and APIs",
href: "https://dashif-documents.azurewebsites.net/Events/master/event.html",
publisher: "DASH Industry Forum",
date: "3 July 2019"
},
"NORDIG": {
title: "NorDig Unified Requirements for Integrated Receiver Decoders for use in cable, satellite, terrestrial and managed IPTV based networks",
href: "https://nordig.org/wp-content/uploads/2018/10/NorDig-Unified-Requirements-ver.-3.1.pdf",
publisher: "NorDig",
date: "27 October 2018"
}
}
};
</script>
</head>
<body>
<section id="abstract">
<p>
This document collects use cases and requirements for improved support
for timed events related to audio or video media on the web, where
synchronization to a playing audio or video media stream is needed,
and makes recommendations for new or changed web APIs to realize these
requirements. The goal is to extend the existing support in HTML for
text track cues to add support for dynamic content replacement cues and
generic data cues that drive synchronized interactive media experiences,
and improve the timing accuracy of rendering of web content intended to
be synchronized with audio or video media playback.
</p>
</section>
<section id="sotd">
<p>
The Media & Entertainment Interest Group may update these
use cases and requirements over time. Development of new web APIs based
on the requirements described here, for example, <code>DataCue</code>,
will proceed in the <a href="https://wicg.io/">Web Platform
Incubator Community Group (WICG)</a>, with the goal of eventual
standardization within a W3C Working Group. Contributors to this
document are encouraged to participate in the WICG. Where the
requirements described here affect the HTML specification, contributors
will follow up with <a href="https://whatwg.org/">WHATWG</a>. The Interest
Group will continue to track these developments and provide input and
review feedback on how any proposed API meets these requirements.
</p>
</section>
<section>
<h2>Introduction</h2>
<p>
There is a need in the media industry for an API to support arbitrary
data associated with points in time or periods of time in a continuous
media (audio or video) presentation. This data may include:
</p>
<ul>
<li>
metadata that describes the content in some way, such as program or
chapter titles or geolocation information, often referred to as
<em>timed metadata</em>; or
</li>
<li>
control messages for the media player that are expected to take effect
at specific times during media playback, such as ad insertion cues.
</li>
</ul>
<p>
For the purpose of this document, we refer to these collectively as
<a>media timed events</a>. These events can be used to carry
information intended to be synchronized with the media stream,
supporting use cases such as dynamic content replacement, ad insertion,
presentation of supplemental content alongside the audio or video, or
more generally, making changes to a web page, or executing application
code triggered at specific points on the <a>media timeline</a> of an
audio or video media stream.
</p>
<p>
<a>Media timed events</a> may be carried either <a>in-band</a>, meaning
that they are delivered within the audio or video media container or
multiplexed with the media stream, or <a>out-of-band</a>, meaning that
they are delivered externally to the media container or media stream.
</p>
<p>
This document describes use cases and requirements that go beyond the
existing support for timed text, using <a>TextTrack</a> and related
APIs.
</p>
</section>
<section>
<h2>Terminology</h2>
<p>
The following terms are used in this document:
</p>
<ul>
<li>
<dfn data-lt="media timed event">media timed event</dfn> —
information that is associated with a point in time or a period of
time, synchronized to the <a>media timeline</a> of a <a>media
resource</a>.
</li>
<li>
<dfn>in-band</dfn> — <a>media timed event</a> data that is
delivered within the audio or video media container or multiplexed
with the media stream.
</li>
<li>
<dfn>out-of-band</dfn> — <a>media timed event</a> data that is
delivered over some other mechanism external to the media container or
media stream.
</li>
</ul>
<p>
The following terms are defined in [[HTML]]:
</p>
<ul>
<li>
<dfn data-cite="HTML/media.html#media-element">media element</dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#media-timeline">media timeline</dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#media-resource">media resource</dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#time-marches-on">time marches on</dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#dom-texttrack-activecues"><code>activeCues</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#dom-media-currenttime"><code>currentTime</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#event-media-enter"><code>enter</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#event-media-exit"><code>exit</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#handler-texttrack-oncuechange"><code>oncuechange</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#handler-texttrackcue-onenter"><code>onenter</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#handler-texttrackcue-onexit"><code>onexit</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#texttrack"><code>TextTrack</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#texttrackcue"><code>TextTrackCue</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#event-media-timeupdate"><code>timeupdate</code></dfn>
</li>
<li>
<dfn data-cite="HTML/timers-and-user-prompts.html#dom-settimeout"><code>setTimeout()</code></dfn>
</li>
<li>
<dfn data-cite="HTML/timers-and-user-prompts.html#dom-setinterval"><code>setInterval()</code></dfn>
</li>
<li>
<dfn data-cite="HTML/imagebitmap-and-animations.html#dom-animationframeprovider-requestanimationframe"><code>requestAnimationFrame()</code></dfn>
</li>
</ul>
<p>
The following term is defined in [[HR-TIME]]:
</p>
<ul>
<li>
<dfn data-cite="HR-TIME#dom-performance-now"><code>Performance.now()</code></dfn>
</li>
</ul>
<p>
The following term is defined in [[WEBVTT]]:
</p>
<ul>
<li>
<dfn data-cite="WEBVTT#vttcue"><code>VTTCue</code></dfn>
</li>
</ul>
</section>
<section>
<h2>Use cases</h2>
<p>
<a>Media timed events</a> carry information that is related to points in
time or periods of time on the <a>media timeline</a>, which can be used
to trigger retrieval and/or rendering of web resources synchronized with
media playback. Such resources can be used to enhance user experience
in the context of media that is being rendered. Some examples include
display of social media feeds corresponding to a live video stream such
as a sporting event, banner advertisements for sponsored content, and
accessibility-related assets such as large print rendering of
captions.
</p>
<p>
The following sections describe a few use cases in more detail.
</p>
<section id="dynamic-content-insertion">
<h3>Dynamic content insertion</h3>
<p>
A media content provider wants to allow insertion of content, such as
personalised video, local news, or advertisements, into a video media
stream that contains the main program content. To achieve this,
<a>media timed events</a> can be used to describe the points on the
<a>media timeline</a>, known as splice points, where switching
playback to inserted content is possible.
</p>
<p>
The Society for Cable and Television Engineers (SCTE) specification
"Digital Program Insertion Cueing Message for Cable" [[SCTE35]] defines a data
cue format for describing such insertion points. Use of these cues in
MPEG-DASH and HLS streams is described in [[SCTE35]], sections 12.1
and 12.2.
</p>
<p>
This use case typically requires frame accuracy, so that inserted
content is played at the right time, and continuous playback is
maintained.
</p>
</section>
<section>
<h3>Audio stream with titles and images</h3>
<p>
A media content provider wants to provide visual information alongside
an audio stream, such as an image of the artist and title of the
current playing track, to give users live information about the
content they are listening to.
</p>
<p>
HLS timed metadata [[HLS-TIMED-METADATA]] uses
<a>in-band</a> ID3 metadata to carry the artist and title information,
and image content. RadioVIS in DVB ([[DVB-DASH]], section 9.1.7)
defines <a>in-band</a> event messages that contain image URLs and text
messages to be displayed, with information about when the content
should be displayed in relation to the <a>media timeline</a>.
</p>
<p>
The visual information should be rendered within a hundred milliseconds
or so to maintain good synchronization with the audio content.
</p>
</section>
<section>
<h3>Control messages for media streaming clients</h3>
<p>
MPEG-DASH defines a number of control messages for media streaming
clients (e.g., libraries such as <a href="https://github.com/Dash-Industry-Forum/dash.js">dash.js</a>).
These messages are carried <a>in-band</a> in the media container
files. Use cases include:
</p>
<ul>
<li>
Signalling that the DASH manifest document (MPD) has expired and
should be updated. This method is used as an alternative to setting
a cache duration in the response to the HTTP request for the
manifest, so the client can refresh the manifest document when it
actually changes, as opposed to waiting for a cache duration expiry
period to elapse. This also has the benefit of reducing the load on
HTTP servers caused by frequent server requests.
</li>
<li>
Analytics callback events, to allow content providers to track
media playback. In response to this message, the client
makes an HTTP request to a URL specified in the <a>media timed
event</a> data.
</li>
<li>
Signalling early termination of the media presentation, for cases
where the media ends earlier than expected from the current DASH
manifest document.
</li>
</ul>
<p>
Reference: M&amp;E IG call 1 Feb 2018:
<a href="https://www.w3.org/2018/02/01-me-minutes.html">Minutes</a>,
[[DASH-EVENTING]].
</p>
</section>
<section>
<h3>Subtitle and caption rendering synchronization</h3>
<p>
A subtitle or caption author wants to ensure that subtitle changes are
aligned as closely as possible to shot changes in the video.
The BBC Subtitle Guidelines [[BBC-SUBTITLE]] describes authoring
best practices. In particular, in section 6.1 authors are advised:
</p>
<blockquote>
"[...] it is likely to be less tiring for the viewer if shot changes
and subtitle changes occur at the same time. Many subtitles therefore
start on the first frame of the shot and end on the last frame."
</blockquote>
<p>
The NorDig technical specifications for DVB receivers for the Nordic
and Irish markets [[NORDIG]], section 7.3.1, mandate that receivers
support TTML in MPEG-2 Transport Streams. The presentation timing
precision for subtitles is specified as being within 2 frames.
</p>
<p>
Another important use case is maintaining synchronization of subtitles
during program content with fast dialog. The BBC Subtitle Guidelines,
section 5.1 says:
</p>
<blockquote>
"Impaired viewers make use of visual cues from the
faces of television speakers. Therefore subtitle appearance should
coincide with speech onset. [...] When two or more people are
speaking, it is particularly important to keep in sync. Subtitles for
new speakers must, as far as possible, come up as the new speaker
starts to speak. Whether this is possible will depend on the action on
screen and rate of speech."
</blockquote>
<p>
A very fast word rate, for example, 240 words per minute, corresponds
on average to one word every 250 milliseconds.
</p>
</section>
<section>
<h3>Synchronized map animations</h3>
<p>
A user records footage with metadata, including geolocation, on a
mobile video device, e.g., drone or dashcam, to share on the web
alongside a map, e.g., OpenStreetMap.
</p>
<p>
[[WEBVMT]] is an open format for metadata cues, synchronized with a
timed media file, that can be used to drive an online map rendered in
a separate HTML element alongside the <a>media element</a> on the web page.
The media playhead position controls presentation and animation of the
map, e.g., pan and zoom, and allows annotations to be added and
removed, e.g., markers, at specified times during media playback.
Control can also be overridden by the user with the usual interactive
features of the map at any time, e.g., zoom. The rendering of the map
animation and annotations should usually be to within a hundred
milliseconds or so to maintain good synchronization with the video.
However, a shot change which instantly moves to a different location
would require the map to be updated simultaneously, ideally with frame
accuracy.
</p>
<p>
Concrete examples are provided by the
<a href="http://webvmt.org/demos">tech demos</a> at the WebVMT website.
</p>
</section>
<section>
<h3>Media stream with video and synchronized graphics</h3>
<p>
A content provider wants to provide synchronized graphical elements
that may be rendered next to or on top of a video.
</p>
<p>
For example, in a talk show this could be a banner, shown in the lower
third of the video, that displays the name of the guest. In a sports
event, the graphics could show the latest lap times or current score,
or highlight the location of the current active player. It could even
be a full-screen overlay, to blend from one part of the program to
another.
</p>
<p>
The graphical elements are described in a stream or file containing
<a>media timed events</a> for start and end time of each graphical
element, similar to a subtitle stream or file. A graphic renderer
takes this data as input and renders it on top of the video image
according to the <a>media timed events</a>.
</p>
<p>
The purpose of rendering the graphical elements on the client device,
rather than rendering them directly into the video image, is to allow
the graphics to be optimized for the device's display parameters,
such as aspect ratio and orientation. Another use case is adapting
to user preferences, for localization or to improve accessibility.
</p>
<p>
This use case requires frame accurate synchronization of the content
being rendered over the video.
</p>
</section>
<section>
<h3>Live event coverage</h3>
<p>
Media content providers often cover live events where the timing
of particular segments, although often pre-scheduled, can be subject
to last minute change, or may not be known ahead of time.
</p>
<p>
The media content provider uses media timed events together with their
video stream to add metadata to annotate the start and (where known)
end times of each of these segments. This metadata drives a user
interface that allows users to see information about the current
playing and upcoming segments.
</p>
<p>
Examples of the dynamic nature of the timing include:
</p>
<ul>
<li>
A baseball game, where the total duration is not known in advance,
but which, after the game ends, is followed by a period of known
duration for post-game interviews.
</li>
<li>
An award show, which is scheduled for a given duration, runs over
time and so the exact end time becomes unknown.
</li>
<li>
A regularly scheduled one-hour news program, which is extended by
30 minutes to cover a breaking news story, or is cut short for
an unscheduled government announcement.
</li>
</ul>
</section>
<section>
<h3>Presentation of auxiliary content in live media</h3>
<p>
During a live media presentation, dynamic and unpredictable events may
occur which cause temporary suspension of the media presentation.
During that suspension interval, auxiliary content, such as the
presentation of UI controls and media files, may be unavailable.
Depending on whether the user engages with these UI controls, and on
the time at which any such engagement occurs, specific web resources
may be rendered at defined times in a synchronized manner. For
example, a multimedia A/V clip, along with subtitles, corresponding to
an advertisement and previously downloaded and cached by the UA, is
played out.
</p>
</section>
</section>
<section>
<h2>Related industry specifications</h2>
<p>
This section describes existing media industry specifications and
standards that specify carriage of <a>media timed events</a>, or
otherwise provide requirements for web APIs related to the triggering of
DOM events synchronized with the <a>media timeline</a>.
</p>
<section>
<h3>MPEG Common Media Application Format (CMAF)</h3>
<p>
MPEG Common Media Application Format (CMAF) [[MPEGCMAF]] is a media
container format optimized for large scale delivery of a single
encrypted, adaptable multimedia presentation to a wide range of
devices and adaptive streaming methods, including HTTP Live Streaming
[[RFC8216]] and MPEG-DASH [[MPEGDASH]]. It is based on the ISO BMFF
[[ISOBMFF]] and supports the AVC, AAC, HEVC codecs, Common Encryption
(CENC), and subtitles using IMSC1 and WebVTT. Its goal is to reduce
media storage and delivery costs by using a single common media format
across different client devices.
</p>
<p>
CMAF media may contain <a>in-band</a> <a>media timed events</a> in the
form of Event Message (<code>emsg</code>) boxes in ISO BMFF files.
<code>emsg</code> is specified in [[MPEGDASH]], section 5.10.3.3,
and described in more detail in the following section of this
document.
</p>
</section>
<section>
<h3>MPEG-DASH</h3>
<p>
MPEG-DASH is an adaptive bitrate streaming technique in which the
audio and video media is partitioned into segments. The Media
Presentation Description (MPD) is an XML document that contains
metadata required by a DASH client to access the media segments and to
provide the streaming service to the user. The media segments can use
any codec, typically within a fragmented MP4 (ISO BMFF) container or
MPEG-2 transport stream.
</p>
<p>
In MPEG-DASH, <a>media timed events</a> may be delivered either
<a>in-band</a> or <a>out-of-band</a>:
</p>
<ul>
<li>
<a>In-band</a> <a>media timed events</a> are delivered via "event
message" (<code>emsg</code>) boxes in ISO BMFF files. The presence
of <code>emsg</code> events in the media container for given event
schemes is signaled in the MPD document using an
<code>EventStream</code> XML element ([[MPEGDASH]], section 5.10.2).
</li>
<li>
<a>Out-of-band</a> <a>media timed events</a> are delivered via
<code>Event</code> XML elements contained within an
<code>EventStream</code> element in the MPD.
</li>
</ul>
<p>
An <code>emsg</code> event contains the following information,
as specified in [[MPEGDASH]], section 5.10.3.3:
</p>
<ul>
<li><code>id</code> — Event message identifier</li>
<li><code>scheme_id_uri</code> — A URI that identifies
the message scheme</li>
<li><code>value</code> — The event value (string)</li>
<li><code>timescale</code> — Timescale units, in ticks
per second</li>
<li><code>presentation_time_delta</code> — Presentation
time delta (with respect to the media segment),
in <code>timescale</code> units</li>
<li><code>event_duration</code> — Event duration,
in <code>timescale</code> units</li>
<li><code>message_data</code> — Message body (may be empty)</li>
</ul>
</section>
<section>
<h3>HTTP Live Streaming</h3>
<p>
HTTP Live Streaming (HLS) allows for delivery of timed metadata
events, both <a>in-band</a> and <a>out-of-band</a>:
</p>
<ul>
<li>
HLS supports delivery of ID3 timed metadata carried <a>in-band</a>
within MPEG-2 Transport Streams [[HLS-TIMED-METADATA]]. ID3 metadata
is stored as a complete ID3v2.4 frame in a packetized elementary stream (PES)
packet, including a complete ID3 header [[ID3v2]].
</li>
<li>
<a>Out-of-band</a> events are delivered using the
<code>EXT-X-DATERANGE</code> tag in a HLS playlist file.
</li>
</ul>
<p>
An <code>EXT-X-DATERANGE</code> tag contains the following information, as specified in
[[RFC8216]], section 4.3.2.7:
</p>
<ul>
<li><code>ID</code> — Unique event identifier</li>
<li><code>CLASS</code> — An identifier that indicates the event semantics. All events with the same <code>CLASS</code> have the same semantics</li>
<li><code>START-DATE</code> — The date and time of the start of the event</li>
<li><code>END-DATE</code> — The date and time of the end of the event (optional)</li>
<li><code>DURATION</code> — Event duration, in seconds (optional). May be zero</li>
<li><code>PLANNED-DURATION</code> — Expected event duration, where the actual duration is not yet known (optional)</li>
<li><code>END-ON-NEXT</code> — Indicates that the event automatically ends when the next event of the same <code>CLASS</code> starts. If present, this attribute replaces <code>END-DATE</code> and <code>DURATION</code></li>
<li><code>X-&lt;client-attribute&gt;</code> — Allows the application to define its own key/value metadata pairs</li>
</ul>
<p>
For interoperability between HLS and CMAF, the Alliance for Open Media
has published [[ID3-EMSG]], which specifies how to include ID3 metadata
in <code>emsg</code> boxes.
</p>
</section>
<section>
<h3>HbbTV</h3>
<p>
HbbTV is an interactive TV application standard that supports both
broadcast (DVB) media delivery, and internet streaming using
MPEG-DASH. The HbbTV application environment is based on HTML and
JavaScript. MPEG-DASH streaming is implemented natively by the user
agent, rather than through a JavaScript web application using Media
Source Extensions.
</p>
<p>
HbbTV includes support for <code>emsg</code> events ([[DVB-DASH]],
section 9.1) and requires this be mapped to HTML5 <code>DataCue</code>
([[HBBTV]], section 9.3.2). The revision of HTML5 referenced
by [[HBBTV]] is [[html51-20151008]]. This feature is included in user
agents shipping in connected TVs across Europe from 2017.
</p>
<p>
The <a href="https://www.hbbtv.org/wp-content/uploads/2018/03/HbbTV-testcases-2018-1.pdf">HbbTV
device test suite</a> includes test pages and streams that
cover <code>emsg</code> support. HbbTV has a
<a href="https://github.com/HbbTV-Association/ReferenceApplication">reference application</a>
and content for DASH+DRM which includes <code>emsg</code> support.
</p>
</section>
<section>
<h3>DASH Industry Forum APIs for Interactivity</h3>
<p>
The DASH-IF InterOp Working Group has an ongoing work item,
<em>DAInty</em>, "DASH APIs for Interactivity", which aims to specify
a set of APIs between the DASH client/player and interactivity-capable
applications, for both web and native applications [[DASHIFIOP]]. The
origin of this work is a related
<a href="http://www.3gpp.org/ftp/tsg_sa/TSG_SA/TSGS_77/Docs/SP-170796.zip">3GPP
work item</a> on Service Interactivity [[3GPP-INTERACTIVITY]].
The objective is to provide service enablers for user engagement with
auxiliary content and UIs on mobile devices during live or time-shifted
viewing of streaming content delivered over 3GPP broadcast or unicast
bearers, and the measurement and reporting of such interactive
consumption.
</p>
<p>
Two APIs are being developed that are relevant to the scope of the
present document:
</p>
<ul>
<li>
Application subscription/DASH client dispatch of DASH event stream
messages containing interactivity information. Events can be delivered
<a>in-band</a> (<code>emsg</code>) and/or as MPD events.
</li>
<li>
Application subscription/DASH client dispatch of ISO BMFF Timed
Metadata tracks providing similar functionality to DASH event streams.
</li>
</ul>
<p>
Two modes for dispatching events are defined [[DASHIF-EVENTS]].
In <em>on-receive</em> mode, events are dispatched at the
time the event arrives, and in <em>on-start</em> mode, events are
dispatched at the given time on the <a>media timeline</a>. The
"arrival" of events from the DASH client perspective may be either
static or pre-provisioned, in the case of MPD events, or dynamic in the
case of <a>in-band</a> events carried in <code>emsg</code> boxes. The
application can register with the DASH client which mode to use.
</p>
</section>
<section>
<h3>SCTE-35</h3>
<p>
The Society for Cable and Television Engineers (SCTE) has produced the
SCTE-35 specification "Digital Program Insertion Cueing Message for Cable"
[[SCTE35]], which defines a data cue format for describing insertion
points, to support the <a href="#dynamic-content-insertion">dynamic
content insertion</a> use case.
</p>
<p>
[[SCTE214-1]] section 6.7 describes the carriage of SCTE-35 events
as <a>out-of-band</a> events in a MPEG-DASH MPD document.
[[SCTE214-2]] section 9 and [[SCTE214-3]] section 7.3 describe
the carriage of SCTE-35 events as <a>in-band</a> events in MPEG-DASH
using MPEG2-TS and ISO BMFF respectively, using <code>emsg</code>.
</p>
<p>
[[RFC8216]] section 4.3.2.7.1 specifies how to map SCTE-35 events into
HLS timed metadata, using the <code>EXT-X-DATERANGE</code> tag, with
<code>SCTE35-CMD</code>, <code>SCTE35-OUT</code>, and
<code>SCTE35-IN</code> attributes.
</p>
<p>
[[SCTE35]] section 9.1 describes the requirements for content
splicing: "In order to give advance warning of the impending splice
(a pre-roll function), the splice_insert() command could be sent
multiple times before the splice point. For example, the
splice_insert() command could be sent at 8, 5, 4 and 2 seconds prior
to the packet containing the related splice point. In order to meet
other splicing deadlines in the system, any message received with less
than 4 seconds of advance notice may not create the desired result."
</p>
<p>
This places an implicit requirement on the user agent's handling of
event synchronization for insertion cues. The content originator may
provide the cue as little as 2 seconds in advance of the insertion
time. Therefore, the user agent should propagate the event data
associated with the insertion cue to the application in considerably
less than 2 seconds.
</p>
</section>
<section>
<h3>MPEG Carriage of Web Resources in ISO BMFF</h3>
<p>
MPEG Carriage of Web Resources in ISO BMFF [[iso23001-15]] specifies
the use of the ISO BMFF container format for the storage and delivery
of web content. The goal is to allow web resources (HTML, CSS, etc.)
to be parsed from the storage and processed by a user agent at
specific presentation times on the <a>media timeline</a>, and so be
synchronized with other tracks within the container, such as audio,
video, and subtitles.
</p>
<p>
The Media & Entertainment Interest Group is actively tracking
this work and is open to discussing specific requirements for
synchronized rendering of <a>in-band</a> delivered web resources, as
development progresses.
</p>
</section>
<section>
<h3>WebVTT</h3>
<p>
[[WEBVTT]] is a W3C specification that provides a format for web video
text tracks. A <a>VTTCue</a> is a text track cue, and may have
attributes that affect rendering of the cue text on a web page.
WebVTT metadata cues are text that is aligned to the
<a>media timeline</a>. Web applications can use <a>VTTCue</a>
to carry arbitrary data by serializing the data to a string format
(JSON, for example) when creating the cue, and deserializing the data
when the cue's <code>onenter</code> DOM event is fired.
</p>
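<p>
As a rough illustration, an application might serialize its own data
into a <a>VTTCue</a> and deserialize it when the cue becomes active.
The element id, timing, and payload below are illustrative only:
</p>
<pre class="example">
// Create a metadata text track on an existing media element.
const video = document.getElementById('video'); // assumed element id
const track = video.addTextTrack('metadata');

// Serialize arbitrary application data into the cue text.
const data = { title: 'Chapter 1', artworkUrl: 'https://example.com/ch1.jpg' };
const cue = new VTTCue(0, 30, JSON.stringify(data));

// Deserialize the data when the cue becomes active.
cue.onenter = () => {
  const payload = JSON.parse(cue.text);
  console.log('Cue entered:', payload.title);
};

track.addCue(cue);
</pre>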
<p>
Web applications can also use <a>VTTCue</a> to trigger
rendering of <a>out-of-band</a> delivered timed text cues, such as
TTML or IMSC format captions.
</p>
</section>
</section>
<section>
<h2>Gap analysis</h2>
<p>
This section describes gaps in existing web platform
capabilities needed to support the use cases and requirements described
in this document. Where applicable, this section also describes how
existing web platform features can be used as workarounds, and any
associated limitations.
</p>
<section>
<h3>MPEG-DASH and ISO BMFF emsg events</h3>
<p>
The <code>DataCue</code> API has been previously discussed as a means
to deliver <a>in-band</a> <a>media timed event</a> data to web
applications, but this is not implemented in all of the main browser
engines. It is
<a href="https://www.w3.org/TR/2018/WD-html53-20181018/semantics-embedded-content.html#text-tracks-exposing-inband-metadata">included</a>
in the 18 October 2018 HTML 5.3 draft [[HTML53-20181018]], but is
<a href="https://html.spec.whatwg.org/multipage/media.html#timed-text-tracks">not included</a>
in [[HTML]]. See discussion <a href="https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/U06zrT2N-Xk">here</a>
and notes on implementation status <a href="https://lists.w3.org/Archives/Public/public-html/2016Apr/0005.html">here</a>.
</p>
<p>
WebKit <a href="https://discourse.wicg.io/t/media-timed-events-api-for-mpeg-dash-mpd-and-emsg-events/3096/2">supports</a>
a <code>DataCue</code> interface that extends HTML5 <code>DataCue</code>
with two attributes to support non-text metadata, <code>type</code> and
<code>value</code>.
</p>
<pre class="example">
interface DataCue : TextTrackCue {
attribute ArrayBuffer data; // Always empty
// Proposed extensions.
attribute any value;
readonly attribute DOMString type;
};
</pre>
<p>
<code>type</code> is a string identifying the type of metadata:
</p>
<table class="simple">
<thead>
<tr>
<th colspan="2">WebKit <code>DataCue</code> metadata types</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>"com.apple.quicktime.udta"</code></td>
<td>QuickTime User Data</td>
</tr>
<tr>
<td><code>"com.apple.quicktime.mdta"</code></td>
<td>QuickTime Metadata</td>
</tr>
<tr>
<td><code>"com.apple.itunes"</code></td>
<td>iTunes metadata</td>
</tr>
<tr>
<td><code>"org.mp4ra"</code></td>
<td>MPEG-4 metadata</td>
</tr>
<tr>
<td><code>"org.id3"</code></td>
<td>ID3 metadata</td>
</tr>
</tbody>
</table>
<p>
and <code>value</code> is an object with the metadata item key, data,
and optionally a locale:
</p>
<pre class="example">
value = {
key: String
data: String | Number | Array | ArrayBuffer | Object
locale: String
}
</pre>
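<p>
For illustration only, and assuming the WebKit-specific behavior
described above, an application might read ID3 metadata from such cues
as follows, where <code>track</code> is a metadata <a>TextTrack</a>
exposed by the user agent:
</p>
<pre class="example">
// Sketch (WebKit-specific): reading ID3 metadata from extended DataCues.
track.mode = 'hidden'; // Receive cue events without rendering the cues.
track.addEventListener('cuechange', () => {
  for (const cue of Array.from(track.activeCues)) {
    if (cue.type === 'org.id3') {
      // value carries the metadata item key and data, as described above.
      console.log(cue.value.key, cue.value.data);
    }
  }
});
</pre>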
<p>
Neither [[MSE-BYTE-STREAM-FORMAT-ISOBMFF]] nor [[INBANDTRACKS]]
describe handling of <code>emsg</code> boxes.
</p>
<p>
On resource-constrained devices such as smart TVs and streaming sticks,
parsing media segments to extract event information leads to a significant
performance penalty, which can have an impact on UI rendering updates if
this is done on the UI thread. There can also be an impact on the battery
life of mobile devices. Given that the media segments will be parsed anyway
by the user agent, parsing in JavaScript is an expensive overhead that
could be avoided.
</p>
<p>
Avoiding parsing in JavaScript is also important for low latency
video streaming applications, where minimizing the time taken to pass
media content through to the media element's playback buffer is
essential.
</p>
<p>
[[HBBTV]] section 9.3.2 describes a mapping between the <code>emsg</code>
fields described <a href="#mpeg-dash">above</a> and the <a>TextTrack</a>
and <a href="https://www.w3.org/TR/2018/WD-html53-20180426/semantics-embedded-content.html#datacue"><code>DataCue</code></a>
APIs. A <a>TextTrack</a> instance is created for each event
stream signalled in the MPD document (as identified by the
<code>schemeIdUri</code> and <code>value</code>), and the
<a href="https://html.spec.whatwg.org/multipage/media.html#dom-texttrack-inbandmetadatatrackdispatchtype"><code>inBandMetadataTrackDispatchType</code></a>
<a>TextTrack</a> attribute contains the <code>scheme_id_uri</code>
and <code>value</code> values. Because HbbTV devices include a native
DASH client, parsing of the MPD document and creation of the
<a>TextTrack</a>s is done by the user agent, rather than by
application JavaScript code.
</p>
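<p>
By way of illustration, an application running on such a device might
locate the <a>TextTrack</a> created for a given event stream and
subscribe to its cues; the scheme URI below is an example value only:
</p>
<pre class="example">
// Sketch: locating an in-band event stream exposed by the user agent.
const SCHEME_ID_URI = 'urn:example:event:2020'; // illustrative value only
const video = document.getElementById('video'); // assumed element id

video.textTracks.addEventListener('addtrack', (event) => {
  const track = event.track;
  if (track.kind === 'metadata' &&
      track.inBandMetadataTrackDispatchType.startsWith(SCHEME_ID_URI)) {
    track.mode = 'hidden';
    track.oncuechange = () => {
      // Each active cue carries the message_data of one emsg event.
      for (const cue of Array.from(track.activeCues)) {
        console.log(cue.startTime, cue);
      }
    };
  }
});
</pre>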
</section>
<section>
<h3><a>TextTrackCue</a>s with unbounded duration</h3>
<p>
It is not currently possible to create a <a>TextTrackCue</a> that
extends from a given start time to the end of a live media stream. If
the stream duration is known, the content author can set the cue's
<code>endTime</code> equal to the media duration. However, for live
media streams, where the duration is unbounded, it would be useful to
allow content authors to specify that the <a>TextTrackCue</a> duration
is also unbounded, e.g., by allowing the <code>endTime</code> to be
set to <code>Infinity</code>. This would be consistent with the
<a>media element</a>'s <code>duration</code> property, which can be
<code>Infinity</code> for unbounded streams.
</p>
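<p>
A minimal sketch of the desired authoring pattern is shown below. Note
that this is not currently possible: <a>VTTCue</a> requires a finite
<code>endTime</code>, so the example illustrates the gap rather than
working code.
</p>
<pre class="example">
// Sketch only: current implementations reject a non-finite endTime.
// The intent is for the cue to remain active until the live stream ends.
const track = video.addTextTrack('metadata'); // 'video' is an assumed media element
const cue = new VTTCue(120, Infinity, JSON.stringify({ segment: 'post-game' }));
track.addCue(cue);
</pre>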
</section>
<section>
<h3>Synchronized rendering of web resources</h3>
<p>
In browsers, non-media web rendering is handled through repaint
operations at a rate that generally matches the display refresh rate
(e.g., 60 times per second), following the user's wall clock. A web
application can schedule actions and render web content at specific
points on the user's wall clock, notably through
<a>Performance.now()</a>, <a>setTimeout()</a>, <a>setInterval()</a>,
and <a>requestAnimationFrame()</a>.
</p>
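<p>
For example, the following sketch schedules a DOM update for a target
time on the user's wall clock, expressed relative to the
<a>Performance.now()</a> clock; the overlay element is assumed:
</p>
<pre class="example">
// Run a callback close to a target wall-clock time (milliseconds,
// on the same clock as performance.now()).
function scheduleAt(targetTime, callback) {
  function tick(now) {
    if (now >= targetTime) {
      callback(now);
    } else {
      requestAnimationFrame(tick);
    }
  }
  requestAnimationFrame(tick);
}

// Example: update an (assumed) overlay element two seconds from now.
scheduleAt(performance.now() + 2000, () => {
  document.getElementById('overlay').textContent = 'Now showing: Chapter 2';
});
</pre>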
<p>
In most cases, media rendering follows a different path, whether
because it is handled by a dedicated background process or by
dedicated hardware circuitry. As a result, progress along the <a>media
timeline</a> may follow a
<a data-cite="HTML/media.html#offsets-into-the-media-resource:media-timeline-8">
clock</a> different from the user's wall clock. [[HTML]] recommends
that the media clock approximate the user's wall clock but does not
require it to match the user's wall clock.
</p>
<p>
To synchronize rendering of web content to a video with frame
accuracy, a web application needs:
</p>
<ul>
<li>
A way to track progress along the <a>media timeline</a> with
<em>sufficient precision</em>. The actual precision required depends
on the use case. Subtitles for video are typically authored against
video at the nominal video frame rate, e.g., 25 frames per second,
which corresponds to 40 milliseconds per frame, even when the actual
video frame rate gets adjusted dynamically ([[EBU-TT-D]], Annex E).
This suggests a 20 milliseconds precision, or half of the duration
of a typical video frame, to render subtitles with frame accuracy.
</li>
<li>
In cases where synchronization needs to occur at frame boundaries, a
way to tie the rendering of non media content, typically done at the
display refresh rate, with the rendering of a video frame. This need
does not replace the former one: a web application that needs to
render web content at media frame boundaries may also need to
perform actions at specific points on the <a>media timeline</a>
regardless of when the next frame gets rendered.
</li>
<li>
A way to prepare the web content to be rendered ahead of time. This
may involve fetching resources, such as images or other related media,
to be rendered.
</li>
</ul>
<p>
The following sub-sections discuss mechanisms currently available to
web applications to track progress on the <a>media timeline</a> and
render content at frame boundaries.
</p>
<section>
<h4>Using cues to track progress on the media timeline</h4>
<p>
Cues (e.g., <a>TextTrackCue</a> and <a>VTTCue</a>) are units of
time-sensitive data on a <a>media timeline</a> [[HTML]]. The <a>time
marches on</a> steps in [[HTML]] control the firing of cue DOM
events during media playback. <a>Time marches on</a> is specified to
run "when the current playback position of a media element changes"
but <em>how often</em> this should happen is unspecified.
In practice it
<a href="https://www.w3.org/2018/12/17-me-minutes.html#item06">has
been found</a> that the timing varies between browser
implementations, in some cases with a delay of up to 250 milliseconds
(which corresponds to the lowest rate at which <a>timeupdate</a>
events are expected to be fired).
</p>
<p>
There are two methods a web application can use to handle cues, both
illustrated in the sketch after the following list:
</p>
<ul>
<li>
Add an <a>oncuechange</a> handler function to the <a>TextTrack</a>
and inspect the track's <a>activeCues</a> list. Because
<a>activeCues</a> contains the list of cues that are active at the
time that <a>time marches on</a> is run, it is possible for cues
to be missed by a web application using this method, where cues
appear on the <a>media timeline</a> between successive executions
of <a>time marches on</a> during media playback. This may occur
if the cues have a short duration, or if a long-running event
handler function delays the next execution of those steps.
</li>
<li>
Add <a>onenter</a> and <a>onexit</a> handler functions
to each cue. The <a>time marches on</a> steps guarantee that
<a>enter</a> and <a>exit</a> events will be fired for all cues,
including those that appear on the <a>media timeline</a> between
successive executions of <a>time marches on</a> during media
playback. The timing accuracy of these events varies between
browser implementations, as the firing of the events is controlled
by the rate of execution of <a>time marches on</a>.
</li>
</ul>
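<p>
The following sketch shows both approaches, assuming <code>track</code>
is a <a>TextTrack</a> carrying <a>media timed event</a> cues and
<code>handleCue()</code> is an application-defined function:
</p>
<pre class="example">
// Method 1: a single cuechange handler that inspects activeCues.
// Cues with a very short duration may be missed with this approach.
track.oncuechange = () => {
  for (const cue of Array.from(track.activeCues)) {
    handleCue(cue);
  }
};

// Method 2: per-cue enter and exit handlers, which are guaranteed to
// fire for every cue, although their timing accuracy still depends on
// how often the "time marches on" steps are run.
const cue = new VTTCue(10, 10.04, JSON.stringify({ action: 'splice' }));
cue.onenter = () => handleCue(cue);
cue.onexit = () => console.log('cue ended');
track.addCue(cue);
</pre>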
<p>
An issue with handling of text track and data cue events in HbbTV
<a href="https://lists.w3.org/Archives/Public/public-inbandtracks/2013Dec/0004.html">was
reported</a> in 2013. HbbTV requires the user agent to implement an
MPEG-DASH client, and so applications must use the first of the
above methods for cue handling, which means that applications can
miss cues as described above. A similar issue has been
<a href="https://github.com/whatwg/html/issues/3137">filed</a>
against the HTML specification.
</p>
</section>
<section>
<h4>Using <code>timeupdate</code> events from the media element</h4>
<p>
Another approach to synchronizing rendering of web content to media
playback is to use the <a>timeupdate</a> DOM event, and for the
web application to manage the <a>media timed event</a> data to be
triggered, rather than use the text track cue APIs in [[HTML]].
This approach has the same synchronization
limitations as described above due to the 250 millisecond update
rate specified in <a>time marches on</a>, and so is
<a data-cite="HTML/media.html#best-practices-for-metadata-text-tracks:event-media-timeupdate">explicitly
discouraged</a> in [[HTML]]. In addition, the timing variability of
<a>timeupdate</a> events between browser engines makes them
unreliable for the purpose of synchronized rendering of web content.
</p>
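<p>
For reference, the approach described above amounts to something like
the following sketch, with an application-managed event list (the
times and payloads are illustrative):
</p>
<pre class="example">
// Application-managed list of media timed events (illustrative values).
const events = [
  { time: 10.0, fired: false, data: { action: 'show-banner' } },
  { time: 35.5, fired: false, data: { action: 'hide-banner' } }
];

const video = document.getElementById('video'); // assumed element id

// Discouraged: timeupdate fires at a browser-dependent rate, typically
// no more often than every 250 milliseconds, so triggering may be late.
video.addEventListener('timeupdate', () => {
  for (const event of events) {
    if (!event.fired && video.currentTime >= event.time) {
      event.fired = true;
      console.log('Trigger', event.data);
    }
  }
});
</pre>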
</section>
<section>
<h4>Polling the current position on the media timeline</h4>
<p>
Synchronization accuracy can be improved by polling the media
element's <a>currentTime</a> property from a <a>setInterval()</a>
callback, or by using <a>requestAnimationFrame()</a> for greater
accuracy. This technique can be useful in cases where content should be
animated smoothly in synchrony with the media, for example,
rendering a playhead position marker in an audio waveform
visualization, or displaying web content at specific points on the
<a>media timeline</a>. However, the use of <a>setInterval()</a> or