- Introduction
- Overview of proposed web platform changes
- Considered alternatives
- Stakeholder Signals
- Acknowledgements
Native UA rendering of out-of-band subtitles and captions is currently only possible with WebVTT. If a site can't or won't make its subtitles available in-band, and can't or won't publish its subtitles and captions in WebVTT, the site must render its subtitles and captions itself.
If a site implements its own subtitle and caption rendering, it incurs many costs:
- the engineering cost of maintaining bespoke text track display code;
- the inability of bespoke text track display code to respect system conventions or user preferences re: display of captions and subtitles;
- the inability of bespoke code to hook into platform features like picture-in-picture and some forms of media fullscreen;
- relatedly, the need (due to regulatory requirements) to duplicate those system user settings re: how captions and subtitles get displayed;
- the bandwidth cost of delivering all this custom JavaScript;
- the performance cost of executing all this custom JavaScript.
Given how costly this is, why do people ever go down this path? Unfortunately, delivering text tracks in WebVTT can sometimes be too costly.
- There are storage and (potentially lossy) conversion costs if you have a large existing corpus of subtitle and caption data in other formats.
- It can also be awkward and error-prone to generate WebVTT on the fly for live captioning or similar cases (e.g. the dynamic generation of text tracks using a TTS engine).
- Allow video publishers to continue storing captions in legacy or custom formats.
- Make sure users can control the presence and style of captions with system accessibility settings.
Exposing the values of user preferences to web content is unacceptable for fingerprinting reasons; given this constraint and the above goals, we therefore arrive at a third goal:
- Allow captions to be served in custom formats, but still be rendered by the browser engine.
- It's not a goal to add native processing of out-of-band text track formats other than WebVTT to the web platform.
We propose refactoring the web platform's text track APIs to decouple the delivery format of out-of-band text tracks from the browser's native ability to display subtitles and captions. This should:
- reduce the engineering cost of website code development and maintenance;
- reduce the bandwidth cost of delivering custom JavaScript;
- allow websites to (re)gain the ability to respect system conventions or user preferences re: display of captions and subtitles;
- remove the need for websites to duplicate user settings re: how captions and subtitles get displayed;
- let websites (re)gain the ability to hook into platform features like picture-in-picture and media fullscreen;
- improve performance by reducing the amount of bespoke JavaScript involved.
This refactoring may also benefit the internals of browser engines, because in-band text tracks can come in a variety of formats. It should allow the code that handles each in-band text track format to target a common rendering pipeline, the very same pipeline exposed as an API to sites for out-of-band text tracks. So this refactoring defines a low-level capability of the platform, helps to explain what browser engines already do, and is pretty straightforward to polyfill. If these points sound familiar, they should! They’re literally three pillars of the Extensible Web Manifesto.
This will require coordinated updates to several specs: CSS Pseudo-Elements, HTML, and WebVTT.
As this proposal matures, we expect it to result in pull requests on these specs.
- Move the pseudo-elements defined in WebVTT CSS extensions to CSS Pseudo-Elements, and define them in a host-language agnostic manner.
- Add a DocumentFragment `cueNode` attribute to TextTrackCue.
- Expose a constructor for TextTrackCue which takes `startTime`, `endTime`, and `cueNode` arguments.
- Define authoring and processing requirements for cue fragments (which see).
- Add `cue` and `cuebackground` global attributes.
- Move getCueAsHTML() from VTTCue up to TextTrackCue and define it in terms of `cueNode`.
- Move integration of ::cue and ::cue-region with the platform's text track model from WebVTT to HTML.
- Specify that VTTCue() calls its superclass' new constructor, generating `cueNode` from `text` as described in the next item.
- Setting the `text` attribute should update `cueNode` by applying the WebVTT cue text DOM construction rules to the result of applying the WebVTT cue text parsing rules to the cue text.
- Ensure that the WebVTT cue text DOM construction rules generate a DocumentFragment compatible with the authoring and processing requirements of cue fragments.
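To make the proposed shape concrete, here is a minimal sketch of how a site might turn captions stored in its own format into cue markup and hand them to the browser. The `CaptionRecord` type and the `escapeHtml` and `cueMarkup` helpers are hypothetical names invented for this illustration, and the three-argument `TextTrackCue` constructor shown in the trailing comment is the one proposed here, not an API that shipping browsers can be assumed to support.

```typescript
// Hypothetical caption record in a site's own storage format
// (an assumption for this sketch, not part of the proposal).
interface CaptionRecord {
  start: number; // cue start time, in seconds
  end: number;   // cue end time, in seconds
  text: string;  // plain caption text; "\n" separates lines
}

// Escape characters that would otherwise be parsed as markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build cue fragment markup: a `cuebackground` element wrapping a `cue`
// element, with line breaks expressed as <br>.
function cueMarkup(record: CaptionRecord): string {
  const lines = record.text.split("\n").map(escapeHtml).join("<br>");
  return `<p cuebackground><span cue>${lines}</span></p>`;
}

// In a browser implementing the proposal, the markup might be used roughly
// like this (`video` is assumed to be an HTMLVideoElement):
//
//   const template = document.createElement("template");
//   template.innerHTML = cueMarkup(record);
//   const track = video.addTextTrack("captions", "English", "en");
//   track.mode = "showing";
//   // Proposed constructor: startTime, endTime, cueNode.
//   track.addCue(new TextTrackCue(record.start, record.end, template.content));
```

The conversion step stays in site code, so the storage format can be anything, while display is left to the browser and therefore to the user's caption settings.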
A cue fragment is a DocumentFragment containing cue text, represented in a limited subset of HTML and styled with CSS using ::cue and ::cue-region.
The `br`, `div`, `img`, `p`, `rb`, `rt`, `rtc`, `ruby`, and `span` elements are allowed, as are text nodes.
The `cue` and `cuebackground` attributes must each appear once.
The `cue` attribute identifies the element containing the cue text, and the `cuebackground` attribute identifies the cue's background. The `cue` attribute must appear on an element which is a descendant of the element on which the `cuebackground` attribute appears.
Here's a simple example:
<p cuebackground><span cue>This is a simple cue.</span></p>
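The authoring rules above can be checked mechanically. The following is a rough sketch of such a check, written against a minimal node shape rather than real DOM nodes so that it is self-contained. `CueNodeLike`, `validateCueFragment`, and the `el`, `text`, and `fragment` helpers are all hypothetical names for this illustration, and the proposal's actual processing requirements may differ from this reading of the rules.

```typescript
// Minimal node shape for this sketch; it mirrors the few DOM properties the
// check needs (nodeType 1 = element, 3 = text, 11 = document fragment).
interface CueNodeLike {
  nodeType: number;
  localName?: string;
  hasAttribute?: (name: string) => boolean;
  childNodes: CueNodeLike[];
}

const ALLOWED_ELEMENTS = ["br", "div", "img", "p", "rb", "rt", "rtc", "ruby", "span"];

// Returns a list of rule violations; an empty list means the fragment passed.
function validateCueFragment(root: CueNodeLike): string[] {
  const errors: string[] = [];
  let cueCount = 0;
  let backgroundCount = 0;

  // Depth-first walk. insideBackground is true when some ancestor of the
  // current node carries the cuebackground attribute.
  function visit(node: CueNodeLike, insideBackground: boolean): void {
    let inside = insideBackground;
    if (node.nodeType === 1) {
      const name = node.localName === undefined ? "" : node.localName;
      if (ALLOWED_ELEMENTS.indexOf(name) === -1) {
        errors.push("element <" + name + "> is not allowed");
      }
      const has = (attr: string): boolean =>
        node.hasAttribute !== undefined && node.hasAttribute(attr);
      if (has("cue")) {
        cueCount++;
        if (!insideBackground) {
          errors.push("cue must be on a descendant of the cuebackground element");
        }
      }
      if (has("cuebackground")) {
        backgroundCount++;
        inside = true;
      }
    } else if (node.nodeType !== 3 && node.nodeType !== 11) {
      errors.push("node type " + node.nodeType + " is not allowed");
    }
    for (const child of node.childNodes) {
      visit(child, inside);
    }
  }

  visit(root, false);
  if (cueCount !== 1) errors.push("cue must appear once, found " + cueCount);
  if (backgroundCount !== 1) errors.push("cuebackground must appear once, found " + backgroundCount);
  return errors;
}

// Helpers for building test fragments without a DOM.
function el(name: string, attrs: string[], children: CueNodeLike[] = []): CueNodeLike {
  return {
    nodeType: 1,
    localName: name,
    hasAttribute: (attr) => attrs.indexOf(attr) !== -1,
    childNodes: children,
  };
}
function text(): CueNodeLike {
  return { nodeType: 3, childNodes: [] };
}
function fragment(children: CueNodeLike[]): CueNodeLike {
  return { nodeType: 11, childNodes: children };
}

// The simple example: <p cuebackground><span cue>...</span></p>
const simpleExample = fragment([el("p", ["cuebackground"], [el("span", ["cue"], [text()])])]);
const simpleErrors = validateCueFragment(simpleExample); // empty: no violations
```

Note that placing `cue` and `cuebackground` on the same element is reported as a violation here, because an element is not its own descendant.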
We initially considered defining an abstract data model for the formatted text track data. The goal was to define an expressive model that would be easy to generate and manipulate from JavaScript, and that would be straightforward to target from popular formats, minimally WebVTT and IMSC1. WebVTT would then be redefined in terms of this data model.
At TPAC 2019 in Fukuoka, we got a lot of feedback that directly supplying HTML & CSS would be preferable.
Some disliked the esthetics of having a separate tree of nodes that looked an awful lot like, but were not, DOM nodes. Authors of popular JS media libraries made the point that they'll need to keep the code they use to generate HTML anyway, for use on older browsers that don't support these API changes. All agreed that authoring HTML and CSS is something they're comfortable doing in this case.
- Apple / Safari / WebKit: prototyped as an experimental feature named "Generic Text Track Cue API". The prototype is available in Safari Technology Preview.
- Firefox / Gecko: Jean-Yves Avenard and Nils liked the idea. It needs review by someone who works on captioning in Gecko.
- Google / Chrome / Blink: Mixed signals. Chris Cunningham and Joey Parrish have both expressed interest / support, and Mounir Lamouri has expressed opposition.
- Developers of several JS video libraries and websites have expressed support for the general approach, at FOMS and TPAC in 2019:
  - Gary Katsevman of Video.js
  - Joey Parrish of Shaka Player
  - Pierre-Anthony Lemieux of imscJS
  - A member of the Netflix frontend team
Pierre-Anthony Lemieux did extensive work on the earlier form of this proposal, in which he developed an abstract data model—based on IMSC1—that the API would have ingested. While we chose to go in a somewhat (though not very) different direction after feedback at TPAC 2019, we are forever indebted to his excellent groundwork.