This proposal is an early design sketch by Yash Raj Bharti to describe the problem below and solicit feedback on the proposed solution. It has not been approved to ship in Chrome.
- Yash Raj Bharti, Open Source Contributor, Liquid Galaxy (Google Open Source) Mentor
- Introduction
- Goals
- Non-goals
- User research
- Use cases
- Potential Solution
- Detailed design discussion
- Considered alternatives
- Stakeholder Feedback / Opposition
- Feedback Q&A
- References & acknowledgements
Only 0.5% of web videos include a <track> tag (source: The Web Almanac by HTTP Archive 2024 Report), leaving the vast majority of online video content inaccessible to individuals who rely on captions. This project proposes a built-in UI option in video elements that enables browsers to auto-generate captions dynamically when a <track> element is missing. This solution aims to improve web accessibility without requiring explicit authoring changes.
- Address the accessibility gap by promoting captions as a default feature on the web.
- Provide an implicit browser feature to generate captions dynamically when no <track> element exists.
- Simplify the process of captioning for users, reducing manual effort for developers.
- Allow user agents (UAs) to innovate and compete in improving the user experience.
- Replace manually created captions or subtitles.
- Standardize the specific implementation of auto-generated captions across browsers.
- Focus on live video streams (future scope).
The proposal is informed by research indicating that captions improve content comprehension, user engagement, and accessibility for people with hearing impairments and non-native speakers. The 0.5% adoption rate of <track> tags highlights the need for a low-effort solution that works automatically.
The Captions on the Fly GitHub repo provides a working example of auto-generated captions. It uses web technologies (HTML, CSS, and JS) combined with Whisper models run through Hugging Face's Transformers.js.
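Whatever model produces the transcript, its output has to become timed cues before a player can render it. The following is a minimal sketch of that step, not code from the repo. It assumes the recognizer returns chunks shaped like `{ text, timestamp: [start, end] }` with times in seconds; this resembles what Transformers.js reports when timestamps are requested, but the shape should be treated as an assumption. The function names `formatTimestamp` and `chunksToWebVTT` are hypothetical.

```javascript
// Formats a time in seconds as a WebVTT timestamp (hh:mm:ss.ttt).
function formatTimestamp(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  const pad = (n, width) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// Escapes the characters that have special meaning in WebVTT cue text
// and collapses whitespace so a blank line cannot end the cue early.
function escapeCueText(text) {
  return text
    .trim()
    .replace(/\s+/g, " ")
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Converts recognizer chunks (assumed shape: { text, timestamp: [start, end] })
// into the text of a WebVTT file. Chunks with missing or non-increasing
// timestamps, or with empty text, are skipped.
function chunksToWebVTT(chunks) {
  const cues = chunks
    .filter(({ text, timestamp: [start, end] }) =>
      Number.isFinite(start) && Number.isFinite(end) && end > start && text.trim() !== "")
    .map(({ text, timestamp: [start, end] }) =>
      `${formatTimestamp(start)} --> ${formatTimestamp(end)}\n${escapeCueText(text)}`);
  return ["WEBVTT", ...cues].join("\n\n") + "\n";
}
```

In a page, the resulting string could be wrapped in a `Blob` and attached to the video through a `<track>` element whose `src` is an object URL, or the cues could be added one by one with `VTTCue` and `addTextTrack()`.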
A small business uploads product videos to their website but lacks the resources to create captions manually. By enabling the built-in captions option in the browser UI, users can generate captions dynamically, improving accessibility and SEO.
An educational platform hosts a library of lecture videos. By leveraging the browser’s built-in captioning option, viewers can enable captions in their preferred language without requiring the platform to provide them manually.
In environments like the metro, train, or library, where audio cannot be played out loud, auto-generated captions enable users to follow video content silently. This feature improves usability and accessibility, especially for people who rely on visual text cues in noise-sensitive or quiet environments.
Instead of introducing an explicit autogenerate attribute in the <track> element, this proposal suggests that browsers implement an implicit UI feature in the video player. This feature allows users to opt in to auto-generated captions when no <track> element is present.
For example:
<video controls>
<source src="example.mp4" type="video/mp4">
</video>
When the browser detects the absence of a <track> element, it displays a captions button in the video player controls. Clicking the button triggers the browser’s built-in speech recognition and language models to generate captions dynamically.
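The detection step can be approximated in page script today. The following is a minimal sketch of the check, not a description of how any browser implements it. It relies on the standard `textTracks` property of media elements, whose entries expose a `kind`. The helper name `shouldOfferGeneratedCaptions` is hypothetical. Treating only the "captions" and "subtitles" kinds as existing captions is an assumption, since the proposal does not say which track kinds count.

```javascript
// Track kinds that are assumed to mean "captions already exist".
const CAPTION_KINDS = new Set(["captions", "subtitles"]);

// Returns true when the video has no caption or subtitle track, i.e. when
// a player could offer generated captions under this proposal.
// `video` is expected to expose `textTracks` like an HTMLVideoElement does.
function shouldOfferGeneratedCaptions(video) {
  // TextTrackList is array-like, so Array.from can iterate it.
  const tracks = Array.from(video.textTracks || []);
  return !tracks.some((track) => CAPTION_KINDS.has(track.kind));
}
```

On a page this could be called as `shouldOfferGeneratedCaptions(document.querySelector("video"))`. Because `textTracks` also lists tracks added by script, the check covers more than `<track>` children in the markup.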
Small businesses can rely on the browser’s built-in captions feature to make their video content accessible without hiring transcription services.
Educational platforms can provide multi-language captions through the browser’s built-in UI, reducing the need for manual transcription efforts.
Users in such environments (the metro, train, or library) can rely on auto-generated captions without the need for additional hardware or software.
How should browsers handle unsupported languages in the auto-generated captions feature?
Solution: Fall back to the default language or display a warning to the user that the selected language is unavailable.
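One way to read this fallback rule is sketched below, combining both options: fall back and also report a warning. The function name `resolveCaptionLanguage`, its parameters, and the warning wording are all hypothetical. Matching on the primary language subtag (for example "pt-BR" to "pt") is an extra assumption that the proposal does not mention.

```javascript
// Picks the caption language to use for a request.
// `supported` is a list of language tags the (hypothetical) caption engine
// can produce; `defaultLanguage` is used when nothing matches.
function resolveCaptionLanguage(requested, supported, defaultLanguage) {
  const normalize = (tag) => tag.toLowerCase();
  const available = new Map(supported.map((tag) => [normalize(tag), tag]));

  // 1. Exact match, ignoring case.
  const exact = available.get(normalize(requested));
  if (exact) return { language: exact, warning: null };

  // 2. Assumption: a regional variant may use its base language.
  const primary = normalize(requested).split("-")[0];
  const base = available.get(primary);
  if (base) return { language: base, warning: null };

  // 3. Fall back to the default and tell the user why.
  return {
    language: defaultLanguage,
    warning: `Captions are not available in "${requested}"; using "${defaultLanguage}" instead.`,
  };
}
```

Returning the warning as data rather than displaying it leaves each user agent free to decide how to surface it in its own UI.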
How can we prevent abuse or inaccurate captions from degrading the user experience?
Solution: Provide visual indicators (e.g., a "Generated Captions" badge) and allow users to toggle the feature on or off easily.
Add the autogenerate attribute directly to the <video> element.
Rejected: Lacks flexibility for multi-language support and shifts responsibility to developers rather than empowering users.
Leave caption generation entirely to external tools.
Rejected: Adds complexity for developers and excludes dynamic web content.
- Positive Feedback: Accessibility advocates emphasize the importance of lowering barriers to captioning.
- Negative Feedback: Some concerns about performance impacts for large video libraries.
- Answer: While manual captions are often more accurate and tailored, the built-in captions feature is designed as a complementary tool, not a replacement. It ensures accessibility where no captions exist, providing an immediate benefit. To encourage manual captioning, browsers could display a notification recommending content creators add their own captions.
2. Does it make sense for every user who watches a video to run the same subtitle creation steps repeatedly, or would it be better to run this once on the server and deliver the same subtitles to everyone?
- Answer: Server-side generation ensures consistency and reduces computational load for users, but client-side generation offers flexibility for dynamic or personalized content. A hybrid approach—generating and caching captions locally—could optimize performance.
3. Could a centralized storage solution help reduce energy waste by avoiding repeated caption generation?
- Answer: Yes, a centralized storage system could significantly reduce redundancy and energy usage. For example, captions could be generated once and stored on a server or within a content delivery network (CDN). Subsequent users accessing the same video would retrieve pre-generated captions, saving computational resources. This approach would require careful consideration of storage costs, privacy, and synchronization across devices.
- Answer: For live-streamed videos, real-time caption generation is essential. The proposed built-in feature is designed to handle such cases, ensuring accessibility for live events.
- Answer: Browsers can handle text overflow by applying word wrapping, truncation, or adjustable font sizes. Developers could also provide styling options for captions to manage layout constraints.
- Answer: The feature can include synchronization buffers and confidence thresholds to align captions with spoken content and suppress low-quality captions.
- Answer: The <track> element by definition says, “Captions exist for this video, and this element provides them.” Introducing a case where a <track> element indicates, “Captions don’t exist, but the UA should generate them,” contradicts this purpose. Instead, leveraging the absence of a <track> element as a signal for auto-generation better aligns with the intent of the standard.
- Answer: Real-time video captioning for live streams is challenging due to performance constraints. While initial implementations may not be perfect, rigorous testing and advancements in balancing model size and efficiency can help optimize this feature over time.
- Answer: Yes, incorporating expressive captions, similar to features seen on Google Pixel and Android devices, could enhance user experience. By including additional contextual or emotional cues, such captions would lay the foundation for more immersive accessibility solutions.
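Two mechanisms mentioned in the answers above are concrete enough to sketch: caching generated captions locally so the same video is not transcribed twice, and using a confidence threshold to suppress low-quality captions. The code below is a rough illustration under stated assumptions, not a proposed API. `CaptionCache`, `filterAndAlignSegments`, the `generate` callback, and the segment shape `{ text, start, end, confidence }` are all hypothetical. The "synchronization buffer" is interpreted here as a fixed time offset, which is only one possible reading.

```javascript
// Caches generated captions per (video URL, language) pair in memory.
class CaptionCache {
  // `generate` is an assumed function: (videoUrl, language) => captions or Promise.
  constructor(generate) {
    this.generate = generate;
    this.entries = new Map();
  }

  // Returns a promise for the captions, generating them at most once per key.
  get(videoUrl, language) {
    const key = JSON.stringify([videoUrl, language]);
    if (!this.entries.has(key)) {
      // Store the promise itself so concurrent requests share one generation run.
      const pending = new Promise((resolve) => resolve(this.generate(videoUrl, language)));
      // Forget failed runs so a later request can retry.
      pending.catch(() => this.entries.delete(key));
      this.entries.set(key, pending);
    }
    return this.entries.get(key);
  }
}

// Drops segments whose confidence is below `minConfidence` and shifts the
// remaining ones by `offsetSeconds`, clamping times at zero.
function filterAndAlignSegments(segments, { minConfidence = 0.6, offsetSeconds = 0 } = {}) {
  return segments
    .filter((segment) => segment.confidence >= minConfidence)
    .map((segment) => ({
      ...segment,
      start: Math.max(0, segment.start + offsetSeconds),
      end: Math.max(0, segment.end + offsetSeconds),
    }));
}
```

The cache lives only as long as the page. Keeping captions across visits would need persistent storage, and sharing them between users would need the centralized storage discussed in question 3. The default threshold of 0.6 is an arbitrary placeholder.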
Many thanks for valuable feedback / advice / support from:
- Thomas Steiner, Google
- Adam Argyle, Google
- Dirk Ginader, Google
- Kenji Baheux, Google
- Mike Smith, W3C
- The Web Almanac by HTTP Archive 2024