Ambiguity of the term "word" and its impact on internationalization and testability #547

@murata2makoto

The use of the term “word” in accessibility specifications (and in the Draft Definitions of the Text and Wording subgroup) raises concerns that are even more fundamental than those associated with “character” (see #546). Unlike character-related units, which at least have a formalized representation within Unicode, the concept of a word is explicitly acknowledged to be language-dependent, purpose-dependent, and technology-dependent.

Unicode addresses word boundaries in Unicode Standard Annex #29: Unicode Text Segmentation (https://www.unicode.org/reports/tr29/), which makes clear that word segmentation is not a single, universal operation. Instead, it varies significantly depending on the writing system, the linguistic structure of the language, and the intended use case. UAX #29 explicitly states that word boundaries depend on purpose. For example, word segmentation suitable for cursor movement may differ from segmentation intended for line breaking, text selection, search, indexing, or text-to-speech.
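To make the point concrete even for English, here is a minimal sketch that counts "words" in one sentence under two simple rules. Neither rule is the UAX #29 algorithm; both functions and the sample sentence are illustrative assumptions, chosen only to show that the count depends on the rule applied.

```python
import re


def split_on_whitespace(text: str) -> list[str]:
    """Treat every whitespace-delimited run as one word."""
    return text.split()


def split_on_word_chars(text: str) -> list[str]:
    """Treat every maximal run of letters, digits, or underscores as one word."""
    return re.findall(r"\w+", text)


# Hypothetical sample sentence with hyphens and an apostrophe.
sample = "A state-of-the-art e-reader doesn't re-flow text."

whitespace_words = split_on_whitespace(sample)
regex_words = split_on_word_chars(sample)

print(len(whitespace_words), whitespace_words)
print(len(regex_words), regex_words)
```

The whitespace rule keeps "state-of-the-art" and "doesn't" as single units, while the regex rule splits them at every hyphen and apostrophe, so the two counts differ for the same text.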

Crucially, this variability is not merely a matter of natural language diversity. It is a consequence of differing technical goals. Word segmentation for assistive technologies may prioritize predictability, learnability, or cognitive load reduction, while other applications may prioritize linguistic precision or algorithmic efficiency. Unicode explicitly acknowledges this tension and does not claim to resolve it.

The limitations of the concept become especially evident in languages that do not use explicit word delimiters, such as Japanese and Chinese. In such languages, what constitutes a “word” depends heavily on morphological analysis, context, and purpose. There is no single segmentation that can be assumed to be correct across all accessibility-related operations. In Japanese, units corresponding to phrases (such as bunsetsu) are often more relevant to reading, comprehension, and speech output than units corresponding to words. This further demonstrates that “word” cannot be assumed to be a universal or accessibility-relevant unit across languages.
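A small sketch of the delimiter problem, assuming the naive rule "a word is a whitespace-delimited run". The two sample sentences are my own illustrations, not taken from any specification.

```python
def whitespace_words(text: str) -> list[str]:
    """Naive rule: a word is any run of non-whitespace characters."""
    return text.split()


english = "The cat sat on the mat"
japanese = "猫がマットの上に座った"  # roughly the same sentence in Japanese

print(len(whitespace_words(english)))   # one unit per space-separated token
print(len(whitespace_words(japanese)))  # the whole sentence is a single unit
```

Under this rule the Japanese sentence is reported as a single "word", which is plainly not a useful unit for reading or speech output. Any better answer requires choosing a segmentation model.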

Appealing to an intuitive or “user-understood” notion of word does not resolve this defect. Unicode’s own documentation demonstrates that “word” is not a stable perceptual unit even for native speakers, let alone across languages and scripts. The relevant question is therefore not what a word “normally means,” but how the specification expects a word to be identified, processed, and tested in a reproducible manner.

In this context, the unqualified use of the term “word” in accessibility requirements is inherently problematic. Without specifying which segmentation model, which purpose, or which operational definition is intended, the specification leaves implementations with no objective basis for determining compliance. Accordingly, the use of “word” as a normative unit in accessibility specifications represents a substantive conceptual flaw. It undermines testability, interoperability, and the claim of language-neutral applicability.
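The testability problem can be sketched as follows. The limit of five "words" is a hypothetical requirement invented for this example, and the two segmentations are written by hand for illustration; they are not the output of any particular analyzer.

```python
from typing import Sequence


def within_word_limit(units: Sequence[str], limit: int) -> bool:
    """Check a hypothetical 'at most `limit` words' requirement."""
    return len(units) <= limit


# Two hand-written segmentations of the same sentence, 猫がマットの上に座った.
bunsetsu_units = ["猫が", "マットの", "上に", "座った"]                # phrase-like units
morpheme_units = ["猫", "が", "マット", "の", "上", "に", "座っ", "た"]  # morpheme-like units

# Both segmentations cover exactly the same text.
assert "".join(bunsetsu_units) == "".join(morpheme_units)

HYPOTHETICAL_LIMIT = 5  # assumed for this example only

bunsetsu_verdict = within_word_limit(bunsetsu_units, HYPOTHETICAL_LIMIT)
morpheme_verdict = within_word_limit(morpheme_units, HYPOTHETICAL_LIMIT)

print("bunsetsu segmentation passes:", bunsetsu_verdict)
print("morpheme segmentation passes:", morpheme_verdict)
```

The same text passes under one segmentation and fails under the other, so two conforming testers could reach opposite verdicts unless the specification names the segmentation model.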

Labels

Guideline Group: Text And Wording
i18n: Of interest to internationalization working group
i18n-tracker: Group bringing to attention of Internationalization, or tracked by i18n but not needing response.
