Ambiguity of the term "character" and its impact on internationalization and testability

The term “character” as used in accessibility-related specifications (and [Draft Definitions](https://docs.google.com/document/d/1WhaesDbhuB8SmOeX05-qfo0nSvdXHugpCJUw8rA9nfI/edit?pli=1&tab=t.0#heading=h.hw5a0snww1ed) of the Text and Wording subgroup) raises a fundamental concern. In the context of Unicode, character is a highly nuanced and context-dependent concept, and it cannot be assumed to correspond to a unit that users perceive, read, or manipulate as a single entity. Treating character as a self-evident unit introduces ambiguity at the normative level and undermines the technical soundness of the specification.

Unicode itself explicitly distinguishes between encoded units and user-perceived text. A single unit that users experience as “one character” may consist of multiple code points, as in the case of combining character sequences, variation sequences, or emoji ZWJ sequences. A single code point may in turn occupy multiple code units, as with UTF-16 surrogate pairs. This distinction is formally documented in [Unicode Standard Annex #29: Unicode Text Segmentation](https://www.unicode.org/reports/tr29/), which defines both legacy grapheme clusters and extended grapheme clusters precisely because earlier assumptions about characters proved inadequate. The existence of these two models already demonstrates that “character” is not a stable or self-explanatory concept even within Unicode.
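The gap between encoded units and what a reader sees can be illustrated with a minimal Python sketch. It uses only the standard library, so it counts code points, UTF-16 code units, and UTF-8 bytes, and does not attempt UAX #29 grapheme segmentation. The `counts` helper and the sample strings are illustrative assumptions, not part of any specification.

```python
import unicodedata

def counts(text: str) -> dict:
    """Count one string under several encoding-level notions of 'unit'."""
    return {
        "code_points": len(text),                                  # Python str length
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,    # 2 bytes per unit
        "utf8_bytes": len(text.encode("utf-8")),
        "nfc_code_points": len(unicodedata.normalize("NFC", text)),
    }

# Each sample is typically rendered as a single visible unit.
samples = {
    "e + combining acute": "e\u0301",
    "grinning face (non-BMP)": "\U0001F600",
    "family emoji ZWJ sequence": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
    "heart + variation selector-16": "\u2764\uFE0F",
}

results = {name: counts(text) for name, text in samples.items()}

for name, result in results.items():
    print(f"{name}: {result}")
```

None of the four counts agrees across all samples, even though each sample is normally perceived as one unit.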

However, even extended grapheme clusters are specified as a default, tailorable approximation of user-perceived characters, not as a definitive model of user perception or comprehension. They provide a lower bound for avoiding incorrect splitting of encoded text, but they do not capture higher-level, language-dependent textual units that are directly relevant to accessibility.

This limitation is particularly evident in documents using CJK scripts and ruby annotations. In such contexts, a base ideograph and its associated ruby annotation may together form a single meaningful unit for reading, navigation, comprehension, or text-to-speech output. Treating these components as independent “characters” can fragment information that users naturally and legitimately process as a whole. Unicode grapheme clusters do not model such relationships and are not intended to.
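As a rough illustration of the ruby case, the following sketch parses a small HTML fragment with Python's standard `html.parser` and contrasts the base-plus-annotation pairs with a flat list of code points. The `RubyPairs` class and the sample markup are hypothetical, and the sketch assumes simple `<ruby>` content with one `<rt>` per base and no `<rp>` or nested markup.

```python
from html.parser import HTMLParser

class RubyPairs(HTMLParser):
    """Collect (base text, annotation) pairs from simple <ruby> markup."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._base = ""       # base text seen since the last annotation
        self._in_ruby = False
        self._in_rt = False

    def handle_starttag(self, tag, attrs):
        if tag == "ruby":
            self._in_ruby = True
        elif tag == "rt":
            self._in_rt = True

    def handle_endtag(self, tag):
        if tag == "rt":
            self._in_rt = False
        elif tag == "ruby":
            self._in_ruby = False
            self._base = ""

    def handle_data(self, data):
        if not self._in_ruby:
            return
        if self._in_rt:
            # The annotation closes the unit started by the pending base text.
            self.pairs.append((self._base, data))
            self._base = ""
        else:
            self._base += data

markup = "<ruby>漢<rt>かん</rt>字<rt>じ</rt></ruby>"
parser = RubyPairs()
parser.feed(markup)
parser.close()

pairs = parser.pairs
# Flattening discards the base/annotation relationship entirely.
flat = [ch for base, annotation in pairs for ch in base + annotation]

print(pairs)
print(flat)
```

The markup carries two reading units, while the flattened text yields five separate items, and nothing at the code point or grapheme cluster level records which annotation belongs to which base.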

As a result, accessibility requirements that rely—explicitly or implicitly—on the notion of “character” face a twofold problem. First, they fail to specify whether character refers to a code point, a legacy grapheme cluster, an extended grapheme cluster, or something else entirely. Second, even if grapheme clusters were assumed, they remain insufficient for capturing accessibility-relevant units in many non-Western writing systems.
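The first problem has a direct effect on testability, which the sketch below tries to show. It evaluates a made-up rule ("at most 4 characters") under several readings of “character” that Python's standard library can compute. The `LIMIT` value, the `verdicts` helper, and the rule itself are assumptions for illustration only and do not correspond to any actual success criterion. Grapheme clusters are deliberately left out because the sketch avoids third-party segmentation libraries.

```python
import unicodedata

LIMIT = 4  # hypothetical threshold, not taken from any specification

def verdicts(text: str, limit: int = LIMIT) -> dict:
    """Return pass/fail for the same text under different counting units."""
    measures = {
        "code points": len(text),
        "UTF-16 code units": len(text.encode("utf-16-le")) // 2,
        "code points after NFC": len(unicodedata.normalize("NFC", text)),
        "code points after NFD": len(unicodedata.normalize("NFD", text)),
    }
    return {name: count <= limit for name, count in measures.items()}

accented = "\u00e9t\u00e9"            # "été" with precomposed letters
emoji = "ok\U0001F600\U0001F600"      # two letters and two non-BMP emoji

accented_result = verdicts(accented)
emoji_result = verdicts(emoji)

print(accented_result)
print(emoji_result)
```

The accented word passes under three readings and fails after NFD decomposition, while the emoji string fails only when UTF-16 code units are counted. Two conforming test tools could therefore report opposite results for identical content.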

It may be argued that, by “character,” the specification intends to mean a user-perceived character. This clarification does not resolve the issue. WCAG does not define user-perceived character, and although UAX #29 uses the phrase, it only approximates the concept through default grapheme clusters that implementations may tailor. Without an operational definition, the phrase merely replaces one undefined term with another. The critical question therefore remains unanswered: what is a user-perceived character, and how is it determined in a testable and interoperable manner?

Accessibility specifications must be implementable, testable, and interoperable across languages, scripts, platforms, and assistive technologies. Reliance on an undefined notion of “character”—whether qualified as “user-perceived” or not—fails to meet these requirements. This is not a matter of editorial clarity but a substantive defect in the conceptual foundation of the specification.

Unless the term character is rigorously defined or its use reconsidered, accessibility requirements risk being formally correct while practically indeterminate, particularly for users of non-Western scripts. This objection concerns the validity of the specification as a global accessibility standard, not the choice of any particular technical remedy.

Ambiguity of the term "character" and its impact on internationalization and testability #546