
quickjs-libc: add TextEncoder and TextDecoder #1463

Open
hhugo wants to merge 5 commits into quickjs-ng:master from hhugo:feat/text-encoder-decoder

@hhugo hhugo commented May 7, 2026

Disclaimer: the current state was generated with AI.

I found another implementation in nginx/njs@447d66d

The reason for pushing the feature is ocsigen/js_of_ocaml#2229.

Related to #1027 (reply in thread).

Summary

Adds the WHATWG Encoding API's TextEncoder and TextDecoder (UTF-8
only — the only encoding the spec actually requires) to quickjs-libc,
installed on the global from js_std_add_helpers alongside console,
print, and scriptArgs. Reuses the existing utf8_decode/utf8_encode
helpers in cutils.h; no new UTF-8 machinery.

Behavior of note:

  • TextEncoder.encode / encodeInto treat input as a USVString:
    lone surrogates are replaced with U+FFFD before UTF-8 encoding, so
    encode("\uD800") returns [EF BF BD], not the CESU-8-like
    [ED A0 80] that JS_ToCStringLen would otherwise emit.
  • encodeInto stringifies the source argument before validating
    the destination is a Uint8Array (matches WebIDL ordering), and
    stops short rather than writing a partial sequence that doesn't fit.
  • TextDecoder.decode distinguishes truly partial sequences from
    ones that are already known invalid via per-lead continuation-byte
    bounds: [E0,41] decodes to U+FFFD U+0041 (not just U+FFFD),
    [F4,90] to U+FFFD U+FFFD, [E0,A0] to a single U+FFFD, etc.
    Stream mode no longer buffers bytes that already violate the
    continuation rules.
  • Label parser accepts the WHATWG list of UTF-8 aliases
    (case-insensitive, ASCII-whitespace trimmed); other encodings throw
    RangeError.
  • BOM, ignoreBOM, fatal, and stream modes follow
    the spec: BOM stripped at start of a non-stream run (or first stream
    call), kept mid-stream and with ignoreBOM; fatal throws on invalid
    bytes including a partial pending on flush; stream defers only when
    every available byte is a valid continuation.
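
The encoder and decoder behaviors listed above can be observed from script. This is a minimal sketch, assuming a runtime with spec-conforming `TextEncoder`/`TextDecoder` globals (qjs built with this PR, a browser, or Node.js). The `hex` and `decodeBytes` helpers are illustrative only and not part of the PR:

```javascript
// Illustrative helper (not part of the PR): format bytes as upper-case hex.
const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

// A lone surrogate is encoded as U+FFFD (EF BF BD), not as a CESU-8-like triple.
const loneSurrogate = hex(new TextEncoder().encode("\uD800"));

// Non-fatal decoding: an invalid continuation vs. a merely truncated sequence.
const decodeBytes = (bytes) => new TextDecoder().decode(new Uint8Array(bytes));
const invalidThenAscii = decodeBytes([0xe0, 0x41]); // U+FFFD U+0041
const twoReplacements = decodeBytes([0xf4, 0x90]);  // U+FFFD U+FFFD
const truncated = decodeBytes([0xe0, 0xa0]);        // single U+FFFD
```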

js_std_add_text_codecs(JSContext *) is exposed as a public
quickjs-libc entry point and called from run-test262's local-mode
JS_NewCustomContext so the new tests run under make test.

Test plan

  • make test — 0/60 errors with new tests/test_text_codec.js.
  • qjs --module tests/test_text_codec.js — passes.
  • Test coverage: encoder ToString coercion and lone-surrogate
    replacement; encodeInto read/written accounting,
    partial-write rule, argument-error ordering; decoder label
    parsing (aliases + whitespace + case), options dict, input types
    (Uint8Array/ArrayBuffer/Int8Array/offset views); BOM
    handling (stripped/middle/ignoreBOM/split-across-stream);
    partial-vs-invalid trailing sequences (E0+41, E0+80, F0+80,
    F4+90, F0+90+7F); fatal mode; stream split at every byte
    boundary of a 4-byte sequence; brand checks; no-new rejection.
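
As an illustration of the stream-split case in the coverage list, the following sketch splits one 4-byte sequence at each interior boundary and decodes it in two calls. It assumes spec-conforming globals and is not copied from tests/test_text_codec.js; all variable names are hypothetical:

```javascript
// U+1F600 encodes to the 4-byte sequence F0 9F 98 80.
const fourByteSeq = new Uint8Array([0xf0, 0x9f, 0x98, 0x80]);
const expectedText = "\u{1F600}";

// Split at every interior byte boundary and decode the two chunks in order.
const splitResults = [];
for (let split = 1; split < fourByteSeq.length; split++) {
  const streamDecoder = new TextDecoder();
  // First chunk: stream mode holds back the incomplete sequence.
  const head = streamDecoder.decode(fourByteSeq.subarray(0, split), { stream: true });
  // Second chunk: a call without { stream: true } completes and flushes.
  const tail = streamDecoder.decode(fourByteSeq.subarray(split));
  splitResults.push({ split, head, tail });
}
```

Each entry in `splitResults` should have an empty `head` and a `tail` equal to `expectedText`.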

🤖 Generated with Claude Code

hhugo and others added 5 commits May 2, 2026 15:57
Implements the WHATWG Encoding API's TextEncoder and TextDecoder
classes (UTF-8 only, the only encoding the spec actually requires)
and installs them on the global object from js_std_add_helpers,
alongside `console`, `print`, and `scriptArgs`.

TextEncoder:
  * encode(string?)         -> Uint8Array
  * encodeInto(string, dst) -> { read, written }
  * encoding                -> "utf-8"

TextDecoder:
  * new TextDecoder(label?, { fatal?, ignoreBOM? })
  * decode(input?, { stream? }) -> string
  * encoding / fatal / ignoreBOM accessors
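
A short usage sketch of the surface listed above, assuming spec-conforming globals; the variable names are illustrative:

```javascript
const enc = new TextEncoder();

// encode(): a missing argument encodes the empty string.
const emptyBytes = enc.encode();
const euroBytes = enc.encode("\u20AC"); // U+20AC -> E2 82 AC

// encodeInto(): reports UTF-16 code units read and bytes written.
// "a" + U+20AC needs 4 bytes; a 3-byte destination only has room for "a",
// because a partial 3-byte sequence is never written.
const smallDst = new Uint8Array(3);
const { read, written } = enc.encodeInto("a\u20AC", smallDst);

// TextDecoder options are reflected by the accessors.
const dec = new TextDecoder("utf-8", { fatal: true, ignoreBOM: true });
const roundTrip = dec.decode(euroBytes);
```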

decode() handles:
  * any TypedArray view or ArrayBuffer (BufferSource) as input,
  * UTF-8 BOM stripping (suppressed by ignoreBOM),
  * stream mode by saving up to 3 trailing bytes of an incomplete
    sequence and prepending them on the next call,
  * fatal mode by throwing TypeError on any encoding error
    (including a trailing partial sequence in non-stream mode),
  * non-fatal mode by emitting U+FFFD for each invalid byte.
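
The input, BOM, and fatal cases above might look like this from script. This is a sketch assuming spec-conforming globals, with hypothetical variable names:

```javascript
const bomHi = new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]); // BOM + "hi"

// Any BufferSource works: a typed-array view, its ArrayBuffer, or an offset view.
const fromView = new TextDecoder().decode(bomHi);
const fromBuffer = new TextDecoder().decode(bomHi.buffer);
const fromOffset = new TextDecoder().decode(new Uint8Array(bomHi.buffer, 3)); // view starts after the BOM

// ignoreBOM keeps the BOM in the output as U+FEFF.
const keptBom = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bomHi);

// fatal mode throws TypeError, including for a trailing partial sequence
// when the call is not in stream mode.
let fatalError = null;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array([0xe2, 0x82]));
} catch (e) {
  fatalError = e;
}
```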

The label parser accepts the WHATWG list of UTF-8 aliases
(case-insensitive, ASCII-whitespace trimmed); other encodings
throw RangeError, matching the spec.
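
A sketch of the label handling described above, assuming spec-conforming globals. The rejected label is deliberately a made-up string, so the example does not depend on which non-UTF-8 encodings a given runtime supports:

```javascript
// Labels are matched case-insensitively after trimming ASCII whitespace.
const utf8Labels = ["utf-8", "UTF8", " utf-8\n", "unicode-1-1-utf-8"];
const resolved = utf8Labels.map((label) => new TextDecoder(label).encoding);

// An unrecognized label is rejected with RangeError.
let labelError = null;
try {
  new TextDecoder("not-a-real-encoding");
} catch (e) {
  labelError = e;
}
```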

UTF-8 decoding reuses the existing utf8_decode / utf8_decode_len
helpers in cutils.h, so no new UTF-8 logic is introduced.
…ng destination

Per the WHATWG Encoding spec, encodeInto's first argument is converted
to a USVString before the second is checked for being a Uint8Array, so
the source's toString side effects must be observable even when the
destination is invalid.
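
The ordering can be sketched in plain JavaScript. `checkEncodeIntoArgs` below is a hypothetical stand-in for the argument handling, not the PR's C code, and `String()` only approximates the USVString conversion (it does not replace lone surrogates):

```javascript
// Hypothetical stand-in: convert the source first, then validate the destination.
function checkEncodeIntoArgs(source, destination) {
  const text = String(source); // runs source.toString() before any destination check
  if (!(destination instanceof Uint8Array)) {
    throw new TypeError("destination must be a Uint8Array");
  }
  return text;
}

// A source whose toString has an observable side effect.
const sideEffects = [];
const noisySource = {
  toString() {
    sideEffects.push("toString");
    return "abc";
  },
};

let orderingError = null;
try {
  checkEncodeIntoArgs(noisySource, {}); // invalid destination
} catch (e) {
  orderingError = e;
}
```

With this ordering, `sideEffects` records the `toString` call even though the call ends in a `TypeError`.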

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…coder

The "incomplete trailing sequence" branch only checked the lead byte's
declared length against remaining bytes, so a lead followed by an
out-of-range continuation (e.g. E0 41, E0 80, F0 80, F4 90) silently
dropped the offending byte and emitted a single U+FFFD. Per WHATWG, an
out-of-range continuation must produce U+FFFD and be re-read as a fresh
lead, yielding e.g. "U+FFFD U+0041" for E0 41 and two U+FFFD for E0 80.
Stream mode had the same issue: it would buffer bytes already known to
violate the continuation bounds.

Add a small helper that returns the first-continuation-byte bounds for
each lead (matching utf8_decode's acceptance set) and use it to walk the
available bytes; emit eagerly on the first out-of-range byte, and only
defer or flush when every available byte is a valid continuation.
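
A sketch of what such a helper could look like, written in JavaScript for illustration. The bounds follow the WHATWG UTF-8 decoder; the function names, return shape, and the `classifyTail` walk are assumptions and may differ from the PR's C helper:

```javascript
// Bounds for the FIRST continuation byte after a given lead byte, plus the
// number of continuation bytes needed. Returns null for a non-lead byte.
function firstContinuationBounds(lead) {
  if (lead >= 0xc2 && lead <= 0xdf) return { need: 1, lo: 0x80, hi: 0xbf };
  if (lead === 0xe0) return { need: 2, lo: 0xa0, hi: 0xbf }; // excludes overlong 3-byte forms
  if (lead === 0xed) return { need: 2, lo: 0x80, hi: 0x9f }; // excludes surrogates
  if (lead >= 0xe1 && lead <= 0xef) return { need: 2, lo: 0x80, hi: 0xbf };
  if (lead === 0xf0) return { need: 3, lo: 0x90, hi: 0xbf }; // excludes overlong 4-byte forms
  if (lead === 0xf4) return { need: 3, lo: 0x80, hi: 0x8f }; // excludes values above U+10FFFF
  if (lead >= 0xf1 && lead <= 0xf3) return { need: 3, lo: 0x80, hi: 0xbf };
  return null;
}

// Classify a trailing run that is shorter than its lead's declared length.
// "partial": every available byte is a valid continuation (defer in stream
//            mode, or flush as one U+FFFD).
// "invalid": a byte is out of range; the first `consumed` bytes become one
//            U+FFFD and decoding resumes at the offending byte.
function classifyTail(bytes) {
  const bounds = firstContinuationBounds(bytes[0]);
  if (bounds === null) return { kind: "invalid", consumed: 1 };
  let lo = bounds.lo;
  let hi = bounds.hi;
  for (let i = 1; i < bytes.length && i <= bounds.need; i++) {
    if (bytes[i] < lo || bytes[i] > hi) return { kind: "invalid", consumed: i };
    lo = 0x80; // later continuation bytes use the generic range
    hi = 0xbf;
  }
  return { kind: "partial", consumed: bytes.length };
}
```

For example, `classifyTail([0xe0, 0x41])` reports an invalid run that consumes only the lead, so `0x41` is re-read as a fresh byte, while `classifyTail([0xe0, 0xa0])` reports a partial sequence.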

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WHATWG Encoding's encode/encodeInto operate on USVStrings: lone
surrogates in the input are replaced with U+FFFD before UTF-8 encoding.
JS_ToCStringLen, however, keeps lone surrogates and emits them as their
3-byte CESU-8-like encoding (ED A0..BF XX), which is invalid UTF-8 and
not what the spec mandates.

In encode(), scan the JS_ToCStringLen output for ED A0..BF XX (a triple
that valid UTF-8 never produces) and rewrite each occurrence to
EF BF BD; the replacement is the same length so the output size is
unchanged. The common ASCII/BMP path stays a single allocation+copy.

In encodeInto(), the loop already calls utf8_decode per code point;
clamp surrogate code points (D800..DFFF) to U+FFFD before re-encoding.
The read counter naturally still credits 1 UTF-16 code unit for a lone
surrogate and 2 for a matched pair.
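
Both steps can be sketched in JavaScript for illustration. The function names are hypothetical and the code is not the PR's C implementation; the byte scan assumes its input is well-formed UTF-8 apart from encoded lone surrogates:

```javascript
// Same-length rewrite, in place: ED A0..BF XX -> EF BF BD.
// ED is never a continuation byte, so a byte-wise scan cannot match in the
// middle of another sequence.
function replaceEncodedSurrogates(bytes) {
  for (let i = 0; i + 2 < bytes.length; ) {
    if (bytes[i] === 0xed && bytes[i + 1] >= 0xa0 && bytes[i + 1] <= 0xbf) {
      bytes[i] = 0xef;
      bytes[i + 1] = 0xbf;
      bytes[i + 2] = 0xbd;
      i += 3;
    } else {
      i++;
    }
  }
  return bytes;
}

// Per-code-point clamp, as described for the encodeInto path.
function clampSurrogate(codePoint) {
  return codePoint >= 0xd800 && codePoint <= 0xdfff ? 0xfffd : codePoint;
}

// "a", encoded lone U+D800, "b" -> "a", U+FFFD, "b"
const patched = replaceEncodedSurrogates(new Uint8Array([0x61, 0xed, 0xa0, 0x80, 0x62]));
```

Because the replacement has the same length as the matched triple, the buffer never needs to grow or shrink.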

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cover the WHATWG Encoding behaviors the existing implementation aims at:
encoder ToString coercion, lone-surrogate replacement, encodeInto's
read/written semantics and partial-write rule, decoder label parsing,
BOM handling (default/ignoreBOM/middle/split), per-lead continuation
bounds and the "incomplete vs invalid trailing" distinction, fatal
mode, and stream split/flush behavior.

The classes are installed by js_std_add_helpers, which run-test262 does
not call, so `make test` (run-test262 in local mode) didn't see them.
Expose the install as a public js_std_add_text_codecs(ctx) and call it
from JS_NewCustomContext's local-mode setup so the new test runs under
the same harness as the rest of tests/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>