quickjs-libc: add TextEncoder and TextDecoder by hhugo · Pull Request #1463 · quickjs-ng/quickjs

hhugo · 2026-05-07T15:32:50Z

Disclaimer: the current state was generated with AI.

I found another implementation in nginx/njs@447d66d

The reason for pushing the feature is ocsigen/js_of_ocaml#2229.

Summary

Adds the WHATWG Encoding API's TextEncoder and TextDecoder (UTF-8
only — the only encoding the spec actually requires) to quickjs-libc,
installed on the global from js_std_add_helpers alongside console,
print, and scriptArgs. Reuses the existing utf8_decode/utf8_encode
helpers in cutils.h; no new UTF-8 machinery.

Behavior of note:

* TextEncoder.encode / encodeInto treat input as a USVString:
  lone surrogates are replaced with U+FFFD before UTF-8 encoding, so
  encode("\uD800") returns [EF BF BD], not the CESU-8-like
  [ED A0 80] that JS_ToCStringLen would otherwise emit.
* encodeInto stringifies the source argument before validating that
  the destination is a Uint8Array (matches WebIDL ordering), and
  stops short rather than writing a partial sequence that doesn't fit.
* TextDecoder.decode distinguishes truly partial sequences from
  ones that are already known to be invalid via per-lead
  continuation-byte bounds: [E0,41] decodes to U+FFFD U+0041 (not just
  U+FFFD), [F4,90] to U+FFFD U+FFFD, [E0,A0] to a single U+FFFD, etc.
  Stream mode no longer buffers bytes that already violate the
  continuation rules.
* Label parser accepts the WHATWG list of UTF-8 aliases
  (case-insensitive, ASCII-whitespace trimmed); other encodings throw
  RangeError.
* BOM, ignoreBOM, fatal, and stream modes follow the spec: BOM
  stripped at start of a non-stream run (or first stream call), kept
  mid-stream and with ignoreBOM; fatal throws on invalid bytes,
  including a partial sequence still pending on flush; stream defers
  only when every available byte is a valid continuation.
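The behaviors listed above can be checked from script. The following is a minimal sketch, assuming a runtime whose globals include spec-conforming TextEncoder and TextDecoder; the `codePoints` helper is only there for illustration and is not part of the PR.

```javascript
// Assumption: TextEncoder/TextDecoder are available as globals.
const enc = new TextEncoder();
const dec = new TextDecoder(); // defaults: utf-8, fatal = false, ignoreBOM = false

// Illustrative helper: render a string as "U+XXXX U+XXXX ..." so that
// replacement characters are visible.
const codePoints = (s) =>
  Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");

// A lone surrogate is encoded as U+FFFD (EF BF BD), not as ED A0 80.
const loneSurrogate = Array.from(enc.encode("\uD800"));

// E0 requires A0..BF next: 41 is out of range, so it is re-read as "A".
const invalidSecond = codePoints(dec.decode(new Uint8Array([0xE0, 0x41])));
// F4 requires 80..8F next: 90 is out of range and is itself not a valid lead.
const outOfRange = codePoints(dec.decode(new Uint8Array([0xF4, 0x90])));
// E0 A0 is a valid prefix that is merely cut short: one replacement character.
const truncated = codePoints(dec.decode(new Uint8Array([0xE0, 0xA0])));

// BOM handling: stripped by default, kept when ignoreBOM is set.
const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 0x41]);
const stripped = new TextDecoder().decode(withBom);
const kept = new TextDecoder("utf-8", { ignoreBOM: true }).decode(withBom);

console.log(loneSurrogate, invalidSecond, outOfRange, truncated);
console.log(JSON.stringify(stripped), JSON.stringify(kept));
```

Per the description, the expected values are [239, 191, 189], "U+FFFD U+0041", "U+FFFD U+FFFD", "U+FFFD", "A", and "\uFEFFA" respectively.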

js_std_add_text_codecs(JSContext *) is exposed as a public
quickjs-libc entry point and called from run-test262's local-mode
JS_NewCustomContext so the new tests run under make test.

Test plan

* make test: 0/60 errors with new tests/test_text_codec.js.
* qjs --module tests/test_text_codec.js: passes.
* Test coverage: encoder ToString coercion and lone-surrogate
  replacement; encodeInto read/written accounting,
  partial-write rule, argument-error ordering; decoder label
  parsing (aliases + whitespace + case), options dict, input types
  (Uint8Array/ArrayBuffer/Int8Array/offset views); BOM
  handling (stripped/middle/ignoreBOM/split-across-stream);
  partial-vs-invalid trailing sequences (E0+41, E0+80, F0+80,
  F4+90, F0+90+7F); fatal mode; stream split at every byte
  boundary of a 4-byte sequence; brand checks; no-new rejection.
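As an illustration of the "stream split at every byte boundary" case, here is a rough sketch of what such a check could look like. It is not taken from tests/test_text_codec.js; the function name `decodeSplitAt` and the chosen code point are assumptions made for the example, and it assumes global TextDecoder support.

```javascript
// U+1F600 encoded as a 4-byte UTF-8 sequence.
const fourBytes = new Uint8Array([0xF0, 0x9F, 0x98, 0x80]);

// Decode `bytes` in two stream chunks split at index k, then flush.
function decodeSplitAt(bytes, k) {
  const decoder = new TextDecoder();
  let out = decoder.decode(bytes.subarray(0, k), { stream: true });
  out += decoder.decode(bytes.subarray(k), { stream: true });
  out += decoder.decode(); // final call without stream: flushes pending bytes
  return out;
}

// Try every split point, including the degenerate ones (0 and length).
const splitResults = [];
for (let k = 0; k <= fourBytes.length; k++) {
  splitResults.push(decodeSplitAt(fourBytes, k));
}

console.log(splitResults.every((s) => s === "\u{1F600}"));
```

Every split point should yield the same single code point, because the decoder holds back the incomplete prefix until the remaining bytes arrive.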

🤖 Generated with Claude Code

Implements the WHATWG Encoding API's TextEncoder and TextDecoder classes (UTF-8 only, the only encoding the spec actually requires) and installs them on the global object from js_std_add_helpers, alongside `console`, `print`, and `scriptArgs`.

TextEncoder:
* encode(string?) -> Uint8Array
* encodeInto(string, dst) -> { read, written }
* encoding -> "utf-8"

TextDecoder:
* new TextDecoder(label?, { fatal?, ignoreBOM? })
* decode(input?, { stream? }) -> string
* encoding / fatal / ignoreBOM accessors

decode() handles:
* any TypedArray view or ArrayBuffer (BufferSource) as input,
* UTF-8 BOM stripping (suppressed by ignoreBOM),
* stream mode by saving up to 3 trailing bytes of an incomplete sequence and prepending them on the next call,
* fatal mode by throwing TypeError on any encoding error (including a trailing partial sequence in non-stream mode),
* non-fatal mode by emitting U+FFFD for each invalid byte.

The label parser accepts the WHATWG list of UTF-8 aliases (case-insensitive, ASCII-whitespace trimmed); other encodings throw RangeError, matching the spec.

UTF-8 decoding reuses the existing utf8_decode / utf8_decode_len helpers in cutils.h, so no new UTF-8 logic is introduced.
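The label handling described above might be modeled as follows. This is a sketch in JavaScript rather than the PR's C code; `normalizeUtf8Label` and `UTF8_LABELS` are hypothetical names, and the alias set is the UTF-8 row of the WHATWG Encoding label table.

```javascript
// UTF-8 labels from the WHATWG Encoding Standard.
const UTF8_LABELS = new Set([
  "unicode-1-1-utf-8",
  "unicode11utf8",
  "unicode20utf8",
  "utf-8",
  "utf8",
  "x-unicode20utf8",
]);

// Hypothetical helper: returns the canonical name or throws RangeError.
function normalizeUtf8Label(label) {
  // Trim ASCII whitespace only: TAB, LF, FF, CR, SPACE.
  const trimmed = `${label}`.replace(/^[\t\n\f\r ]+|[\t\n\f\r ]+$/g, "");
  // ASCII-only lowercasing, so non-ASCII look-alikes are never folded in.
  const lowered = trimmed.replace(/[A-Z]/g, (c) => c.toLowerCase());
  if (!UTF8_LABELS.has(lowered)) {
    throw new RangeError("unsupported encoding label: " + label);
  }
  return "utf-8";
}

console.log(normalizeUtf8Label(" UTF8\n"));          // accepted alias
console.log(normalizeUtf8Label("Unicode-1-1-UTF-8")); // accepted alias
```

A label such as "latin1", or "utf-8" padded with a non-breaking space (which is not ASCII whitespace), would throw RangeError in this model.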

…ng destination

Per the WHATWG Encoding spec, encodeInto's first argument is converted to a USVString before the second is checked for being a Uint8Array, so the source's toString side effects must be observable even when the destination is invalid.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
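To make the ordering concrete, here is a small pure-JavaScript model of encodeInto. It is only a sketch under stated assumptions, not the PR's C implementation: `encodeIntoSketch` and `LEAD_PREFIX` are hypothetical names, and the destination check uses `instanceof` for simplicity. It also models the stop-short rule and the read/written accounting mentioned in the PR description.

```javascript
// Lead-byte prefix for a UTF-8 sequence of the given length (index = length).
const LEAD_PREFIX = [0x00, 0x00, 0xC0, 0xE0, 0xF0];

function encodeIntoSketch(source, dest) {
  // 1. Stringify first, so toString() side effects always happen.
  const text = `${source}`;
  // 2. Only then validate the destination.
  if (!(dest instanceof Uint8Array)) {
    throw new TypeError("destination must be a Uint8Array");
  }
  let read = 0;    // UTF-16 code units consumed
  let written = 0; // bytes stored in dest
  for (const ch of text) { // iterates by code point; a lone surrogate is its own item
    let cp = ch.codePointAt(0);
    if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD; // USVString conversion
    const len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    if (written + len > dest.length) break; // stop short: no partial sequence
    dest[written++] = LEAD_PREFIX[len] | (cp >> (6 * (len - 1)));
    for (let shift = 6 * (len - 2); shift >= 0; shift -= 6) {
      dest[written++] = 0x80 | ((cp >> shift) & 0x3F);
    }
    read += ch.length; // 1 for BMP or lone surrogate, 2 for a matched pair
  }
  return { read, written };
}

// toString() runs even though the destination is rejected afterwards.
const calls = [];
const noisy = { toString() { calls.push("toString"); return "hi"; } };
let caught = null;
try {
  encodeIntoSketch(noisy, new ArrayBuffer(8)); // not a Uint8Array
} catch (e) {
  caught = e;
}
console.log(calls, caught instanceof TypeError);
```

In this model, encoding "a€b" into a 3-byte buffer writes only "a" and reports { read: 1, written: 1 }, since the 3-byte "€" no longer fits.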

…coder

The "incomplete trailing sequence" branch only checked the lead byte's declared length against remaining bytes, so a lead followed by an out-of-range continuation (e.g. E0 41, E0 80, F0 80, F4 90) silently dropped the offending byte and emitted a single U+FFFD. Per WHATWG, an out-of-range continuation must produce U+FFFD and be re-read as a fresh lead, yielding e.g. "U+FFFD U+0041" for E0 41 and two U+FFFD for E0 80.

Stream mode had the same issue: it would buffer bytes already known to violate the continuation bounds.

Add a small helper that returns the first-continuation-byte bounds for each lead (matching utf8_decode's acceptance set) and use it to walk the available bytes; emit eagerly on the first out-of-range byte, and only defer or flush when every available byte is a valid continuation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
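The per-lead bounds idea can be sketched in JavaScript as below. This is a model of the non-fatal, non-stream path only, written for illustration; `leadInfo` and `decodeLossy` are hypothetical names and do not correspond to the helper added in the C code. The bounds follow the well-formed UTF-8 ranges used by the WHATWG UTF-8 decoder.

```javascript
// For a lead byte, return how many continuation bytes follow and the
// allowed range of the FIRST one; later continuations are always 80..BF.
function leadInfo(lead) {
  if (lead >= 0xC2 && lead <= 0xDF) return { needed: 1, lo: 0x80, hi: 0xBF };
  if (lead === 0xE0) return { needed: 2, lo: 0xA0, hi: 0xBF }; // no overlongs
  if (lead === 0xED) return { needed: 2, lo: 0x80, hi: 0x9F }; // no surrogates
  if (lead >= 0xE1 && lead <= 0xEF) return { needed: 2, lo: 0x80, hi: 0xBF };
  if (lead === 0xF0) return { needed: 3, lo: 0x90, hi: 0xBF }; // no overlongs
  if (lead === 0xF4) return { needed: 3, lo: 0x80, hi: 0x8F }; // <= U+10FFFF
  if (lead >= 0xF1 && lead <= 0xF3) return { needed: 3, lo: 0x80, hi: 0xBF };
  return null; // 80..C1 and F5..FF never start a sequence
}

function decodeLossy(bytes) {
  let out = "";
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i++];
    if (lead < 0x80) { out += String.fromCharCode(lead); continue; }
    const info = leadInfo(lead);
    if (info === null) { out += "\uFFFD"; continue; }
    let cp = lead & (0x3F >> info.needed); // payload bits of the lead byte
    let lo = info.lo, hi = info.hi, ok = true;
    for (let k = 0; k < info.needed; k++) {
      // Out-of-range or missing byte: stop WITHOUT consuming it, so an
      // offending byte is re-read as a fresh lead on the next iteration.
      if (i >= bytes.length || bytes[i] < lo || bytes[i] > hi) { ok = false; break; }
      cp = (cp << 6) | (bytes[i++] & 0x3F);
      lo = 0x80; hi = 0xBF;
    }
    out += ok ? String.fromCodePoint(cp) : "\uFFFD";
  }
  return out;
}

console.log(decodeLossy(new Uint8Array([0xE0, 0x41])) === "\uFFFDA");
console.log(decodeLossy(new Uint8Array([0xE0, 0x80])) === "\uFFFD\uFFFD");
```

The key design point is that the failing byte is left unconsumed, which is what turns E0 41 into U+FFFD followed by "A" instead of a single U+FFFD.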

WHATWG Encoding's encode/encodeInto operate on USVStrings: lone surrogates in the input are replaced with U+FFFD before UTF-8 encoding. JS_ToCStringLen, however, keeps lone surrogates and emits them as their 3-byte CESU-8-like encoding (ED A0..BF XX), which is invalid UTF-8 and not what the spec mandates.

In encode(), scan the JS_ToCStringLen output for ED A0..BF XX (a triple that valid UTF-8 never produces) and rewrite each occurrence to EF BF BD; the replacement is the same length so the output size is unchanged. The common ASCII/BMP path stays a single allocation+copy.

In encodeInto(), the loop already calls utf8_decode per code point; clamp surrogate code points (D800..DFFF) to U+FFFD before re-encoding. The read counter naturally still credits 1 UTF-16 code unit for a lone surrogate and 2 for a matched pair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
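The same-length rewrite used by encode() can be illustrated with a short JavaScript sketch operating on a byte array. `replaceEncodedSurrogates` is a hypothetical name, and the input is assumed to be well-formed UTF-8 apart from the 3-byte surrogate encodings, as the commit message describes for JS_ToCStringLen output.

```javascript
// Rewrite every ED A0..BF XX triple (an encoded surrogate) to EF BF BD.
// Returns a copy; the length never changes because both forms are 3 bytes.
function replaceEncodedSurrogates(bytes) {
  const out = Uint8Array.from(bytes);
  let i = 0;
  while (i + 2 < out.length) {
    // ED can only appear as a lead byte (continuations are 80..BF), so a
    // byte-by-byte scan cannot match in the middle of another sequence.
    if (out[i] === 0xED && out[i + 1] >= 0xA0 && out[i + 1] <= 0xBF) {
      out[i] = 0xEF;
      out[i + 1] = 0xBF;
      out[i + 2] = 0xBD;
      i += 3;
    } else {
      i++;
    }
  }
  return out;
}

// "a", lone U+D800 in CESU-8-like form, "b"
const fixed = replaceEncodedSurrogates([0x61, 0xED, 0xA0, 0x80, 0x62]);
console.log(Array.from(fixed, (b) => b.toString(16)).join(" "));
```

Sequences such as ED 9F BF (U+D7FF, the last code point before the surrogate range) are left untouched, because their second byte falls below A0.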

Cover the WHATWG Encoding behaviors the existing implementation aims at: encoder ToString coercion, lone-surrogate replacement, encodeInto's read/written semantics and partial-write rule, decoder label parsing, BOM handling (default/ignoreBOM/middle/split), per-lead continuation bounds and the "incomplete vs invalid trailing" distinction, fatal mode, and stream split/flush behavior.

The classes are installed by js_std_add_helpers, which run-test262 does not call, so `make test` (run-test262 in local mode) didn't see them. Expose the install as a public js_std_add_text_codecs(ctx) and call it from JS_NewCustomContext's local-mode setup so the new test runs under the same harness as the rest of tests/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hhugo and others added 5 commits May 2, 2026 15:57

hhugo wants to merge 5 commits into quickjs-ng:master from hhugo:feat/text-encoder-decoder
