quickjs-libc: add TextEncoder and TextDecoder #1463
Open
hhugo wants to merge 5 commits into
Implements the WHATWG Encoding API's TextEncoder and TextDecoder
classes (UTF-8 only, the only encoding the spec actually requires)
and installs them on the global object from js_std_add_helpers,
alongside `console`, `print`, and `scriptArgs`.
TextEncoder:
* encode(string?) -> Uint8Array
* encodeInto(string, dst) -> { read, written }
* encoding -> "utf-8"
TextDecoder:
* new TextDecoder(label?, { fatal?, ignoreBOM? })
* decode(input?, { stream? }) -> string
* encoding / fatal / ignoreBOM accessors
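A minimal usage sketch of the surface listed above. It assumes a runtime where `TextEncoder` and `TextDecoder` are available as globals (a build with this PR, or any other WHATWG-conformant implementation); the sample string is illustrative only.

```javascript
// Assumes TextEncoder/TextDecoder are installed as globals.
const encoder = new TextEncoder();
const decoder = new TextDecoder(); // label defaults to "utf-8"

// encode(string?) -> Uint8Array
const bytes = encoder.encode("h\u00e9"); // "hé" encodes to 68 C3 A9
const empty = encoder.encode();          // no argument encodes the empty string

// encodeInto(string, dst) -> { read, written }
// `read` counts UTF-16 code units consumed, `written` counts bytes stored.
const dst = new Uint8Array(8);
const result = encoder.encodeInto("h\u00e9", dst);

// decode(input?, { stream? }) -> string
const text = decoder.decode(bytes);

// Accessors
const info = [encoder.encoding, decoder.encoding, decoder.fatal, decoder.ignoreBOM];
```

Here `result` is `{ read: 2, written: 3 }` and `text` round-trips to the original string.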
decode() handles:
* any TypedArray view or ArrayBuffer (BufferSource) as input,
* UTF-8 BOM stripping (suppressed by ignoreBOM),
* stream mode by saving up to 3 trailing bytes of an incomplete
sequence and prepending them on the next call,
* fatal mode by throwing TypeError on any encoding error
(including a trailing partial sequence in non-stream mode),
* non-fatal mode by emitting U+FFFD for each invalid byte.
The label parser accepts the WHATWG list of UTF-8 aliases
(case-insensitive, ASCII-whitespace trimmed); other encodings
throw RangeError, matching the spec.
UTF-8 decoding reuses the existing utf8_decode / utf8_decode_len
helpers in cutils.h, so no new UTF-8 logic is introduced.
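The decode() behaviors described above can be exercised roughly as follows. This is a sketch against the standard API, not the PR's test file; the helper names `utf8` and `errorKind` and the label `"not-an-encoding"` are made up for illustration.

```javascript
// Hypothetical helpers for this example only.
const utf8 = (...b) => new Uint8Array(b);
function errorKind(fn) {
  try { fn(); return "none"; }
  catch (e) {
    return e instanceof TypeError ? "TypeError"
         : e instanceof RangeError ? "RangeError" : "other";
  }
}

// BOM is stripped by default and kept when ignoreBOM is set.
const stripped = new TextDecoder().decode(utf8(0xEF, 0xBB, 0xBF, 0x41));
const kept = new TextDecoder("utf-8", { ignoreBOM: true })
  .decode(utf8(0xEF, 0xBB, 0xBF, 0x41));

// Any BufferSource works: an ArrayBuffer or an offset view into one.
const buffer = utf8(0x78, 0x6F, 0x6B, 0x78).buffer; // "xokx"
const fromBuffer = new TextDecoder().decode(buffer);
const fromView = new TextDecoder().decode(new Uint8Array(buffer, 1, 2));

// Non-fatal mode substitutes U+FFFD for the invalid byte FF.
const replaced = new TextDecoder().decode(utf8(0x41, 0xFF, 0x42));

// Fatal mode throws TypeError; an unknown label throws RangeError.
const fatalKind = errorKind(() =>
  new TextDecoder("utf-8", { fatal: true }).decode(utf8(0x41, 0xFF)));
const labelKind = errorKind(() => new TextDecoder("not-an-encoding"));

// Labels are trimmed and matched case-insensitively.
const alias = new TextDecoder(" UTF8 ").encoding;
```

With these inputs `stripped` is `"A"`, `kept` is `"\uFEFFA"`, `fromView` is `"ok"`, `replaced` is `"A\uFFFDB"`, and `alias` is `"utf-8"`.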
…ng destination

Per the WHATWG Encoding spec, encodeInto's first argument is converted to a USVString before the second is checked for being a Uint8Array, so the source's toString side effects must be observable even when the destination is invalid.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
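The argument ordering described in this commit can be modeled in plain JavaScript. `encodeIntoOrdered` is a hypothetical wrapper written for illustration; it is not the C code from this PR, and it does not model the lone-surrogate part of the USVString conversion.

```javascript
// Hypothetical wrapper: coerce the source first, validate the destination second.
function encodeIntoOrdered(encoder, source, destination) {
  const text = `${source}`; // step 1: ToString, so user toString() runs here
  if (!(destination instanceof Uint8Array)) {
    // step 2: only now is the destination checked
    throw new TypeError("destination must be a Uint8Array");
  }
  return encoder.encodeInto(text, destination);
}

// The side effect is observable even though the destination is invalid.
const calls = [];
const source = { toString() { calls.push("toString"); return "abc"; } };
let thrown = null;
try {
  encodeIntoOrdered(new TextEncoder(), source, {});
} catch (e) {
  thrown = e;
}
```

After this runs, `calls` holds one `"toString"` entry and `thrown` is a `TypeError`, which is the observable ordering the commit message refers to.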
…coder

The "incomplete trailing sequence" branch only checked the lead byte's declared length against remaining bytes, so a lead followed by an out-of-range continuation (e.g. E0 41, E0 80, F0 80, F4 90) silently dropped the offending byte and emitted a single U+FFFD. Per WHATWG, an out-of-range continuation must produce U+FFFD and be re-read as a fresh lead, yielding e.g. "U+FFFD U+0041" for E0 41 and two U+FFFD for E0 80.

Stream mode had the same issue: it would buffer bytes already known to violate the continuation bounds.

Add a small helper that returns the first-continuation-byte bounds for each lead (matching utf8_decode's acceptance set) and use it to walk the available bytes; emit eagerly on the first out-of-range byte, and only defer or flush when every available byte is a valid continuation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
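A sketch of the per-lead bounds idea, written in JavaScript rather than the PR's C. The function names are hypothetical, and the table follows the well-formed UTF-8 ranges from the WHATWG and Unicode specifications; whether it matches `utf8_decode`'s acceptance set exactly is an assumption.

```javascript
// Hypothetical helper: [lo, hi] bounds for the first continuation byte after
// `lead`, or null when `lead` cannot start a multi-byte sequence.
function firstContinuationBounds(lead) {
  if (lead >= 0xC2 && lead <= 0xDF) return [0x80, 0xBF];
  if (lead === 0xE0) return [0xA0, 0xBF]; // excludes overlong 3-byte forms
  if (lead === 0xED) return [0x80, 0x9F]; // excludes surrogates
  if (lead >= 0xE1 && lead <= 0xEF) return [0x80, 0xBF];
  if (lead === 0xF0) return [0x90, 0xBF]; // excludes overlong 4-byte forms
  if (lead >= 0xF1 && lead <= 0xF3) return [0x80, 0xBF];
  if (lead === 0xF4) return [0x80, 0x8F]; // stays at or below U+10FFFF
  return null;
}

// Total sequence length for a lead byte accepted above.
const sequenceLength = (lead) => (lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2);

// Classify the bytes at the end of the input: "invalid" means emit U+FFFD
// right away, "incomplete" means every available byte is a valid continuation
// so the tail may be deferred (stream mode) or flushed as one U+FFFD.
function classifyTail(tail) {
  const bounds = firstContinuationBounds(tail[0]);
  if (bounds === null) return "invalid";
  for (let i = 1; i < tail.length; i++) {
    const [lo, hi] = i === 1 ? bounds : [0x80, 0xBF];
    if (tail[i] < lo || tail[i] > hi) return "invalid";
  }
  return tail.length < sequenceLength(tail[0]) ? "incomplete" : "complete";
}

// The observable results through the standard API.
const decode = (...b) => new TextDecoder().decode(new Uint8Array(b));
const e041 = decode(0xE0, 0x41); // U+FFFD U+0041
const f490 = decode(0xF4, 0x90); // U+FFFD U+FFFD
const e0a0 = decode(0xE0, 0xA0); // single U+FFFD
```

Under this table `classifyTail([0xE0, 0x41])` and `classifyTail([0xF4, 0x90])` report `"invalid"`, while `classifyTail([0xE0, 0xA0])` reports `"incomplete"`.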
WHATWG Encoding's encode/encodeInto operate on USVStrings: lone surrogates in the input are replaced with U+FFFD before UTF-8 encoding. JS_ToCStringLen, however, keeps lone surrogates and emits them as their 3-byte CESU-8-like encoding (ED A0..BF XX), which is invalid UTF-8 and not what the spec mandates.

In encode(), scan the JS_ToCStringLen output for ED A0..BF XX (a triple that valid UTF-8 never produces) and rewrite each occurrence to EF BF BD; the replacement is the same length so the output size is unchanged. The common ASCII/BMP path stays a single allocation+copy.

In encodeInto(), the loop already calls utf8_decode per code point; clamp surrogate code points (D800..DFFF) to U+FFFD before re-encoding. The read counter naturally still credits 1 UTF-16 code unit for a lone surrogate and 2 for a matched pair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
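The same-length rewrite described for encode() might look like this in JavaScript. `replaceSurrogateTriples` is a hypothetical name, and the sketch works on an already-encoded byte array instead of the `JS_ToCStringLen` output the C code scans.

```javascript
// Hypothetical helper: rewrite each CESU-8-style surrogate triple
// (ED A0..BF XX) to EF BF BD in place. The length never changes.
function replaceSurrogateTriples(bytes) {
  let i = 0;
  while (i + 2 < bytes.length) {
    if (bytes[i] === 0xED && bytes[i + 1] >= 0xA0 && bytes[i + 1] <= 0xBF) {
      bytes[i] = 0xEF;
      bytes[i + 1] = 0xBF;
      bytes[i + 2] = 0xBD;
      i += 3;
    } else {
      i += 1; // ED is never a continuation byte, so stepping by one is safe
    }
  }
  return bytes;
}

// "a", a lone U+D800 in CESU-8-like form, then "b".
const patched = replaceSurrogateTriples(
  new Uint8Array([0x61, 0xED, 0xA0, 0x80, 0x62]));

// ED 9F BF is valid UTF-8 for U+D7FF and must be left alone.
const untouched = replaceSurrogateTriples(new Uint8Array([0xED, 0x9F, 0xBF]));

// The standard API gives the same bytes for the lone surrogate.
const viaEncoder = new TextEncoder().encode("a\uD800b");

// A matched pair counts as 2 UTF-16 code units read and 4 bytes written.
const pair = new TextEncoder().encodeInto("\uD83D\uDE00", new Uint8Array(4));
```

`patched` and `viaEncoder` both come out as `61 EF BF BD 62`, and `pair` is `{ read: 2, written: 4 }`.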
Cover the WHATWG Encoding behaviors the existing implementation aims at: encoder ToString coercion, lone-surrogate replacement, encodeInto's read/written semantics and partial-write rule, decoder label parsing, BOM handling (default/ignoreBOM/middle/split), per-lead continuation bounds and the "incomplete vs invalid trailing" distinction, fatal mode, and stream split/flush behavior.

The classes are installed by js_std_add_helpers, which run-test262 does not call, so `make test` (run-test262 in local mode) didn't see them. Expose the install as a public js_std_add_text_codecs(ctx) and call it from JS_NewCustomContext's local-mode setup so the new test runs under the same harness as the rest of tests/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Disclaimer: the current state was generated with AI.
I found another implementation in nginx/njs@447d66d
The reason for pushing the feature is ocsigen/js_of_ocaml#2229.
Related to #1027 (reply in thread).
Summary

Adds the WHATWG Encoding API's `TextEncoder` and `TextDecoder` (UTF-8 only, the only encoding the spec actually requires) to `quickjs-libc`, installed on the global from `js_std_add_helpers` alongside `console`, `print`, and `scriptArgs`. Reuses the existing `utf8_decode`/`utf8_encode` helpers in `cutils.h`; no new UTF-8 machinery.

Behavior of note:
* `TextEncoder.encode`/`encodeInto` treat input as a USVString: lone surrogates are replaced with U+FFFD before UTF-8 encoding, so `encode("\uD800")` returns `[EF BF BD]`, not the CESU-8-like `[ED A0 80]` that `JS_ToCStringLen` would otherwise emit.
* `encodeInto` stringifies the source argument before validating that the destination is a `Uint8Array` (matches WebIDL ordering), and stops short rather than writing a partial sequence that doesn't fit.
* `TextDecoder.decode` distinguishes truly partial sequences from ones that are already known invalid via per-lead continuation-byte bounds: `[E0,41]` decodes to `U+FFFD U+0041` (not just `U+FFFD`), `[F4,90]` to `U+FFFD U+FFFD`, `[E0,A0]` to a single `U+FFFD`, etc. Stream mode no longer buffers bytes that already violate the continuation rules.
* The label parser accepts the WHATWG list of UTF-8 aliases (case-insensitive, ASCII-whitespace trimmed); other encodings throw `RangeError`. `ignoreBOM`, `fatal`, and `stream` modes follow the spec: BOM stripped at start of a non-stream run (or first stream call), kept mid-stream and with `ignoreBOM`; fatal throws on invalid bytes including a partial pending on flush; stream defers only when every available byte is a valid continuation.
* `js_std_add_text_codecs(JSContext *)` is exposed as a public `quickjs-libc` entry point and called from `run-test262`'s local-mode `JS_NewCustomContext` so the new tests run under `make test`.

Test plan
* `make test`: 0/60 errors with new `tests/test_text_codec.js`.
* `qjs --module tests/test_text_codec.js`: passes.
* The new test covers encoder ToString coercion and lone-surrogate replacement; `encodeInto` `read`/`written` accounting, partial-write rule, argument-error ordering; decoder label parsing (aliases + whitespace + case), options dict, input types (`Uint8Array`/`ArrayBuffer`/`Int8Array`/offset views); BOM handling (stripped/middle/`ignoreBOM`/split-across-stream); partial-vs-invalid trailing sequences (E0+41, E0+80, F0+80, F4+90, F0+90+7F); fatal mode; stream split at every byte boundary of a 4-byte sequence; brand checks; no-`new` rejection.

🤖 Generated with Claude Code
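For reference, the "stream split at every byte boundary of a 4-byte sequence" case from the test plan can be sketched against the standard API like this. It is not copied from `tests/test_text_codec.js`; `decodeSplit` is a hypothetical helper and U+1F600 is an arbitrary 4-byte code point.

```javascript
// F0 9F 98 80 is the UTF-8 encoding of U+1F600.
const sequence = [0xF0, 0x9F, 0x98, 0x80];

// Hypothetical helper: feed the bytes in two chunks split at `splitAt`,
// then flush with a final non-stream call.
function decodeSplit(bytes, splitAt) {
  const decoder = new TextDecoder();
  let out = decoder.decode(new Uint8Array(bytes.slice(0, splitAt)), { stream: true });
  out += decoder.decode(new Uint8Array(bytes.slice(splitAt)), { stream: true });
  out += decoder.decode(); // flush: nothing is pending, so this adds nothing
  return out;
}

// Every split point should reassemble the same single code point.
const results = [1, 2, 3].map((n) => decodeSplit(sequence, n));

// While the sequence is still incomplete, stream mode emits nothing.
const pendingDecoder = new TextDecoder();
const pending = pendingDecoder.decode(new Uint8Array([0xF0, 0x9F]), { stream: true });
```

Each entry of `results` is the one-code-point string `"\u{1F600}"`, and `pending` is the empty string. Flushing `pendingDecoder` at this point would, per the spec, report the unfinished sequence as an error (U+FFFD, or a `TypeError` in fatal mode).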