-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
fix(es/ast): Fix unicode unpaired surrogates handling #11144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Warning Review the following alerts detected in dependencies. According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.
|
f3c6816
to
6d05f37
Compare
CodSpeed Performance ReportMerging #11144 will degrade performances by 2.22%Comparing Summary
Benchmarks breakdown
Footnotes |
ad91796
to
aede87f
Compare
**Description:** This PR fixed an issue related to lone surrogates handling in Rust. This fix's credits all go to Oxc team swc-project#10978 (comment). What I'm doing is porting the fix that was made in Oxc and make it working under SWC. The problem is related to the fundamental difference between how Rust and JavaScript handle Unicode, especially lone surrogates. **JavaScript's Unicode Model** ```javascript // JavaScript allows this - lone surrogates are stored in UTF-16 let str = "\uD800"; // High surrogate alone - technically invalid Unicode let obj = { "\uD800": "value" }; // Works fine in JS ``` JavaScript uses UTF-16 internally and tolerates invalid Unicode sequences: - Strings are UTF-16 code unit sequences, not Unicode scalar sequences - Lone surrogates (U+D800-U+DFFF) are allowed and preserved - No validation that surrogates come in proper high/low pairs - Engine just stores the raw UTF-16 code units **Rust's Unicode Model** ```rust // This CANNOT exist in Rust: let s = "\u{D800}"; // ❌ COMPILE ERROR - not a valid Unicode scalar let c: char = '\u{D800}'; // ❌ COMPILE ERROR - char excludes surrogates ``` Rust enforces strict Unicode validity: - String is UTF-8 and must contain valid Unicode scalar values - char represents Unicode scalar values (U+0000-U+D7FF, U+E000-U+10FFFF) - Surrogate code points (U+D800-U+DFFF) are explicitly excluded - No way to represent lone surrogates in Rust's standard string types 1. AST Structure: Added `lone_surrogates: bool` field to `Str` and `TplElement` structs to track when strings contain lone surrogates 2. Encoding Strategy: Lone surrogates are encoded using \u{FFFD} (replacement character) followed by the original hex digits for internal representation 3. Code Generation: Modified string output to properly escape lone surrogates back to \uXXXX format during codegen 4. Test: Also fixed some cases related to member expression optimizations and string concatenation optimizations 1. Add support for serializing and deserializing literals with lone surrogates in `swc_estree_compat` 2. Reflect AST changes in `binding` crates Breaks the AST by adding `lone_surrogates` field to `Str` and `TplElement` and breaks the `value` and `cooked` respectly in `Str` and `TplElement`. Both of the field is using `\u{FFFD}` (Replacement Character) as an escape if `lone_surrogates` set to `true`. To consume the real value, you need to first check if `lone_surrogates` is `true`, then unescape it by removing the char and construct it with the four trailing hexs(from `\u{FFFD}D800` to `\uD800`). **Related issue:** - Closes swc-project#10978 - Closes swc-project#10353 Fixed a regression of swc-project#7678
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves handling of unpaired Unicode surrogates in JavaScript/TypeScript by replacing the implicit lone_surrogates: bool
marker with an explicit Wtf8Atom
type. This breaking change requires users to call to_string_lossy()
or as_str()
to convert WTF-8 atoms to valid UTF-8, making surrogate handling more explicit and preventing silent data loss.
Key Changes
- Replaced
Atom
withWtf8Atom
for string literal values and template element cooked values - Added WTF-8 handling utilities and conversion functions throughout the codebase
- Updated all consumers to handle potential surrogate pair scenarios explicitly
Reviewed Changes
Copilot reviewed 152 out of 205 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
Multiple package.json files |
Fixed missing newline before closing brace |
packages/helpers/package.json |
Added export entry for new TypeScript helper |
crates/swc_typescript/src/fast_dts/ |
Updated to handle Wtf8Atom with fallback conversion |
crates/swc_ecma_utils/src/unicode.rs |
Added Unicode surrogate pair utilities |
crates/swc_ecma_utils/src/lib.rs |
Added WTF-8 string handling functions |
crates/swc_ecma_transforms_*/ |
Updated string operations to use WTF-8 APIs |
crates/swc_ecma_minifier/ |
Updated string optimization to handle surrogates |
crates/swc_ecma_parser/ |
Core lexer changes to produce Wtf8Atom tokens |
Test files | Updated expected outputs and added surrogate tests |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Description:
This PR reapplied #10987 and replaced
lone_surrogates: bool
mark withWtf8Atom
.lone_surrogates
introduced in #10987 is more of an implicit mark, whereas the use ofWtf8Atom
makes it more explicit where users need to explicitly callto_string_lossy
to get the lossy UTF-8 result, making a huge difference from the original implementation where users need to convert\uFFFDxxx
into the correct unicode.BREAKING CHANGE:
Both
cooked
inTemplateElement
andvalue
inStringLiteral
are replaced withWtf8Atom
introduced in #11104.Wtf8Atom
does not expose ato_string
method like the oldAtom
does. Internally, it stores the code points of the characters. You can callcode_points()
to get an iterator of the code points or callto_string_lossy()
to get an lossy string in which all unpaired surrogates are replaced withU+FFFD
(Replacement Character).Related issue (if exists):