fix(es/ast): Fix unicode unpaired surrogates handling #11144

h-a-n-a · 2025-10-11T10:31:46Z

Description:

This PR reapplied #10987 and replaced lone_surrogates: bool mark with Wtf8Atom. lone_surrogates introduced in #10987 is more of an implicit mark, whereas the use of Wtf8Atom makes it more explicit where users need to explicitly call to_string_lossy to get the lossy UTF-8 result, making a huge difference from the original implementation where users need to convert \uFFFDxxx into the correct unicode.

BREAKING CHANGE:

Both cooked in TemplateElement and value in StringLiteral are replaced with Wtf8Atom introduced in #11104.

Wtf8Atom does not expose a to_string method like the old Atom does. Internally, it stores the code points of the characters. You can call code_points() to get an iterator of the code points or call to_string_lossy() to get an lossy string in which all unpaired surrogates are replaced with U+FFFD(Replacement Character).

Related issue (if exists):

changeset-bot · 2025-10-11T10:31:51Z

⚠️ No Changeset found

Latest commit: e9740e5

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

socket-security · 2025-10-11T10:32:58Z

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Action	Severity	Alert (click "▶" to expand/collapse)
Warn		`[email protected]` has a Critical CVE. CVE: GHSA-fjxv-7rqg-78g4 form-data uses unsafe random function in form-data for choosing boundary (CRITICAL) Affected versions: < 2.5.4; >= 3.0.0 < 3.0.4; >= 4.0.0 < 4.0.4 Patched version: 2.5.4 From: `?` → `npm/[email protected]` ℹ Read more on: This package \| This alert \| What is a critical CVE? Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at `[email protected]`. Suggestion: Remove or replace dependencies that include known critical CVEs. Consumers can use dependency overrides or npm audit fix --force to remove vulnerable dependencies. Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment `@SocketSecurity ignore npm/[email protected]`. You can also ignore all packages with `@SocketSecurity ignore-all`. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.
Warn		`[email protected]` has Obfuscated code. Confidence: 0.96 Location: Package overview From: `?` → `npm/[email protected]` → `npm/[email protected]` ℹ Read more on: This package \| This alert \| What is obfuscated code? Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at `[email protected]`. Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code. Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment `@SocketSecurity ignore npm/[email protected]`. You can also ignore all packages with `@SocketSecurity ignore-all`. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

View full report

codspeed-hq · 2025-10-15T09:41:17Z

CodSpeed Performance Report

Merging #11144 will degrade performances by 2.22%

_{Comparing h-a-n-a:surrogates (e9740e5) with main (101c3b7)¹}

Summary

⚡ 1 improvement
❌ 1 regression
✅ 138 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

	Benchmark	`BASE`	`HEAD`	Change
❌	`es/parser/cal-com`	59.2 ms	60.6 ms	-2.22%
⚡	`es/transform/baseline/common_reserved_word`	285 µs	278.9 µs	+2.2%

No successful run was found on main (3ef7ce0) during the generation of this report, so 101c3b7 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report. ↩

**Description:** This PR fixed an issue related to lone surrogates handling in Rust. This fix's credits all go to Oxc team swc-project#10978 (comment). What I'm doing is porting the fix that was made in Oxc and make it working under SWC. The problem is related to the fundamental difference between how Rust and JavaScript handle Unicode, especially lone surrogates. **JavaScript's Unicode Model** ```javascript // JavaScript allows this - lone surrogates are stored in UTF-16 let str = "\uD800"; // High surrogate alone - technically invalid Unicode let obj = { "\uD800": "value" }; // Works fine in JS ``` JavaScript uses UTF-16 internally and tolerates invalid Unicode sequences: - Strings are UTF-16 code unit sequences, not Unicode scalar sequences - Lone surrogates (U+D800-U+DFFF) are allowed and preserved - No validation that surrogates come in proper high/low pairs - Engine just stores the raw UTF-16 code units **Rust's Unicode Model** ```rust // This CANNOT exist in Rust: let s = "\u{D800}"; // ❌ COMPILE ERROR - not a valid Unicode scalar let c: char = '\u{D800}'; // ❌ COMPILE ERROR - char excludes surrogates ``` Rust enforces strict Unicode validity: - String is UTF-8 and must contain valid Unicode scalar values - char represents Unicode scalar values (U+0000-U+D7FF, U+E000-U+10FFFF) - Surrogate code points (U+D800-U+DFFF) are explicitly excluded - No way to represent lone surrogates in Rust's standard string types 1. AST Structure: Added `lone_surrogates: bool` field to `Str` and `TplElement` structs to track when strings contain lone surrogates 2. Encoding Strategy: Lone surrogates are encoded using \u{FFFD} (replacement character) followed by the original hex digits for internal representation 3. Code Generation: Modified string output to properly escape lone surrogates back to \uXXXX format during codegen 4. Test: Also fixed some cases related to member expression optimizations and string concatenation optimizations 1. Add support for serializing and deserializing literals with lone surrogates in `swc_estree_compat` 2. Reflect AST changes in `binding` crates Breaks the AST by adding `lone_surrogates` field to `Str` and `TplElement` and breaks the `value` and `cooked` respectly in `Str` and `TplElement`. Both of the field is using `\u{FFFD}` (Replacement Character) as an escape if `lone_surrogates` set to `true`. To consume the real value, you need to first check if `lone_surrogates` is `true`, then unescape it by removing the char and construct it with the four trailing hexs(from `\u{FFFD}D800` to `\uD800`). **Related issue:** - Closes swc-project#10978 - Closes swc-project#10353 Fixed a regression of swc-project#7678

socket-security · 2025-10-20T08:20:56Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Quality	Maintenance
	@babel/traverse@7.22.1 ⏵ 7.15.0			^-2
	is-core-module@2.12.1 ⏵ 2.6.0	⁺¹
	function-bind@1.1.2 ⏵ 1.1.1	⁺¹	⁺²
	moment@2.29.1
	@babel/helper-get-function-arity@7.14.5
	toggle-selection@1.0.6
	@babel/helper-hoist-variables@7.18.6 ⏵ 7.14.5
	@babel/helper-split-export-declaration@7.18.6 ⏵ 7.14.5
	@babel/highlight@7.18.6 ⏵ 7.14.5
	@babel/code-frame@7.21.4 ⏵ 7.14.5
	@babel/helper-function-name@7.21.0 ⏵ 7.14.5
	array-tree-filter@2.1.0
	@babel/helper-validator-identifier@7.19.1 ⏵ 7.14.9
	strip-ansi@6.0.0
	lodash.truncate@4.4.2
	@babel/template@7.21.9 ⏵ 7.14.5
	string-convert@0.2.1
	prelude-ls@1.2.1
	astral-regex@2.0.0
	json-stable-stringify-without-jsonify@1.0.1
	json2mq@0.2.0
	text-table@0.2.0
	functional-red-black-tree@1.0.1
	loose-envify@1.4.0
	resolve-from@5.0.0 ⏵ 4.0.0	⁺¹	^-1
	deep-is@0.1.3
	shallowequal@1.1.0
	lodash.clonedeep@4.5.0
	require-from-string@2.0.2
	levn@0.4.1
	sprintf-js@1.0.3
	lodash.merge@4.6.2
	@humanwhocodes/object-schema@1.2.0
See 84 more rows in the dashboard

View full report

Copilot

Pull Request Overview

This PR improves handling of unpaired Unicode surrogates in JavaScript/TypeScript by replacing the implicit lone_surrogates: bool marker with an explicit Wtf8Atom type. This breaking change requires users to call to_string_lossy() or as_str() to convert WTF-8 atoms to valid UTF-8, making surrogate handling more explicit and preventing silent data loss.

Key Changes

Replaced Atom with Wtf8Atom for string literal values and template element cooked values
Added WTF-8 handling utilities and conversion functions throughout the codebase
Updated all consumers to handle potential surrogate pair scenarios explicitly

Reviewed Changes

Copilot reviewed 152 out of 205 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
Multiple `package.json` files	Fixed missing newline before closing brace
`packages/helpers/package.json`	Added export entry for new TypeScript helper
`crates/swc_typescript/src/fast_dts/`	Updated to handle `Wtf8Atom` with fallback conversion
`crates/swc_ecma_utils/src/unicode.rs`	Added Unicode surrogate pair utilities
`crates/swc_ecma_utils/src/lib.rs`	Added WTF-8 string handling functions
`crates/swc_ecma_transforms_*/`	Updated string operations to use WTF-8 APIs
`crates/swc_ecma_minifier/`	Updated string optimization to handle surrogates
`crates/swc_ecma_parser/`	Core lexer changes to produce `Wtf8Atom` tokens
Test files	Updated expected outputs and added surrogate tests

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

h-a-n-a force-pushed the surrogates branch from 8e7c72b to 6973720 Compare October 11, 2025 10:32

h-a-n-a force-pushed the surrogates branch 6 times, most recently from f3c6816 to 6d05f37 Compare October 15, 2025 09:36

h-a-n-a force-pushed the surrogates branch 6 times, most recently from ad91796 to aede87f Compare October 17, 2025 10:27

h-a-n-a force-pushed the surrogates branch from aede87f to 3362795 Compare October 20, 2025 08:16

h-a-n-a force-pushed the surrogates branch from 3362795 to 9df7324 Compare October 20, 2025 08:43

feat: init

e9740e5

h-a-n-a force-pushed the surrogates branch from 9df7324 to e9740e5 Compare October 20, 2025 09:05

h-a-n-a marked this pull request as ready for review October 20, 2025 09:38

h-a-n-a requested review from a team as code owners October 20, 2025 09:38

Copilot AI review requested due to automatic review settings October 20, 2025 09:38

Copilot AI reviewed Oct 20, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix(es/ast): Fix unicode unpaired surrogates handling #11144

fix(es/ast): Fix unicode unpaired surrogates handling #11144

h-a-n-a commented Oct 11, 2025 •

edited

Loading

Uh oh!

changeset-bot bot commented Oct 11, 2025 •

edited

Loading

Uh oh!

socket-security bot commented Oct 11, 2025 •

edited

Loading

Uh oh!

codspeed-hq bot commented Oct 15, 2025 •

edited

Loading

Uh oh!

socket-security bot commented Oct 20, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

Uh oh!

fix(es/ast): Fix unicode unpaired surrogates handling #11144

Are you sure you want to change the base?

fix(es/ast): Fix unicode unpaired surrogates handling #11144

Conversation

h-a-n-a commented Oct 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

changeset-bot bot commented Oct 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

socket-security bot commented Oct 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codspeed-hq bot commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CodSpeed Performance Report

Merging #11144 will degrade performances by 2.22%

Summary

Benchmarks breakdown

Footnotes

Uh oh!

socket-security bot commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Key Changes

Reviewed Changes

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

h-a-n-a commented Oct 11, 2025 •

edited

Loading

changeset-bot bot commented Oct 11, 2025 •

edited

Loading

socket-security bot commented Oct 11, 2025 •

edited

Loading

codspeed-hq bot commented Oct 15, 2025 •

edited

Loading

socket-security bot commented Oct 20, 2025 •

edited

Loading