Skip to content

Conversation

h-a-n-a
Copy link
Contributor

@h-a-n-a h-a-n-a commented Oct 11, 2025

Description:

This PR reapplied #10987 and replaced lone_surrogates: bool mark with Wtf8Atom. lone_surrogates introduced in #10987 is more of an implicit mark, whereas the use of Wtf8Atom makes it more explicit where users need to explicitly call to_string_lossy to get the lossy UTF-8 result, making a huge difference from the original implementation where users need to convert \uFFFDxxx into the correct unicode.

BREAKING CHANGE:

Both cooked in TemplateElement and value in StringLiteral are replaced with Wtf8Atom introduced in #11104.

Wtf8Atom does not expose a to_string method like the old Atom does. Internally, it stores the code points of the characters. You can call code_points() to get an iterator of the code points or call to_string_lossy() to get an lossy string in which all unpaired surrogates are replaced with U+FFFD(Replacement Character).

Related issue (if exists):

Copy link

changeset-bot bot commented Oct 11, 2025

⚠️ No Changeset found

Latest commit: e9740e5

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Copy link

socket-security bot commented Oct 11, 2025

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Action Severity Alert  (click "▶" to expand/collapse)
Warn Critical
[email protected] has a Critical CVE.

CVE: GHSA-fjxv-7rqg-78g4 form-data uses unsafe random function in form-data for choosing boundary (CRITICAL)

Affected versions: < 2.5.4; >= 3.0.0 < 3.0.4; >= 4.0.0 < 4.0.4

Patched version: 2.5.4

From: ?npm/[email protected]

ℹ Read more on: This package | This alert | What is a critical CVE?

Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at [email protected].

Suggestion: Remove or replace dependencies that include known critical CVEs. Consumers can use dependency overrides or npm audit fix --force to remove vulnerable dependencies.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore npm/[email protected]. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

Warn High
[email protected] has Obfuscated code.

Confidence: 0.96

Location: Package overview

From: ?npm/[email protected]npm/[email protected]

ℹ Read more on: This package | This alert | What is obfuscated code?

Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at [email protected].

Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore npm/[email protected]. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

View full report

@h-a-n-a h-a-n-a force-pushed the surrogates branch 6 times, most recently from f3c6816 to 6d05f37 Compare October 15, 2025 09:36
Copy link

codspeed-hq bot commented Oct 15, 2025

CodSpeed Performance Report

Merging #11144 will degrade performances by 2.22%

Comparing h-a-n-a:surrogates (e9740e5) with main (101c3b7)1

Summary

⚡ 1 improvement
❌ 1 regression
✅ 138 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark BASE HEAD Change
es/parser/cal-com 59.2 ms 60.6 ms -2.22%
es/transform/baseline/common_reserved_word 285 µs 278.9 µs +2.2%

Footnotes

  1. No successful run was found on main (3ef7ce0) during the generation of this report, so 101c3b7 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@h-a-n-a h-a-n-a force-pushed the surrogates branch 6 times, most recently from ad91796 to aede87f Compare October 17, 2025 10:27
**Description:**

This PR fixed an issue related to lone surrogates handling in Rust.

This fix's credits all go to Oxc team
swc-project#10978 (comment).
What I'm doing is porting the fix that was made in Oxc and make it
working under SWC.

The problem is related to the fundamental difference between how Rust
and JavaScript handle Unicode, especially lone surrogates.

**JavaScript's Unicode Model**

```javascript
// JavaScript allows this - lone surrogates are stored in UTF-16
let str = "\uD800";  // High surrogate alone - technically invalid Unicode
let obj = { "\uD800": "value" };  // Works fine in JS
```
JavaScript uses UTF-16 internally and tolerates invalid Unicode
sequences:
- Strings are UTF-16 code unit sequences, not Unicode scalar sequences
- Lone surrogates (U+D800-U+DFFF) are allowed and preserved
- No validation that surrogates come in proper high/low pairs
- Engine just stores the raw UTF-16 code units

**Rust's Unicode Model**

```rust
// This CANNOT exist in Rust:
let s = "\u{D800}";  // ❌ COMPILE ERROR - not a valid Unicode scalar
let c: char = '\u{D800}';  // ❌ COMPILE ERROR - char excludes surrogates
```
Rust enforces strict Unicode validity:
- String is UTF-8 and must contain valid Unicode scalar values
- char represents Unicode scalar values (U+0000-U+D7FF, U+E000-U+10FFFF)
- Surrogate code points (U+D800-U+DFFF) are explicitly excluded
- No way to represent lone surrogates in Rust's standard string types

1. AST Structure: Added `lone_surrogates: bool` field to `Str` and
`TplElement` structs to track when strings contain lone surrogates
2. Encoding Strategy: Lone surrogates are encoded using \u{FFFD}
(replacement character) followed by the original hex digits for internal
representation
3. Code Generation: Modified string output to properly escape lone
surrogates back to \uXXXX format during codegen
4. Test: Also fixed some cases related to member expression
optimizations and string concatenation optimizations

1. Add support for serializing and deserializing literals with lone
surrogates in `swc_estree_compat`
2. Reflect AST changes in `binding` crates

Breaks the AST by adding `lone_surrogates` field to `Str` and
`TplElement` and breaks the `value` and `cooked` respectly in `Str` and
`TplElement`. Both of the field is using `\u{FFFD}` (Replacement
Character) as an escape if `lone_surrogates` set to `true`.

To consume the real value, you need to first check if `lone_surrogates`
is `true`, then unescape it by removing the char and construct it with
the four trailing hexs(from `\u{FFFD}D800` to `\uD800`).

**Related issue:**

 - Closes swc-project#10978
 - Closes swc-project#10353

Fixed a regression of swc-project#7678
Copy link

socket-security bot commented Oct 20, 2025

@h-a-n-a h-a-n-a marked this pull request as ready for review October 20, 2025 09:38
@h-a-n-a h-a-n-a requested review from a team as code owners October 20, 2025 09:38
@Copilot Copilot AI review requested due to automatic review settings October 20, 2025 09:38
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves handling of unpaired Unicode surrogates in JavaScript/TypeScript by replacing the implicit lone_surrogates: bool marker with an explicit Wtf8Atom type. This breaking change requires users to call to_string_lossy() or as_str() to convert WTF-8 atoms to valid UTF-8, making surrogate handling more explicit and preventing silent data loss.

Key Changes

  • Replaced Atom with Wtf8Atom for string literal values and template element cooked values
  • Added WTF-8 handling utilities and conversion functions throughout the codebase
  • Updated all consumers to handle potential surrogate pair scenarios explicitly

Reviewed Changes

Copilot reviewed 152 out of 205 changed files in this pull request and generated no comments.

Show a summary per file
File Description
Multiple package.json files Fixed missing newline before closing brace
packages/helpers/package.json Added export entry for new TypeScript helper
crates/swc_typescript/src/fast_dts/ Updated to handle Wtf8Atom with fallback conversion
crates/swc_ecma_utils/src/unicode.rs Added Unicode surrogate pair utilities
crates/swc_ecma_utils/src/lib.rs Added WTF-8 string handling functions
crates/swc_ecma_transforms_*/ Updated string operations to use WTF-8 APIs
crates/swc_ecma_minifier/ Updated string optimization to handle surrogates
crates/swc_ecma_parser/ Core lexer changes to produce Wtf8Atom tokens
Test files Updated expected outputs and added surrogate tests

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant