fix(es/parser): fix panic on JSX surrogate entities#11803
fix(es/parser): fix panic on JSX surrogate entities#11803cuyl wants to merge 2 commits intoswc-project:mainfrom
Conversation
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9c8856281f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| format!("&{s};"), | ||
| )); | ||
| } | ||
| } else if prev_result != NO_PREV_RESULT { |
There was a problem hiding this comment.
Preserve lone low-surrogate entities instead of dropping them
Low surrogates are only handled when prev_result is already set; when a low surrogate appears first (for example �), this condition is false and the function falls through with an empty value, so callers append nothing and the entity silently disappears from JSX text/attribute values. Even for invalid entities, silently deleting source content is a parsing regression and should be replaced with explicit fallback behavior (e.g., replacement char or raw preservation).
Useful? React with 👍 / 👎.
Binary Sizes
Commit: 118153a |
Merging this PR will degrade performance by 2.06%
|
| Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|
| ❌ | es/large/minify/libraries/terser |
511.2 ms | 521.9 ms | -2.06% |
| ⚡ | es/lints/libs/three |
50.5 ms | 49.3 ms | +2.26% |
Comparing cuyl:jsx-entity (b94cb60) with main (6028240)2
Footnotes
-
31 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
-
No successful run was found on
main(6f07c6c) during the generation of this report, so 6028240 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report. ↩
This commit fixes a panic (RuntimeError: unreachable) that occurred when the JSX parser encountered numeric entities representing UTF-16 surrogates (e.g., ��). The previous implementation used `char::from_u32(...).expect(...)`, which panics if the value is a surrogate or exceeds the Unicode range. This is now replaced with a safer `parse_from_code` helper that returns an error instead of panicking. Additionally, the lexer now supports merging consecutive JSX entities that form a UTF-16 surrogate pair into a single Unicode character. This matches the behavior of other JSX parsers and ensures that characters outside the Basic Multilingual Plane are correctly decoded when represented as surrogate pairs. Changes: - Replaced panicking `char::from_u32` with safe numeric entity parsing. - Updated `read_jsx_entity` to return `String` to support multi-code-unit characters. - Implemented merging logic for UTF-16 surrogate pairs across consecutive entities. - Added test cases covering surrogate pairs in both JSX attributes and text. Fixes swc-project#11802.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b94cb6033b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| } else { | ||
| let value = char::from_u32(result as u32).unwrap().to_string(); | ||
| return Ok((value, format!("&{s};"))); |
There was a problem hiding this comment.
Flush buffered high surrogate before returning next code point
If the lexer buffers a high surrogate from one entity and the following &#...; is a non-surrogate scalar (for example �a), this branch returns only the second character and drops the buffered surrogate entirely. That silently deletes source content in JSX text/attributes for mixed surrogate/non-surrogate sequences instead of preserving or replacing the unmatched surrogate.
Useful? React with 👍 / 👎.
|
I changed some generated code manually (crates/swc_ecma_visit/src/generated.rs), actually I don't know how to generated the code. |
This commit fixes a panic (RuntimeError: unreachable) that occurred when the JSX parser
encountered numeric entities representing UTF-16 surrogates (e.g., ��).
The previous implementation used
char::from_u32(...).expect(...), which panics if thevalue is a surrogate or exceeds the Unicode range. This is now replaced with a safer
parse_from_codehelper that returns an error instead of panicking.Additionally, the lexer now supports merging consecutive JSX entities that form a
UTF-16 surrogate pair into a single Unicode character. This matches the behavior of
other JSX parsers and ensures that characters outside the Basic Multilingual Plane
are correctly decoded when represented as surrogate pairs.
Changes:
char::from_u32with safe numeric entity parsing.read_jsx_entityto returnStringto support multi-code-unit characters.Fixes #11802.