Describe the bug
Starting with Vite 8.0.14 (which bumps rolldown 1.0.0-rc.18 → 1.0.2 and lands #22342 "pass oxc jsx options to transformSync in dependency scan"), the dep-optimizer corrupts lone-surrogate Unicode escape sequences in CommonJS-imported string literals.
For an input regex string like:
// node_modules/@vscode/markdown-it-katex/.../katex.js (CJS, transitively loads KaTeX)
var tokenRegexString = "([!-\\[\\]-‧-豈-�][̀-ͯ]*|[\uD800-\uDBFF][\uDC00-\uDFFF][̀-ͯ]*|..."
The dep-optimized output at node_modules/.vite/deps/@vscode_markdown-it-katex.js is:
"([!-\\[\\]-‧-豈-�][̀-ͯ]*|[<FFFD>d800-<FFFD>dbff][<FFFD>dc00-<FFFD>dfff][̀-ͯ]*..."
The escapes whose decoded values are valid BMP characters (‧, , , 豈, �) survived. But the escapes whose decoded values are lone surrogates (\uD800, \uDBFF, \uDC00, \uDFFF) were each turned into a U+FFFD REPLACEMENT CHARACTER followed by the literal text d800 etc. The JS regex engine parses the resulting [<FFFD>d800-<FFFD>dbff] as a character class containing <FFFD>, d, 8, 0, b, f, and the range 0-<FFFD> (U+0030 through U+FFFD). The class therefore no longer means "one high surrogate": it accepts almost any code unit from the digit 0 upward, including the backslash and ASCII letters.
For KaTeX's lexer (which uses this regex), the practical impact is that every control word of more than one letter is truncated after its first letter. The corrupted surrogate-pair alternative most likely matches the backslash plus the next character as a two-unit token, so \sqrt tokenizes as \s (red error) plus the letters qrt as math italics, breaking all rendered math.
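The effect on the character class can be checked in isolation. The sketch below uses simplified stand-ins for the surrogate-pair alternative only (the names intactPair and corruptPair are illustrative, and this is not KaTeX's full tokenRegexString), built without the u flag as an assumption about how the lexer compiles its pattern:

```javascript
// Intended alternative: one high surrogate followed by one low surrogate.
const intactPair = new RegExp("[\uD800-\uDBFF][\uDC00-\uDFFF]");

// Corrupted alternative as found in the optimized output:
// U+FFFD followed by the literal hex digits, which creates the range 0-\uFFFD.
const corruptPair = new RegExp("[\uFFFDd800-\uFFFDdbff][\uFFFDdc00-\uFFFDdfff]");

// The intact class ignores a backslash followed by a letter.
console.log(intactPair.test("\\s")); // false

// The corrupted class accepts it, so "\s" is consumed as a two-unit token.
console.log(corruptPair.exec("\\sqrt")[0]); // \s
```

This is consistent with the \s + qrt rendering described above, though it does not prove which alternative in the real lexer regex produces the match.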
This appears to be a UTF-8 round-trip step in the dep-optimizer's minification path that doesn't handle lone surrogates (UTF-8 explicitly cannot represent them; the encoder substitutes U+FFFD).
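A minimal sketch of the substitution a UTF-8 encoder performs on a lone surrogate, using the standard TextEncoder and TextDecoder globals. It only illustrates the general mechanism and makes no claim about which code in Vite, rolldown, or oxc is responsible:

```javascript
// Round-trip a lone high surrogate through UTF-8.
const loneSurrogate = "\uD800";
const bytes = new TextEncoder().encode(loneSurrogate);
const roundTripped = new TextDecoder().decode(bytes);

// The encoder cannot represent U+D800, so it emits the bytes for U+FFFD.
console.log(Array.from(bytes, (b) => b.toString(16)).join(" ")); // ef bf bd
console.log(roundTripped === "\uFFFD"); // true
```

Note that a plain round-trip yields a bare U+FFFD with no trailing hex digits. The <FFFD>d800 shape in the optimized output therefore suggests an additional encoding step is involved; this is an observation from the output, not a diagnosis.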
Reproduction
https://github.com/CluesOverride/td148-vite-lone-surrogate-repro
Steps to reproduce
git clone https://github.com/CluesOverride/td148-vite-lone-surrogate-repro && cd td148-vite-lone-surrogate-repro
npm install
npx vite dev
- Open node_modules/.vite/deps/@vscode_markdown-it-katex.js and search for dbff. You'll see <U+FFFD>d800-<U+FFFD>dbff instead of \uD800-\uDBFF.
- Open the browser to http://localhost:5173/. The page's self-check table will show SOURCE PRESERVED? NO, and \sqrt{x} renders as \s (red) + literal math-italic qrtx.
- Re-install with npm install vite@8.0.13 --save-exact and re-run npx vite dev. The cache file then contains preserved \uD800 escapes with zero U+FFFD characters, and KaTeX renders correctly.
The bug is specific to the dev-mode dep-optimizer: vite build preserves the escapes (the build path uses a different minifier code path).
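The manual search in the steps above can also be scripted. This is a rough sketch, assuming it is run with Node from the repro's root directory; the helper name countReplacementChars is illustrative and not part of the repro repository:

```javascript
const fs = require("node:fs");

// Count U+FFFD REPLACEMENT CHARACTER occurrences in a piece of text.
function countReplacementChars(text) {
  return text.split("\uFFFD").length - 1;
}

// Path of the optimized dependency, as given in the steps above.
const depFile = "node_modules/.vite/deps/@vscode_markdown-it-katex.js";

if (fs.existsSync(depFile)) {
  const source = fs.readFileSync(depFile, "utf8");
  console.log(`U+FFFD occurrences: ${countReplacementChars(source)}`);
} else {
  console.log(`${depFile} not found; run npx vite dev first`);
}
```

A non-zero count under 8.0.14 and a zero count under 8.0.13 would match the behavior described in this report.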
Expected behavior
The optimized bundle should preserve the original string semantics. Lone-surrogate escape sequences in JS string literals should either:
- Remain as \uXXXX escape sequences in the output (safest), OR
- Be preserved as round-tripping JavaScript string values (each lone surrogate is one UTF-16 code unit; "\uD800".charCodeAt(0) === 0xD800 is well-defined).
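To illustrate the first option, here is a hedged sketch of a printer step that keeps lone surrogates as \uXXXX escapes while leaving well-formed surrogate pairs untouched. The function escapeLoneSurrogates is hypothetical; it is not an existing Vite, rolldown, or oxc API, and it deliberately ignores other escaping concerns such as quotes and backslashes:

```javascript
// Emit lone surrogates as \uXXXX escapes; copy everything else through unchanged.
function escapeLoneSurrogates(value) {
  let out = "";
  for (let i = 0; i < value.length; i++) {
    const unit = value.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 1 < value.length) {
      const next = value.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Well-formed pair: safe to emit as-is, since UTF-8 can encode it.
        out += value[i] + value[i + 1];
        i++;
        continue;
      }
    }
    if (isHigh || isLow) {
      out += "\\u" + unit.toString(16).toUpperCase().padStart(4, "0");
    } else {
      out += value[i];
    }
  }
  return out;
}

console.log(escapeLoneSurrogates("[\uD800-\uDBFF]")); // [\uD800-\uDBFF]
```

The printed text contains only ASCII, so it survives any later UTF-8 encoding step and still evaluates to the original code units when parsed as a string literal.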
Vite 8.0.11 / 8.0.12 / 8.0.13 (rolldown 1.0.0-rc.18) preserve the strings correctly. The behavior change in 8.0.14 (rolldown 1.0.2 + #22342) appears to assume that strings can be re-encoded via a UTF-8 round-trip, which silently corrupts lone surrogates.
Actual behavior
The ranges \uD800-\uDBFF and \uDC00-\uDFFF become <U+FFFD>d800-<U+FFFD>dbff and <U+FFFD>dc00-<U+FFFD>dfff. The JS regex engine parses the corrupted character classes differently than the source intended, and downstream consumers (KaTeX's lexer) fail. Six U+FFFD characters appear in node_modules/.vite/deps/@vscode_markdown-it-katex.js under 8.0.14, zero under 8.0.13.
System Info
  System:
    OS: macOS 26.5
    CPU: (14) arm64 Apple M4 Pro
    Memory: 396.00 MB / 48.00 GB
    Shell: 5.9 - /bin/zsh
  Binaries:
    Node: 22.22.2 - /Users/austinfee/.nvm/versions/node/v22.22.2/bin/node
    Yarn: 1.22.22
    npm: 10.9.7
    pnpm: 11.2.2
  Browsers:
    Chrome: 148.0.7778.179
    Firefox: 149.0
    Safari: 26.5
  npmPackages:
    vite: 8.0.14 => 8.0.14
    rolldown (transitive): 1.0.2
Used Package Manager
npm
Logs
Not applicable — bug is in the optimized bundle bytes on disk, not in vite --debug console output.
Validations
transformSync bug; cross-filing as appropriate may help)