Skip to content

Fix UTF8 encoding of invalid character #3191

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

alberk8
Copy link
Contributor

@alberk8 alberk8 commented Jun 30, 2025

Description

  1. Fix ConvertFromUTF8 to properly assign the U+FFFD character for invalid UTF8 character, for UTF8 1, 2, 3 or 4 bytes. This match up with .Net 8 UTF8 result using the unit test.
  2. Support 4 byte UTF8 assignment.
  3. Fix CountNumberOfCharacters
  4. Fix CLR_GFX_Font::StringOut and CLR_GFX_Font::CountCharactersInWidth. maxChars starts out as -1 and decrementing it will result in infinite loop due to the corrected ConvertFromUTF8 UTF-8 character.
  5. Unit Test updated in System.Text

Motivation and Context

How Has This Been Tested?

Screenshots

Types of changes

  • Improvement (non-breaking change that improves a feature, code or algorithm)
  • Bug fix (non-breaking change which fixes an issue with code or algorithm)
  • New feature (non-breaking change which adds functionality to code)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Config and build (change in the configuration and build system, has no impact on code or features)
  • Dev Containers (changes related with Dev Containers, has no impact on code or features)
  • Dependencies/declarations (update dependencies or assembly declarations and changes associated, has no impact on code or features)
  • Documentation (changes or updates in the documentation, has no impact on code or features)

Checklist

  • My code follows the code style of this project (only if there are changes in source code).
  • My changes require an update to the documentation (there are changes that require the docs website to be updated).
  • I have updated the documentation accordingly (the changes require an update on the docs in this repo).
  • I have read the CONTRIBUTING document.
  • I have tested everything locally and all new and existing tests passed (only if there are changes in source code).

Summary by CodeRabbit

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling and validation of UTF-8 and UTF-16 string encoding and decoding, resulting in more robust and accurate processing of Unicode text.
    • Enhanced error handling for invalid or malformed Unicode sequences, reducing potential issues with string operations.
  • Refactor

    • Updated internal string handling logic for greater correctness and consistency in character counting and conversion.
    • Adjusted font rendering methods with minor code cleanup for clarity.
    • Refactored string search functionality to improve modularity and maintain consistent behavior.

Copy link

coderabbitai bot commented Jun 30, 2025

"""

Walkthrough

The changes overhaul UTF-8 and UTF-16 string handling in core utility code, focusing on explicit validation, robust error handling, and correct character counting and conversion. The IndexOf method for strings is refactored to use a new helper function for substring matching. Font rendering code now uses accurate character counts from the Unicode helper.

Changes

File(s) Change Summary
src/CLR/CorLib/corlib_native_System_String.cpp Refactored IndexOf method to use new static inline helper MatchString for substring matching with UTF-8 decoding.
src/CLR/Core/CLR_RT_UnicodeHelper.cpp Rewrote UTF-8/UTF-16 counting, conversion, and backward movement logic for explicit validation and robust error handling; updated UnicodeString assignment and release methods.
src/nanoFramework.Graphics/.../Fonts/Font.cpp Added blank line after UTF-8 input initialization in CountCharactersInWidth method (whitespace only).

Sequence Diagram(s)

sequenceDiagram
    participant Font as CLR_GFX_Font
    participant Helper as CLR_RT_UnicodeHelper

    Font->>Helper: CountNumberOfCharacters()
    Helper-->>Font: Returns accurate character count
    Font->>Font: Iterate over string using correct character limit
Loading
sequenceDiagram
    participant Helper as CLR_RT_UnicodeHelper

    Note over Helper: UTF-8 to UTF-16 Conversion
    Helper->>Helper: Validate input bytes
    alt Valid sequence
        Helper->>Helper: Convert to UTF-16 code unit(s)
    else Invalid sequence
        Helper->>Helper: Output U+FFFD replacement character
    end
    Helper-->>Caller: Return converted string or error
Loading
sequenceDiagram
    participant String as corlib_native_System_String
    participant Helper as CLR_RT_UnicodeHelper

    String->>Helper: MatchString(inputIter, searchStr, searchCharLen)
    Helper->>Helper: Compare input and search string char-by-char
    alt All match
        Helper-->>String: Return true
    else Mismatch or end of input
        Helper-->>String: Return false
    end
Loading

Assessment against linked issues

Objective Addressed Explanation
Replace exception on invalid UTF-8 sequences in Encoding.UTF8.GetString with replacement char U+FFFD (Issue #1648)
Ensure UTF-8 decoding does not throw but returns replacement characters on invalid sequences (Issue #1648)
Correctly handle embedded nulls and invalid bytes in UTF-8 strings without exceptions (Issue #1648)
Maintain existing API signatures while improving UTF-8 robustness (Issue #1648)

Assessment against linked issues: Out-of-scope changes

No out-of-scope changes detected related to the linked issue objectives.
"""


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd08073 and ed12c02.

📒 Files selected for processing (1)
  • src/CLR/Core/CLR_RT_UnicodeHelper.cpp (9 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3144
File: azure-pipelines-templates/download-install-cmake.yml:1-3
Timestamp: 2025-04-07T14:39:17.549Z
Learning: In the nanoframework/nf-interpreter repository, Unix-style line endings (\n) are not mandatory, even for YAML files, despite static analysis tools like YAMLlint flagging them as errors.
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (11)
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-09-23T17:56:37.570Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-10-08T15:52:09.445Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3074
File: src/CLR/Core/GarbageCollector_Info.cpp:107-167
Timestamp: 2025-01-22T03:38:57.394Z
Learning: In nanoFramework's memory management code, DataSize() validation is comprehensively handled through CLR_RT_HeapCluster::ValidateBlock() and other caller code. Additional size checks in ValidateCluster() are redundant as the validation is already performed at multiple levels.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3074
File: src/CLR/Core/GarbageCollector_Info.cpp:107-167
Timestamp: 2025-01-22T03:38:57.394Z
Learning: In CLR_RT_GarbageCollector::ValidateCluster, DataSize() validation is already handled by ValidateBlock() and other caller code, making additional size checks redundant.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3158
File: src/CLR/CorLib/corlib_native_System_Guid.cpp:29-30
Timestamp: 2025-04-29T11:34:00.204Z
Learning: For UUID version 4 (random) generation in .NET's byte order representation, the version bits should be set in byte 7 using `(buf[7] & 0x0F) | 0x40` and the variant bits should be set in byte 9 using `(buf[9] & 0x3F) | 0x80`. This differs from the standard RFC 4122 byte layout due to .NET's byte-swapping behavior.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-08T15:52:09.445Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3172
File: src/CLR/Core/TypeSystem.cpp:971-999
Timestamp: 2025-05-14T17:26:54.181Z
Learning: In the nanoFramework codebase, when handling HRESULT values from function calls like `parser.Advance(elem)`, it's preferable to use `if (parser.Advance(elem) != S_OK)` for local error handling rather than using the `NANOCLR_CHECK_HRESULT` macro, which would immediately propagate errors up the call stack.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3190
File: src/CLR/Core/TypeSystem.cpp:0-0
Timestamp: 2025-06-26T09:16:55.184Z
Learning: In nanoFramework's CLR attribute parsing (src/CLR/Core/TypeSystem.cpp), the sentinel value 0xFFFF in string tokens represents a null string. When encountered, this should result in a true null reference (using SetObjectReference(nullptr)) rather than an empty string instance, and the boxing operation should be skipped via early return.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-12T19:00:39.000Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: nf-interpreter (Check_Code_Style)
  • GitHub Check: nf-interpreter (Nightly build) (Check_Build_Options)
  • GitHub Check: nf-interpreter (Check_Build_Options)
🔇 Additional comments (8)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (8)

9-25: LGTM! Improved UTF-8 validation macros.

The macro changes enhance clarity and correctness:

  • UTF8_VALID_CONTINUATION now uses explicit range validation instead of negative checks
  • UTF8_VALID_ASCII provides clear ASCII validation
  • The new helper macros standardize continuation byte processing

28-169: Excellent UTF-8 character counting implementation with proper validation.

The rewritten function correctly handles:

  • UTF-8 sequence validation for all byte lengths
  • Proper counting of supplementary characters as 2 UTF-16 code units
  • Invalid sequence handling by advancing over valid continuation bytes
  • Null termination handling

However, clarify the maxRemaining logic: the function decrements maxRemaining for each input byte consumed but checks maxRemaining > 0 in the loop condition. This appears to limit input bytes rather than output characters. Is this the intended behavior?


171-218: LGTM! Proper UTF-16 to UTF-8 byte counting with surrogate handling.

The function correctly:

  • Handles surrogate pairs and counts them as 4 bytes
  • Treats unpaired surrogates as 3 bytes (UTF-8 encoding of U+FFFD replacement character)
  • Maintains proper loop control with the max parameter

232-494: Excellent UTF-8 to UTF-16 conversion with proper invalid sequence handling.

The rewritten function significantly improves:

  • Structured validation with lookahead for continuation bytes
  • Proper U+FFFD replacement character output for invalid sequences
  • Correct surrogate pair generation for supplementary characters
  • Comprehensive range validation for all UTF-8 sequence types

The implementation correctly handles the issues mentioned in the PR objectives and maintains consistency with .NET 8 UTF-8 behavior.


501-593: LGTM! Robust backward UTF-8 navigation with proper validation.

The rewritten function correctly:

  • Validates UTF-8 sequence boundaries when moving backward
  • Handles incomplete sequences by treating invalid lead bytes as single characters
  • Only updates the position when the full requested character count is achieved
  • Provides clear return value semantics for partial vs. complete moves

595-743: LGTM! Enhanced UTF-16 to UTF-8 conversion with proper surrogate handling.

The function improvements include:

  • Proper handling of embedded null characters instead of early termination
  • Correct validation and pairing of surrogate pairs
  • U+FFFD replacement character output for unpaired surrogates
  • Comprehensive buffer size validation

These changes align with the PR objectives for robust UTF-8 encoding.


818-863: LGTM! Improved error handling and state management in UnicodeString.

The changes enhance robustness:

  • Input reset before conversion prevents stale state issues
  • Proper cleanup of allocated memory on conversion failure
  • Consistent state initialization in the Release method

These improvements prevent potential memory leaks and ensure proper object state management.


80-135: Confirmed: Duplicate continuation byte validation issues have been resolved.

The code now correctly validates each continuation byte individually:

  • 3-byte sequences: checks pSrc[0], then pSrc[1] sequentially
  • 4-byte sequences: checks pSrc[0], pSrc[1], pSrc[2] sequentially

This addresses the duplicate validation concerns raised in previous reviews.

✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (3)

89-104: Remove commented-out code.

The commented code block is unnecessary and reduces readability.

                 else
                 {
                     // Advance over valid continuation bytes in invalid sequence
-                    // int advance = 0;
-                    // if (UTF8_VALID_CONTINUATION(pSrc[0])) {
-                    //     advance++;
-                    //     if (UTF8_VALID_CONTINUATION(pSrc[1])) {
-                    //         advance++;
-                    //     }
-                    // }
-                    // pSrc += advance;
-                    // remaining -= advance;
-
                     pSrc += validCount;
                     remaining -= validCount;
                 }

277-286: Remove commented-out code.

Clean up the commented code for better readability.

             int validCount = 0;
             if (UTF8_VALID_CONTINUATION(ch2)) validCount++;
             if (UTF8_VALID_CONTINUATION(ch3)) validCount++;
            
             
             if (validCount <2) {
-                //inputUTF8 += validCount;
-                //iMaxBytes -= validCount;
                 goto invalid_sequence;
             }

329-334: Remove commented-out code.

Clean up the commented code.

             if (validCount < 3) {
                 // Not all continuation bytes are valid
-                //inputUTF8 += validCount;
-                //iMaxBytes -= validCount;
                 goto invalid_sequence;
             }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8b977d and 2061946.

📒 Files selected for processing (3)
  • src/CLR/CorLib/corlib_native_System_String.cpp (1 hunks)
  • src/CLR/Core/CLR_RT_UnicodeHelper.cpp (7 hunks)
  • src/nanoFramework.Graphics/Graphics/Core/Support/Fonts/Font.cpp (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
src/nanoFramework.Graphics/Graphics/Core/Support/Fonts/Font.cpp (2)
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-08T15:52:09.445Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
src/CLR/CorLib/corlib_native_System_String.cpp (1)
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3172
File: src/CLR/Core/TypeSystem.cpp:971-999
Timestamp: 2025-05-14T17:26:54.181Z
Learning: In the nanoFramework codebase, when handling HRESULT values from function calls like `parser.Advance(elem)`, it's preferable to use `if (parser.Advance(elem) != S_OK)` for local error handling rather than using the `NANOCLR_CHECK_HRESULT` macro, which would immediately propagate errors up the call stack.
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (8)
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3074
File: src/CLR/Core/GarbageCollector_Info.cpp:107-167
Timestamp: 2025-01-22T03:38:57.394Z
Learning: In nanoFramework's memory management code, DataSize() validation is comprehensively handled through CLR_RT_HeapCluster::ValidateBlock() and other caller code. Additional size checks in ValidateCluster() are redundant as the validation is already performed at multiple levels.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-10-08T15:52:09.445Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-09-23T17:56:37.570Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-08T15:52:09.445Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-12T19:00:39.000Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3190
File: src/CLR/Core/TypeSystem.cpp:0-0
Timestamp: 2025-06-26T09:16:55.184Z
Learning: In nanoFramework's CLR attribute parsing (src/CLR/Core/TypeSystem.cpp), the sentinel value 0xFFFF in string tokens represents a null string. When encountered, this should result in a true null reference (using SetObjectReference(nullptr)) rather than an empty string instance, and the boxing operation should be skipped via early return.
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: nf-interpreter (Nightly build) (Build_STM32_targets ST_NUCLEO64_F091RC)
  • GitHub Check: nf-interpreter (Nightly build) (Build_STM32_targets ORGPAL_PALX)
  • GitHub Check: nf-interpreter (Nightly build) (Build_STM32_targets ORGPAL_PALTHREE)
  • GitHub Check: nf-interpreter (Nightly build) (Check_Build_Options)
  • GitHub Check: nf-interpreter (Check_Build_Options)
🔇 Additional comments (10)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (6)

10-22: Improved UTF-8 validation macros.

The new positive validation macros are clearer and follow UTF-8 specifications correctly.


156-192: Correct UTF-16 byte counting with proper surrogate handling.

The implementation correctly handles surrogate pairs and invalid sequences, outputting appropriate byte counts for each case.


410-486: Well-implemented UTF-8 backward traversal.

The rewritten function correctly identifies UTF-8 sequence boundaries when moving backward and properly validates sequences.


488-637: Robust UTF-16 to UTF-8 conversion with proper error handling.

The rewritten function correctly handles surrogate pairs, unpaired surrogates, and buffer management. The replacement character (U+FFFD) is properly encoded for invalid sequences.


734-742: Proper error handling for UTF-8 conversion failure.

Good addition of error handling that properly cleans up allocated memory on conversion failure.


754-756: Fixed memory leak in Release method.

Setting m_wCharArray to NULL after release prevents use-after-free issues.

src/nanoFramework.Graphics/Graphics/Core/Support/Fonts/Font.cpp (2)

158-158: Correct use of actual UTF-8 character count.

Using the actual character count from the UTF-8 string ensures accurate iteration limits for font rendering.


329-329: Consistent character counting in width calculation.

Good consistency with the StringOut method - both now use the accurate UTF-8 character count.

src/CLR/CorLib/corlib_native_System_String.cpp (2)

588-628: New string matching helper looks correct.

The MatchString function properly handles UTF-8 to UTF-16 conversion and surrogate pairs for accurate string comparison.


245-425: Please verify the IndexOf helper implementation still exists

I searched for the static helper declared in src/CLR/CorLib/corlib_native.h:

  • Header declares
    static HRESULT IndexOf(CLR_RT_StackFrame &stack, int mode);

  • I could not locate its definition in any .cpp file under src/CLR/CorLib

Without that implementation, all the public IndexOf overloads would be broken.
Can you confirm whether the helper was moved or intentionally removed?

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (3)

74-75: Fix duplicate continuation byte check.

The code checks pSrc[0] twice instead of checking both continuation bytes.

-                 int validCount = 0;
-                if (UTF8_VALID_CONTINUATION(pSrc[0])) validCount++;
-                if (UTF8_VALID_CONTINUATION(pSrc[0]) && UTF8_VALID_CONTINUATION(pSrc[1])) validCount++;
+                 int validCount = 0;
+                if (UTF8_VALID_CONTINUATION(pSrc[0])) validCount++;
+                if (validCount == 1 && UTF8_VALID_CONTINUATION(pSrc[1])) validCount++;

115-121: Fix duplicate continuation byte checks for 4-byte sequences.

Similar to the 3-byte case, this checks pSrc[0] multiple times instead of checking all three continuation bytes.

-                if (UTF8_VALID_CONTINUATION(pSrc[0]))  validCount++;
-                if (UTF8_VALID_CONTINUATION(pSrc[0]) && UTF8_VALID_CONTINUATION(pSrc[1]))  validCount++;
-                if (UTF8_VALID_CONTINUATION(pSrc[0]) && UTF8_VALID_CONTINUATION(pSrc[1]) && 
-                    UTF8_VALID_CONTINUATION(pSrc[2]))  validCount++;
+                if (UTF8_VALID_CONTINUATION(pSrc[0]))  validCount++;
+                if (validCount == 1 && UTF8_VALID_CONTINUATION(pSrc[1]))  validCount++;
+                if (validCount == 2 && UTF8_VALID_CONTINUATION(pSrc[2]))  validCount++;

324-327: Fix duplicate continuation byte validation.

The code checks ch2 twice instead of checking different continuation bytes.

-            if (UTF8_VALID_CONTINUATION(ch2)) validCount++;
-            if (UTF8_VALID_CONTINUATION(ch2) && UTF8_VALID_CONTINUATION(ch3))  validCount++;
-            if (UTF8_VALID_CONTINUATION(ch2) &&UTF8_VALID_CONTINUATION(ch3) && 
-                UTF8_VALID_CONTINUATION(ch4))  validCount++;
+            if (UTF8_VALID_CONTINUATION(ch2)) validCount++;
+            if (validCount == 1 && UTF8_VALID_CONTINUATION(ch3))  validCount++;
+            if (validCount == 2 && UTF8_VALID_CONTINUATION(ch4))  validCount++;
🧹 Nitpick comments (2)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (2)

31-31: Consider potential integer overflow edge case.

While 0x7fffffff is a reasonable large value, consider using INT_MAX for consistency and clarity about the intent.

-    int remaining = (max < 0) ? 0x7fffffff : max;
+    int remaining = (max < 0) ? INT_MAX : max;

214-216: Consider using INT_MAX for consistency.

For better code clarity and consistency with standard practices.

-    if (iMaxBytes == -1) {
-        iMaxBytes = INT_MAX;
-    }
+    if (iMaxBytes == -1) {
+        iMaxBytes = INT_MAX;
+    }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d11b44b and 650bf47.

📒 Files selected for processing (1)
  • src/CLR/Core/CLR_RT_UnicodeHelper.cpp (7 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3144
File: azure-pipelines-templates/download-install-cmake.yml:1-3
Timestamp: 2025-04-07T14:39:17.549Z
Learning: In the nanoframework/nf-interpreter repository, Unix-style line endings (\n) are not mandatory, even for YAML files, despite static analysis tools like YAMLlint flagging them as errors.
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (10)
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3074
File: src/CLR/Core/GarbageCollector_Info.cpp:107-167
Timestamp: 2025-01-22T03:38:57.394Z
Learning: In nanoFramework's memory management code, DataSize() validation is comprehensively handled through CLR_RT_HeapCluster::ValidateBlock() and other caller code. Additional size checks in ValidateCluster() are redundant as the validation is already performed at multiple levels.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-10-08T15:52:09.445Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-09-23T17:56:37.570Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3074
File: src/CLR/Core/GarbageCollector_Info.cpp:107-167
Timestamp: 2025-01-22T03:38:57.394Z
Learning: In CLR_RT_GarbageCollector::ValidateCluster, DataSize() validation is already handled by ValidateBlock() and other caller code, making additional size checks redundant.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3172
File: src/CLR/Core/TypeSystem.cpp:971-999
Timestamp: 2025-05-14T17:26:54.181Z
Learning: In the nanoFramework codebase, when handling HRESULT values from function calls like `parser.Advance(elem)`, it's preferable to use `if (parser.Advance(elem) != S_OK)` for local error handling rather than using the `NANOCLR_CHECK_HRESULT` macro, which would immediately propagate errors up the call stack.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-08T15:52:09.445Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-12T19:00:39.000Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3190
File: src/CLR/Core/TypeSystem.cpp:0-0
Timestamp: 2025-06-26T09:16:55.184Z
Learning: In nanoFramework's CLR attribute parsing (src/CLR/Core/TypeSystem.cpp), the sentinel value 0xFFFF in string tokens represents a null string. When encountered, this should result in a true null reference (using SetObjectReference(nullptr)) rather than an empty string instance, and the boxing operation should be skipped via early return.
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: nf-interpreter (Nightly build) (Check_Build_Options)
  • GitHub Check: nf-interpreter (Check_Build_Options)
🔇 Additional comments (7)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (7)

10-11: LGTM: Improved UTF-8 validation macros.

The change from negative to positive validation checks makes the code more explicit and readable.


173-185: LGTM: Proper surrogate pair and invalid surrogate handling.

The logic correctly handles valid surrogate pairs (4 bytes) and invalid/unpaired surrogates (3 bytes for replacement character).


374-387: LGTM: Proper invalid sequence handling with replacement character.

The use of U+FFFD (0xFFFD) as replacement character aligns with Unicode standards and .NET 8 behavior as stated in the PR objectives.


411-486: LGTM: Robust backward UTF-8 navigation.

The rewritten function properly handles UTF-8 sequence boundaries, validates continuation bytes, and ensures atomic character movement. The logic correctly treats invalid lead bytes as single characters.


534-608: LGTM: Comprehensive surrogate pair handling.

The code properly handles both valid surrogate pairs and unpaired surrogates by outputting replacement characters (U+FFFD) for invalid cases, aligning with Unicode standards.


734-742: LGTM: Proper error handling in conversion.

The function correctly resets the UTF-8 input before conversion and properly cleans up allocated memory on conversion failure, preventing memory leaks.


754-756: LGTM: Consistent resource cleanup.

The Release method properly nulls the pointer and resets the length, ensuring consistent state after cleanup.

alberk8 and others added 6 commits June 30, 2025 12:38
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Automated fixes for code style.
…50ea-cc3d-4001-bfa7-f9c5cade556f

Code style fixes for nanoframework/nf-interpreter PR#3191
Copy link
Member

@josesimoes josesimoes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Albert, this is looking great. Well done!
Mostly style comments in the review. Nothing personal, just trying to keep the consistency and code quality. 😅

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (7)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (7)

33-33: Clarify variable name and magic number.

The variable name remainingChar could be more descriptive, and the magic number 0x7fffffff needs explanation.

Consider renaming to remainingChars or maxRemaining and adding a comment explaining why 0x7fffffff is used as the default maximum.

-    int remainingChar = (max < 0) ? 0x7fffffff : max;
+    int remainingChars = (max < 0) ? 0x7fffffff : max; // Use max int when no limit specified

40-43: Add explicit braces for consistency.

Missing braces around the single-statement if block.

-        if (ch == 0) 
+        if (ch == 0)
+        {
             break;
+        }

122-122: Add explicit braces for consistency.

Missing braces around the single-statement if block.

-                if (UTF8_VALID_CONTINUATION(pSrc[0]))
+                if (UTF8_VALID_CONTINUATION(pSrc[0]))
+                {
                     validCount++;
+                }

273-276: Add explicit braces for consistency.

Missing braces around the goto statement.

-            if (iMaxBytes < 1)
+            if (iMaxBytes < 1)
+            {
                 goto invalid_sequence;
+            }

471-473: Add explicit braces for consistency.

Missing braces around the return statement.

-        if (outputUTF16_size < 1)
+        if (outputUTF16_size < 1)
+        {
             return false;
+        }

78-87: Fix flawed continuation byte validation logic.

The 3-byte sequence validation has a critical flaw - it checks both pSrc[0] and pSrc[1] independently but only increments validCount once when both are valid. This doesn't properly track individual byte validity.

Apply this fix to validate continuation bytes sequentially:

                 int validCount = 0;
                 if (UTF8_VALID_CONTINUATION(pSrc[0]))
                 {
                     validCount++;
                 }

-                if (UTF8_VALID_CONTINUATION(pSrc[0]) && UTF8_VALID_CONTINUATION(pSrc[1]))
+                if (validCount == 1 && UTF8_VALID_CONTINUATION(pSrc[1]))
                 {
                     validCount++;
                 }

324-333: Fix duplicate continuation byte validation in 3-byte sequences.

Similar to the CountNumberOfCharacters function, this validation logic is flawed - it checks ch2 and ch3 independently rather than sequentially.

Apply this fix to validate continuation bytes properly:

             int validCount = 0;
             if (UTF8_VALID_CONTINUATION(ch2))
             {
                 validCount++;
             }

-            if (UTF8_VALID_CONTINUATION(ch3))
+            if (validCount == 1 && UTF8_VALID_CONTINUATION(ch3))
             {
                 validCount++;
             }
🧹 Nitpick comments (1)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (1)

175-179: Improve readability of counter logic.

The counter decrement logic could be clearer about when max is decremented.

         CLR_UINT16 ch = *pSrc++;
-        if (max > 0)
-            max--;
+        if (max > 0)
+        {
+            max--;
+        }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d75d9cc and e38a36d.

📒 Files selected for processing (3)
  • src/CLR/CorLib/corlib_native_System_String.cpp (5 hunks)
  • src/CLR/Core/CLR_RT_UnicodeHelper.cpp (9 hunks)
  • src/nanoFramework.Graphics/Graphics/Core/Support/Fonts/Font.cpp (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/nanoFramework.Graphics/Graphics/Core/Support/Fonts/Font.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/CLR/CorLib/corlib_native_System_String.cpp
🧰 Additional context used
🧠 Learnings (1)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (11)
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3074
File: src/CLR/Core/GarbageCollector_Info.cpp:107-167
Timestamp: 2025-01-22T03:38:57.394Z
Learning: In nanoFramework's memory management code, DataSize() validation is comprehensively handled through CLR_RT_HeapCluster::ValidateBlock() and other caller code. Additional size checks in ValidateCluster() are redundant as the validation is already performed at multiple levels.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-09-23T17:56:37.570Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-10-08T15:52:09.445Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3074
File: src/CLR/Core/GarbageCollector_Info.cpp:107-167
Timestamp: 2025-01-22T03:38:57.394Z
Learning: In CLR_RT_GarbageCollector::ValidateCluster, DataSize() validation is already handled by ValidateBlock() and other caller code, making additional size checks redundant.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3158
File: src/CLR/CorLib/corlib_native_System_Guid.cpp:29-30
Timestamp: 2025-04-29T11:34:00.204Z
Learning: For UUID version 4 (random) generation in .NET's byte order representation, the version bits should be set in byte 7 using `(buf[7] & 0x0F) | 0x40` and the variant bits should be set in byte 9 using `(buf[9] & 0x3F) | 0x80`. This differs from the standard RFC 4122 byte layout due to .NET's byte-swapping behavior.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-08T15:52:09.445Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3172
File: src/CLR/Core/TypeSystem.cpp:971-999
Timestamp: 2025-05-14T17:26:54.181Z
Learning: In the nanoFramework codebase, when handling HRESULT values from function calls like `parser.Advance(elem)`, it's preferable to use `if (parser.Advance(elem) != S_OK)` for local error handling rather than using the `NANOCLR_CHECK_HRESULT` macro, which would immediately propagate errors up the call stack.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3190
File: src/CLR/Core/TypeSystem.cpp:0-0
Timestamp: 2025-06-26T09:16:55.184Z
Learning: In nanoFramework's CLR attribute parsing (src/CLR/Core/TypeSystem.cpp), the sentinel value 0xFFFF in string tokens represents a null string. When encountered, this should result in a true null reference (using SetObjectReference(nullptr)) rather than an empty string instance, and the boxing operation should be skipped via early return.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-12T19:00:39.000Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
🧬 Code Graph Analysis (1)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (1)
src/CLR/Include/nanoCLR_Runtime.h (4)
  • utf8StringStart (940-940)
  • string (959-959)
  • m_wCharArray (960-963)
  • m_length (964-967)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: nf-interpreter (Nightly build) (Check_Build_Options)
  • GitHub Check: nf-interpreter (Nightly build) (Check_Build_Options)
  • GitHub Check: nf-interpreter (Check_Build_Options)
🔇 Additional comments (9)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (9)

9-25: LGTM: Well-defined UTF-8 validation macros.

The explicit positive validation macros for ASCII and continuation bytes are clear and correct, replacing the previous negative checks.


203-206: LGTM: Proper handling of unpaired low surrogates.

The addition of explicit handling for unpaired low surrogates (counting them as 3 bytes for replacement character) is correct and aligns with Unicode standards.


448-466: LGTM: Structured error handling for invalid sequences.

The replacement of goto-based error handling with structured control flow and consistent U+FFFD output is a significant improvement over the previous implementation.


490-582: LGTM: Robust backward UTF-8 navigation.

The complete rewrite of MoveBackwardInUTF8 with proper sequence boundary detection, validation, and boundary checking is a significant improvement. The logic correctly handles:

  • UTF-8 sequence start detection
  • Continuation byte validation
  • Boundary checking
  • Partial move handling

596-597: LGTM: Proper embedded null handling.

Removing the early break on null characters allows proper handling of embedded nulls in UTF-16 strings, which is correct behavior.


630-647: LGTM: Correct unpaired surrogate handling.

The explicit handling of unpaired high surrogates with U+FFFD replacement character output aligns with Unicode standards and .NET behavior.


687-703: LGTM: Complete surrogate pair handling.

The addition of explicit unpaired low surrogate handling completes the surrogate pair validation and ensures consistent U+FFFD output for invalid sequences.


829-837: LGTM: Improved error handling in Assign method.

The addition of input reset and proper cleanup on conversion failure prevents memory leaks and ensures consistent state.


849-851: LGTM: Complete state reset in Release method.

Setting both m_wCharArray to NULL and m_length to 0 ensures complete object state reset.

nfbot and others added 2 commits July 1, 2025 09:54
Automated fixes for code style.
…844b-406f-4fb2-a2d5-6c62fe1def26

Code style fixes for nanoframework/nf-interpreter PR#3191
@josesimoes
Copy link
Member

@alberk8 please take a look at coderabbitai comments above. It may have found some edge cases...

alberk8 and others added 4 commits July 1, 2025 18:41
@alberk8 alberk8 requested a review from josesimoes July 2, 2025 02:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Encoding.UTF8.GetString cause an exception
4 participants