Use Erlang string case folding. Add word mappings to normalize#111
Merged
vkatsuba
approved these changes
Nov 13, 2025
Contributor
Pull Request Overview
This pull request introduces a new module for comprehensive Unicode string normalization and refactors the existing case conversion functions to use Erlang's built-in Unicode-aware string functions.
Key Changes:
- Created `z_string_normalize` module with `normalize/1` function that performs lowercasing via `string:casefold/1`, sanitization, and transliteration to ASCII for multiple language scripts
- Refactored `z_string:to_lower/1` and `to_upper/1` to use `string:casefold/1` and `string:uppercase/1` respectively, replacing custom character-by-character conversion logic
- Added customizable word mapping system via CSV file that uses `persistent_term` for efficient lookups, enabling language-specific transliterations (e.g., Ukrainian city names)
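
To illustrate the `to_lower/1` and `to_upper/1` refactoring, here is a minimal sketch of wrappers that delegate to the standard `string` module. The module name `case_sketch` and the `demo/0` helper are hypothetical, and the real `z_string` functions may accept more input types than shown here.

```erlang
-module(case_sketch).
-export([to_lower/1, to_upper/1, demo/0]).

%% Hypothetical wrappers, not the actual z_string code.
-spec to_lower(unicode:chardata()) -> unicode:chardata().
to_lower(S) -> string:casefold(S).

-spec to_upper(unicode:chardata()) -> unicode:chardata().
to_upper(S) -> string:uppercase(S).

%% Case folding is more aggressive than lowercasing: the German
%% sharp s folds to "ss", while string:lowercase/1 leaves it alone.
demo() ->
    Folded  = to_lower(<<"Straße"/utf8>>),
    Lowered = string:lowercase(<<"Straße"/utf8>>),
    Upper   = to_upper(<<"Michał"/utf8>>),
    {Folded, Lowered, Upper}.
```

Because case folding targets caseless comparison rather than display, callers that show the result of `to_lower/1` to users may see differences such as "ß" becoming "ss".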
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `src/z_string_normalize.erl` | New module implementing Unicode normalization with transliteration rules for Cyrillic, Ukrainian, Polish, Turkish, and Hebrew scripts, plus custom word mapping support |
| `src/z_string.erl` | Simplified `to_lower/1` and `to_upper/1` to delegate to built-in Erlang string functions; updated `normalize/1` to call new `z_string_normalize` module |
| `test/z_string_test.erl` | Added test case to verify word mapping functionality (Cyrillic "Одесса" → "odesa") |
| `priv/normalize-words-mapping.csv` | CSV file containing custom word mappings for city names in multiple languages |
| `src/z_mochinum.erl` | Minor test improvement: added explicit positive sign to floating-point zero literal for clarity |
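
The word mapping described for `priv/normalize-words-mapping.csv` could work roughly as in the sketch below. This is not the code from the pull request: the module name `word_map_sketch`, the `persistent_term` key, and the two-column `source,target` CSV layout are all assumptions made for illustration.

```erlang
-module(word_map_sketch).
-export([load/1, lookup/1]).

%% Hypothetical persistent_term key; the real module may use another.
-define(KEY, {?MODULE, word_map}).

%% Parse "source,target" lines (assumed layout) into a map and
%% store the map once in persistent_term.
-spec load(binary()) -> ok.
load(Csv) ->
    Lines = binary:split(Csv, <<"\n">>, [global, trim_all]),
    Map = lists:foldl(fun add_line/2, #{}, Lines),
    persistent_term:put(?KEY, Map).

add_line(Line, Acc) ->
    case binary:split(string:trim(Line), <<",">>) of
        [From, To] when From =/= <<>> ->
            %% Keys are case folded so lookups match folded input.
            Acc#{string:casefold(From) => To};
        _ ->
            Acc  % skip blank or malformed lines
    end.

%% Return the mapped word, or the word itself when no mapping exists.
%% The caller is expected to pass an already case-folded word.
-spec lookup(binary()) -> binary().
lookup(Word) ->
    Map = persistent_term:get(?KEY, #{}),
    maps:get(Word, Map, Word).
```

For example, after `word_map_sketch:load(<<"одесса,odesa\n"/utf8>>)`, a call to `word_map_sketch:lookup(<<"одесса"/utf8>>)` returns `<<"odesa">>`. `persistent_term` suits this kind of load-once table: reads do not copy the term, while updates are comparatively expensive.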
Co-authored-by: Copilot <[email protected]>
This pull request introduces a new module, `z_string_normalize`, which provides comprehensive Unicode string normalization and transliteration to ASCII, supporting multiple languages and custom word mappings. It also adds a test to verify specific word normalization behavior. This module has been split from `z_string`.

In `z_string` the functions `to_lower/1` and `to_upper/1` now use `string:casefold/1` and `string:uppercase/1` instead of their own mappings.

String normalization and transliteration:
- Added `z_string_normalize` with a `normalize/1` function that lowercases, sanitizes, and transliterates Unicode strings to ASCII, including support for Cyrillic, Ukrainian, Polish, Turkish, and Hebrew scripts. The normalization also handles HTML entities and various accented characters.
- Added custom word mappings, loaded from a CSV file and stored in `persistent_term`, allowing for efficient and customizable normalization of specific words (e.g., language-specific transliterations).

Testing:
- Added `normalize_map_words_test` to ensure that the normalization correctly maps "Одесса" (in Cyrillic) to "odesa" using the custom word mapping.
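
The normalization steps described above can be sketched as a small pipeline: case fold, look up whole words, then transliterate the remaining characters. Everything in this example is hypothetical (the module name `normalize_sketch`, the tiny tables, the order of the steps, and the choice to drop unmapped characters); the actual `z_string_normalize` module covers far more scripts and also handles HTML entities.

```erlang
-module(normalize_sketch).
-export([normalize/1]).

%% Hypothetical tables, kept tiny for illustration only.
words() -> #{<<"одесса"/utf8>> => <<"odesa">>}.
chars() -> #{$ł => <<"l">>, $ö => <<"o">>, $é => <<"e">>}.

-spec normalize(unicode:chardata()) -> binary().
normalize(Text) ->
    %% Step 1: case fold and make sure we work on a binary.
    Folded = unicode:characters_to_binary(string:casefold(Text)),
    %% Step 2: handle each space-separated word on its own.
    Words = binary:split(Folded, <<" ">>, [global]),
    Mapped = [normalize_word(W) || W <- Words],
    iolist_to_binary(lists:join(<<" ">>, Mapped)).

normalize_word(Word) ->
    case maps:find(Word, words()) of
        {ok, Ascii} ->
            Ascii;  % a whole-word mapping wins
        error ->
            %% Step 3: transliterate character by character.
            << <<(translit(C))/binary>> || <<C/utf8>> <= Word >>
    end.

%% ASCII passes through; unmapped characters are dropped (sketch choice).
translit(C) when C < 128 -> <<C>>;
translit(C) -> maps:get(C, chars(), <<>>).
```

With these tables, `normalize_sketch:normalize(<<"Одесса"/utf8>>)` returns `<<"odesa">>`, mirroring the mapping that `normalize_map_words_test` checks in the real module.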