Add Unicode Normalization for Search Indexing #13384

tokuhirom · 2025-02-23T15:02:27Z

- Introduced the `html_search_unicode_normalization` configuration option to specify the Unicode normalization form (NFC, NFD, NFKC, NFKD) for search indexing.
- Updated the HTML builder to pass the normalization configuration to the search indexer.
- Modified the `IndexBuilder` and `_feed_visit_nodes` functions to apply the specified Unicode normalization to document text before indexing.
- Updated the JavaScript search tools to normalize search queries using the specified normalization form.
- Added documentation for the new configuration option in `configuration.rst`.
- Implemented a test case to verify that full-width characters like 'Ｐｙｔｈｏｎ' are normalized and indexed as 'python'.

Purpose

This pull request introduces Unicode normalization for search indexing, allowing for consistent handling of text across different Unicode representations. This change improves the accuracy and reliability of search results.

Context and Background

The need for normalization was identified to handle cases where text input might vary in form, such as full-width and half-width characters in Japanese text.
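To make the difference between the four forms concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `normalized_forms` is hypothetical and is not part of this pull request; it only illustrates that the compatibility forms (NFKC, NFKD) fold full-width Latin letters to ASCII, while the canonical forms (NFC, NFD) leave them unchanged.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalized_forms(text: str) -> dict:
    """Hypothetical helper: return the text under each normalization form."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


full_width = "Ｐｙｔｈｏｎ"
forms = normalized_forms(full_width)

# Canonical forms keep the full-width letters as they are.
print(forms["NFC"] == full_width)   # True
# Compatibility forms map them to their ASCII equivalents.
print(forms["NFKC"])                # Python
print(forms["NFKD"].lower())        # python
```

Lowercasing is shown as a separate step here; how the indexer actually combines the two is not described in this thread.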


tokuhirom · 2025-02-23T15:33:32Z

Note:
String.prototype.normalize is supported by all modern browsers.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

jayaddison · 2025-02-24T18:05:21Z

Thank you @tokuhirom - the suggested changes look good to me. Is it possible to add a test to the tests/js directory to confirm that a client query behaves as expected (accompanying the Python test coverage)? Let me know if any updates to the test framework would help to support that.

AA-Turner · 2025-02-24T19:28:05Z

Do we need a config option? Can we just make normalisation the default?

tokuhirom · 2025-02-25T03:36:43Z

@jayaddison
Thanks, I added a test case to searchtools.spec.js. It checks that the code automatically normalizes the query.

tokuhirom · 2025-02-25T03:39:44Z

Thanks for the suggestion, @AA-Turner! I see where you're coming from about making normalization the default. Here’s why I think keeping it as an option for now makes sense:

- Backward Compatibility: We want to make sure we’re not breaking anything for folks already using the current setup.
- Language-Specific Needs: Different languages might need different normalization types. For instance, as a Japanese speaker, I find NFKC or NFKD works best for handling full-width and half-width characters. Other languages might have their own preferences.

So, for now, I think having it as an option gives everyone the flexibility they need.
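As an illustration of the Japanese case, the sketch below compares NFKC and NFKD on half-width katakana. The variable names are made up for the example. Both forms fold the half-width characters to full-width, but NFKC yields the composed character while NFKD yields a base character plus a combining mark, so the index and the query need to use the same form.

```python
import unicodedata

# Half-width katakana KA (U+FF76) followed by the half-width voiced sound mark (U+FF9E).
half_width_ga = "\uff76\uff9e"
# Full-width katakana GA (U+30AC), the usual way to write the same syllable.
full_width_ga = "\u30ac"

nfkc = unicodedata.normalize("NFKC", half_width_ga)
nfkd = unicodedata.normalize("NFKD", half_width_ga)

# NFKC composes to the single code point U+30AC.
print(nfkc == full_width_ga)  # True
# NFKD decomposes to KA (U+30AB) plus the combining voiced sound mark (U+3099).
print(nfkd == "\u30ab\u3099")  # True
# Under either form, half-width and full-width input end up identical.
print(unicodedata.normalize("NFKD", full_width_ga) == nfkd)  # True
```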

AA-Turner · 2025-02-25T03:53:48Z

@tokuhirom are you using an LLM, out of interest?

tokuhirom · 2025-02-25T03:57:45Z

@AA-Turner
I'm using an LLM to write English since I'm not a native English speaker (it helps me).
I'm writing the code myself, since I can write better code than LLMs :p

AA-Turner · 2025-02-25T04:02:00Z

Sorry to ask the question -- there are more and more low-quality contributions driven by LLMs/bots, it's nice to know a human is on the other side! Thank you for the PR, too!

I still think normalising the search index by default makes sense, as e.g. searching for full-width 'Ｐｙｔｈｏｎ' vs ASCII 'Python' should produce the same results. It should be fine as long as we also normalise the search query, unless you can think of a counterexample?

A
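The point about normalising both sides can be sketched with a toy inverted index. This is not Sphinx's `IndexBuilder`; the functions `build_index` and `search`, the sample documents, and the choice of NFKC are all assumptions made for illustration.

```python
import unicodedata
from collections import defaultdict

FORM = "NFKC"  # assumed form for this sketch


def normalize(text: str) -> str:
    """Normalize and lowercase text the same way for indexing and querying."""
    return unicodedata.normalize(FORM, text).lower()


def build_index(docs: dict) -> dict:
    """Map each normalized word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in normalize(text).split():
            index[word].add(name)
    return index


def search(index: dict, query: str) -> set:
    """Return the documents that contain every normalized query term."""
    terms = normalize(query).split()
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


docs = {
    "intro": "Ｐｙｔｈｏｎ documentation",
    "api": "Python API reference",
}
index = build_index(docs)

# Full-width and ASCII queries hit the same documents.
print(search(index, "Ｐｙｔｈｏｎ") == search(index, "Python"))  # True
```

If only the index were normalized and the query were not, the full-width query would find nothing, which is why the two sides have to agree.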

tokuhirom · 2025-02-25T04:09:53Z

np ;p

I can't find any counterexample, but I'm not familiar with languages other than Japanese and English. There's no problem with setting a default value, but I don't know what the best default value is :)


jayaddison · 2025-02-25T11:16:11Z

tests/js/searchtools.spec.js

+      const orig = DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION;
+      try {
+        DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION = 'NFKC';
+        [_searchQuery, searchterms, excluded, ..._remainingItems] = Search._parseQuery('Ｓｐｈｉｎｘ');


An optional suggestion: perhaps we could run a second query using Search._parseQuery('sphinx') (no normalization), and then assert that the results are identical?

I see. I added it in 6fb1fb7.


tokuhirom · 2025-02-28T06:25:36Z

@AA-Turner
I set NFKD as the default Unicode normalization form :)

jayaddison · 2025-03-05T10:53:15Z

tests/js/searchtools.spec.js

+
+        expect(halfWidthQuery).toEqual(fullWidthQuery);
+      } finally {
+        DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION = orig; // restore


Hmm. Copying and then restoring the normalization option here seems inelegant. This may be a limitation of the test suite at the moment -- the fixtures vary for each test, but the documentation options do not.

In addition, I am considering how this might interact with #13395 or similar features in future.

I think the arrangement is OK for the moment -- but perhaps we should consider extracting a separate Search._normalizeQuery function before _parseQuery occurs.

What do you think @tokuhirom?

First, the root cause is that the Search module depends on a global variable, which makes unit testing hard.
It would be easier if Sphinx had an API like this:

<script src="searchtools.js"></script>
<script>
  new Search({ fileSuffix: '{{ file_suffix }}', ... }).initialize();
</script>

I would suggest rewriting the APIs like this, but that is too big a change.

For now, I moved the cleanup to the afterEach block and split out the _normalizeQuery method.
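The testing benefit of passing options in, rather than reading a global, can be sketched in Python. The `SearchOptions` and `Search` classes below are hypothetical and only mirror the shape of the constructor-based API proposed above; they are not existing Sphinx code.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchOptions:
    """Hypothetical options object, passed in instead of read from a global."""
    unicode_normalization: str = "NFKD"


class Search:
    def __init__(self, options: SearchOptions) -> None:
        self.options = options

    def normalize_query(self, query: str) -> str:
        # The form comes from this instance, so tests can vary it per case.
        return unicodedata.normalize(self.options.unicode_normalization, query)


# Each test builds its own instance; nothing has to be saved and restored.
nfkc_search = Search(SearchOptions(unicode_normalization="NFKC"))
nfc_search = Search(SearchOptions(unicode_normalization="NFC"))

print(nfkc_search.normalize_query("Ｓｐｈｉｎｘ"))  # Sphinx
print(nfc_search.normalize_query("Ｓｐｈｉｎｘ") == "Ｓｐｈｉｎｘ")  # True
```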

That sounds good to me; thank you.

With those changes, the def normalize function in the Python code will be a mirror of the function _normalizeQuery in the searchtools.js code. So therefore: it might be worth adding a comment in the Python code to explain that (but not in the JS code -- because not everyone reading the JS code will be able to view the Python code).
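A sketch of what such a comment could look like on the Python side. The signature of `normalize` is an assumption, since the real function is not shown in this thread; only the idea of the cross-reference comment is taken from the suggestion above.

```python
import unicodedata


def normalize(text: str, form: str) -> str:
    """Apply Unicode normalization to text before it is indexed."""
    # NOTE: keep in sync with Search._normalizeQuery in
    # sphinx/themes/basic/static/searchtools.js, which applies the same
    # normalization form to the user's query in the browser.
    return unicodedata.normalize(form, text)


print(normalize("Ｓｐｈｉｎｘ", "NFKC"))  # Sphinx
```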

PS: I think an additional git push may be required (your explanation makes sense, but I don't find the updated commits in the branch at the moment)

oops. pushed.


jayaddison

I think it might make sense to remove the if (...) condition from the _parseQuery function and relocate that into _normalizeQuery.

Basically: always follow the same code/function call path, but sometimes some steps may be no-ops.
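The "same call path, sometimes a no-op" idea can be sketched as follows. This is a Python analogue of the JavaScript under review, with the form passed as a parameter instead of read from `DOCUMENTATION_OPTIONS`; the function names and the simplistic tokenization are assumptions for the example.

```python
import unicodedata
from typing import List, Optional


def normalize_query(query: str, form: Optional[str]) -> str:
    """Normalize the query, or return it unchanged when no form is configured."""
    if not form:
        return query  # no-op step: the caller does not need its own check
    return unicodedata.normalize(form, query)


def parse_query(query: str, form: Optional[str]) -> List[str]:
    # Always call the normalization step; it decides whether to do anything.
    query = normalize_query(query, form)
    return query.lower().split()


print(parse_query("Ｓｐｈｉｎｘ search", "NFKC"))  # ['sphinx', 'search']
print(parse_query("Sphinx Search", None))          # ['sphinx', 'search']
```

Keeping the condition inside the normalization step means the parser has a single path to test, whatever the configuration is.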

jayaddison · 2025-03-10T10:02:32Z

sphinx/themes/basic/static/searchtools.js

+  _normalizeQuery: (query, form) => {
+      return query.normalize(form);
+  },


Suggested change

-  _normalizeQuery: (query, form) => {
-      return query.normalize(form);
-  },
+  _normalizeQuery: (query) => {
+      const form = DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION;
+      if (!form) return query;
+      return query.normalize(form);
+  },

jayaddison · 2025-03-10T10:03:38Z

sphinx/themes/basic/static/searchtools.js

  _parseQuery: (query) => {
+    if (DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION) {
+        query = Search._normalizeQuery(query, DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION);
+    }
+
    // stem the search terms and add them to the correct list


Suggested change

-  _parseQuery: (query) => {
-    if (DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION) {
-        query = Search._normalizeQuery(query, DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION);
-    }
-    // stem the search terms and add them to the correct list
+  _parseQuery: (query) => {
+    query = Search._normalizeQuery(query);
+    // stem the search terms and add them to the correct list

tokuhirom force-pushed the feature-html_search_unicode_normalization branch 3 times, most recently from bf89eb3 to ea7a39b on February 23, 2025 15:17

tokuhirom force-pushed the feature-html_search_unicode_normalization branch from ea7a39b to a228fad on February 23, 2025 15:20

AA-Turner added this to the 8.3.0 milestone Feb 23, 2025

add js test case for unicode normalization

629820f

Merge branch 'master' into feature-html_search_unicode_normalization

1c2443e

jayaddison reviewed Feb 25, 2025


jayaddison reviewed Feb 25, 2025


Use single rst files to test the normalization.

6fb1fb7

jayaddison reviewed Feb 27, 2025


tokuhirom added 2 commits February 28, 2025 15:05

just test the return value of the _parseQuery.

20d2749

Make 'NFKD' as a default value of the unicode normalization

fac3a53

tokuhirom added 2 commits March 3, 2025 11:03

Merge branch 'master' into feature-html_search_unicode_normalization

79069f2

Merge branch 'master' into feature-html_search_unicode_normalization

9476676

jayaddison reviewed Mar 5, 2025


tokuhirom added 3 commits March 10, 2025 07:09

Revert "just test the return value of the _parseQuery."

2af0301

This reverts commit 20d2749.

re-add 'should parse queries with half-width and full-width characters equivalently' test case.

9ae1e39

move cleanup process to the afterEach step.

f03b8e1

Split _normalizeQuery method.

Merge branch 'master' into feature-html_search_unicode_normalization

661ee11

jayaddison reviewed Mar 10, 2025
