Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Unicode Normalization for Search Indexing #13384

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

tokuhirom
Copy link

@tokuhirom tokuhirom commented Feb 23, 2025

  • Introduced html_search_unicode_normalization configuration option to specify Unicode normalization form (NFC, NFD, NFKC, NFKD) for search indexing.
  • Updated the HTML builder to pass the normalization configuration to the search indexer.
  • Modified the IndexBuilder and _feed_visit_nodes functions to apply the specified Unicode normalization to document text before indexing.
  • Updated JavaScript search tools to normalize search queries using the specified normalization form.
  • Added documentation for the new configuration option in configuration.rst.
  • Implemented a test case to verify that full-width characters like 'Python' are normalized and indexed as 'python'.

Purpose

This pull request introduces Unicode normalization for search indexing, allowing for consistent handling of text across different Unicode representations. This change improves the accuracy and reliability of search results.

Context and Background

The need for normalization was identified to handle cases where text input might vary in form, such as full-width and half-width characters in Japanese text.

@tokuhirom tokuhirom force-pushed the feature-html_search_unicode_normalization branch 3 times, most recently from bf89eb3 to ea7a39b Compare February 23, 2025 15:17
-  Introduced `html_search_unicode_normalization` configuration option to specify Unicode normalization form (NFC, NFD, NFKC, NFKD) for search indexing.
-  Updated the HTML builder to pass the normalization configuration to the search indexer.
-  Modified the `IndexBuilder` and `_feed_visit_nodes` functions to apply the specified Unicode normalization to document text before indexing.
-  Updated JavaScript search tools to normalize search queries using the specified normalization form.
-  Added documentation for the new configuration option in `configuration.rst`.
-  Implemented a test case to verify that full-width characters like 'Python' are normalized and indexed as 'python'.
@tokuhirom tokuhirom force-pushed the feature-html_search_unicode_normalization branch from ea7a39b to a228fad Compare February 23, 2025 15:20
@tokuhirom
Copy link
Author

Note:
String.normalize was supported by all modern browsers.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

@AA-Turner AA-Turner added this to the 8.3.0 milestone Feb 23, 2025
@jayaddison
Copy link
Contributor

Thank you @tokuhirom - the suggested changes look good to me. Is it possible to add a test to the tests/js directory to confirm that a client query behaves as expected (accompanying the Python test coverage)? Let me know if any updates to the test framework would help to support that.

@AA-Turner
Copy link
Member

Do we need a config option? Can we just make normalisation the default?

@tokuhirom
Copy link
Author

tokuhirom commented Feb 25, 2025

@jayaddison
thanks, i added test case to the searchtools.spec.js. it checks the code automatically normalize the query.

@tokuhirom
Copy link
Author

Thanks for the suggestion, @AA-Turner! I see where you're coming from about making normalization the default. Here’s why I think keeping it as an option for now makes sense:

  1. Backward Compatibility: We want to make sure we’re not breaking anything for folks already using the current setup.

  2. Language-Specific Needs: Different languages might need different normalization types. For instance, as a Japanese speaker, I find NFKC or NFKD works best for handling full-width and half-width characters. Other languages might have their own preferences.

So, for now, I think having it as an option gives everyone the flexibility they need.

@AA-Turner
Copy link
Member

@tokuhirom are you using an LLM, out of interest?

@tokuhirom
Copy link
Author

@AA-Turner
i'm using LLM to write English since i'm not a native speaker of english(it helps me).
i'm writing the code by myself since i can write better code than LLMs :p

@AA-Turner
Copy link
Member

Sorry to ask the question -- there are more and more low-quality contributions driven by LLMs/bots, it's nice to know a human is on the other side! Thank you for the PR, too!

I still think normalising the search index by default makes sense, as eg searching for full width 'Python' vs ASCII 'Python' should produce the same results. It should be fine as long as we also normalise the search query, unless you can think of a counterexample?

A

@tokuhirom
Copy link
Author

np ;p

i can't found any counter example. but i'm not familiar with other languages than Japanese and English. there's no issue about to set the default value. but i don't know about the best default value :)

const orig = DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION;
try {
DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION = 'NFKC';
[_searchQuery, searchterms, excluded, ..._remainingItems] = Search._parseQuery('Sphinx');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An optional suggestion: perhaps we could run a second query using Search._parseQuery('sphinx') (no normalization), and then assert that the results are identical?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see. i added it in 6fb1fb7

@tokuhirom
Copy link
Author

@AA-Turner
I set NFKD as a default unicode normalization :)


expect(halfWidthQuery).toEqual(fullWidthQuery);
} finally {
DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION = orig; // restore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Copying and then restoring the normalization option here seems inelegant. This may be a limitation of the test suite at the moment -- the fixtures vary for each test, but the documentation options do not.

In addition, I am considering how this might interact with #13395 or similar features in future.

I think the arrangement is OK for the moment -- but perhaps we should consider extracting a separate Search._normalizeQuery function before _parseQuery occurs.

What do you think @tokuhirom?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at first, the root cause is the Search module depends on the global variable. It makes hard to do the unit testing.
I mean, if sphinx have an API like this,

<script src="searchtools.js"></script>
<script>
new Search({
  fileSuffix: '{{ file_suffix }}',
  ...
}).initialize();
</script>

I suggest to rewrite the APIs like this. but so, it's the too big deal.

at the moment, i moved the cleanup function to afterEach block. and split the _normalizeQuery method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me; thank you.

With those changes, the def normalize function in the Python code will be a mirror of the function _normalizeQuery in the searchtools.js code. So therefore: it might be worth adding a comment in the Python code to explain that (but not in the JS code -- because not everyone reading the JS code will be able to view the Python code).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS: I think an additional git push may be required (your explanation makes sense, but I don't find the updated commits in the branch at the moment)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops. pushed.

Copy link
Contributor

@jayaddison jayaddison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might makes sense to remove the if (...) condition from the _parseQuery function and relocate that into _normalizeQuery.

Basically: always follow the same code/function call path, but sometimes some steps may be no-ops.

Comment on lines +279 to +281
_normalizeQuery: (query, form) => {
return query.normalize(form);
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_normalizeQuery: (query, form) => {
return query.normalize(form);
},
_normalizeQuery: (query) => {
const form = DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION;
if (!form) return query;
return query.normalize(form);
},

Comment on lines 283 to 288
_parseQuery: (query) => {
if (DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION) {
query = Search._normalizeQuery(query, DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION);
}

// stem the search terms and add them to the correct list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_parseQuery: (query) => {
if (DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION) {
query = Search._normalizeQuery(query, DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION);
}
// stem the search terms and add them to the correct list
_parseQuery: (query) => {
query = Search._normalizeQuery(query);
// stem the search terms and add them to the correct list

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants