-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Unicode Normalization for Search Indexing #13384
base: master
Are you sure you want to change the base?
Add Unicode Normalization for Search Indexing #13384
Conversation
bf89eb3
to
ea7a39b
Compare
- Introduced `html_search_unicode_normalization` configuration option to specify Unicode normalization form (NFC, NFD, NFKC, NFKD) for search indexing. - Updated the HTML builder to pass the normalization configuration to the search indexer. - Modified the `IndexBuilder` and `_feed_visit_nodes` functions to apply the specified Unicode normalization to document text before indexing. - Updated JavaScript search tools to normalize search queries using the specified normalization form. - Added documentation for the new configuration option in `configuration.rst`. - Implemented a test case to verify that full-width characters like 'Python' are normalized and indexed as 'python'.
ea7a39b
to
a228fad
Compare
Note: |
Thank you @tokuhirom - the suggested changes look good to me. Is it possible to add a test to the |
Do we need a config option? Can we just make normalisation the default? |
@jayaddison |
Thanks for the suggestion, @AA-Turner! I see where you're coming from about making normalization the default. Here’s why I think keeping it as an option for now makes sense:
So, for now, I think having it as an option gives everyone the flexibility they need. |
@tokuhirom are you using an LLM, out of interest? |
@AA-Turner |
Sorry to ask the question -- there are more and more low-quality contributions driven by LLMs/bots, it's nice to know a human is on the other side! Thank you for the PR, too! I still think normalising the search index by default makes sense, as eg searching for full width 'Python' vs ASCII 'Python' should produce the same results. It should be fine as long as we also normalise the search query, unless you can think of a counterexample? A |
np ;p i can't found any counter example. but i'm not familiar with other languages than Japanese and English. there's no issue about to set the default value. but i don't know about the best default value :) |
tests/js/searchtools.spec.js
Outdated
const orig = DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION; | ||
try { | ||
DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION = 'NFKC'; | ||
[_searchQuery, searchterms, excluded, ..._remainingItems] = Search._parseQuery('Sphinx'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An optional suggestion: perhaps we could run a second query using Search._parseQuery('sphinx')
(no normalization), and then assert that the results are identical?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see. i added it in 6fb1fb7
@AA-Turner |
tests/js/searchtools.spec.js
Outdated
|
||
expect(halfWidthQuery).toEqual(fullWidthQuery); | ||
} finally { | ||
DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION = orig; // restore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Copying and then restoring the normalization option here seems inelegant. This may be a limitation of the test suite at the moment -- the fixtures vary for each test, but the documentation options do not.
In addition, I am considering how this might interact with #13395 or similar features in future.
I think the arrangement is OK for the moment -- but perhaps we should consider extracting a separate Search._normalizeQuery
function before _parseQuery
occurs.
What do you think @tokuhirom?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at first, the root cause is the Search module depends on the global variable. It makes hard to do the unit testing.
I mean, if sphinx have an API like this,
<script src="searchtools.js"></script>
<script>
new Search({
fileSuffix: '{{ file_suffix }}',
...
}).initialize();
</script>
I suggest to rewrite the APIs like this. but so, it's the too big deal.
at the moment, i moved the cleanup function to afterEach
block. and split the _normalizeQuery
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me; thank you.
With those changes, the def normalize
function in the Python code will be a mirror of the function _normalizeQuery
in the searchtools.js
code. So therefore: it might be worth adding a comment in the Python code to explain that (but not in the JS code -- because not everyone reading the JS code will be able to view the Python code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS: I think an additional git push
may be required (your explanation makes sense, but I don't find the updated commits in the branch at the moment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops. pushed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might makes sense to remove the if (...)
condition from the _parseQuery
function and relocate that into _normalizeQuery
.
Basically: always follow the same code/function call path, but sometimes some steps may be no-ops.
_normalizeQuery: (query, form) => { | ||
return query.normalize(form); | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_normalizeQuery: (query, form) => { | |
return query.normalize(form); | |
}, | |
_normalizeQuery: (query) => { | |
const form = DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION; | |
if (!form) return query; | |
return query.normalize(form); | |
}, |
_parseQuery: (query) => { | ||
if (DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION) { | ||
query = Search._normalizeQuery(query, DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION); | ||
} | ||
|
||
// stem the search terms and add them to the correct list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_parseQuery: (query) => { | |
if (DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION) { | |
query = Search._normalizeQuery(query, DOCUMENTATION_OPTIONS.SEARCH_UNICODE_NORMALIZATION); | |
} | |
// stem the search terms and add them to the correct list | |
_parseQuery: (query) => { | |
query = Search._normalizeQuery(query); | |
// stem the search terms and add them to the correct list |
html_search_unicode_normalization
configuration option to specify Unicode normalization form (NFC, NFD, NFKC, NFKD) for search indexing.IndexBuilder
and_feed_visit_nodes
functions to apply the specified Unicode normalization to document text before indexing.configuration.rst
.Purpose
This pull request introduces Unicode normalization for search indexing, allowing for consistent handling of text across different Unicode representations. This change improves the accuracy and reliability of search results.
Context and Background
The need for normalization was identified to handle cases where text input might vary in form, such as full-width and half-width characters in Japanese text.