Add optional unicode normalization before passing strings to speech or braille #16466
Some answers/reactions:
I'm very sorry. I actually performed tests for ij and somehow assumed it also worked for œ, but it turns out it doesn't. For ij it definitely works:
I can reproduce the issue with ijsbeer with eSpeak, OneCore and Vocalizer Expressive.
With the Dutch eSpeak, bœr is pronounced as bor (i.e. it behaves as if only the o were present). Vocalizer and OneCore seem to ignore the ligature completely.
Same applies to Dutch.
The NFC form doesn't apply compatibility decompositions and therefore doesn't touch ligatures. While NFKD does decompose the ligature, it also decomposes á. Therefore, NFKC seems to be the only method that really makes sense. That said, I'd really like to know why œ is left alone while ij is decomposed correctly.
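For anyone who wants to verify this, the behavior of the different normalization forms can be checked with Python's standard `unicodedata` module (a quick sketch; the ligature ij is U+0133 and œ is U+0153):

```python
import unicodedata

lig_ij = "\u0133"    # LATIN SMALL LIGATURE IJ
lig_oe = "\u0153"    # LATIN SMALL LIGATURE OE
a_acute = "a\u0301"  # decomposed: "a" + COMBINING ACUTE ACCENT

# NFC composes but never applies compatibility mappings: ligatures survive.
print(unicodedata.normalize("NFC", lig_ij))         # ĳ (unchanged)

# NFKD applies compatibility mappings, but leaves á decomposed (two code points).
print(len(unicodedata.normalize("NFKD", a_acute)))  # 2

# NFKC decomposes ĳ to "ij" and composes a + combining acute into a single á.
print(unicodedata.normalize("NFKC", lig_ij))        # ij
print(len(unicodedata.normalize("NFKC", a_acute)))  # 1

# œ has no compatibility decomposition in Unicode at all,
# so even NFKC leaves it alone. This explains the observation above.
print(unicodedata.normalize("NFKC", lig_oe))        # œ
```

This confirms the observed asymmetry: ĳ carries a compatibility mapping to "ij" in the Unicode Character Database, while œ deliberately has none.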
This sounds like a good idea. Would it be possible to normalize only once if the user has both speech and braille enabled? While the overhead seems small to me, it wouldn't hurt to try to minimize it.
I think that even when we normalize on the TextInfo level, normalization would happen once for braille and once for speech. |
A question: could the normalization solve the problem of italic or bold Unicode characters that synthesizers can't read?
Do you have examples of this? I'm certainly willing to investigate. In the end, I'm searching for a normalization strategy that works best for everyone.
Hi @LeonarddeR, |
Yes, |
Thus, such a PR is fundamental, in my opinion ⚡
I created an offset converter that seems to do this reliably now. We can add this as an optional feature to speech and braille output. |
Why should this be optional? |
It should be configurable because there are use cases both with the feature on and with it off:
Closes #16466

Summary of the issue: Several speech synthesizers and braille tables are unable to speak or braille some characters, such as ligatures ("ij") or decomposed characters (Latin letters with a modifier to add an acute, diaeresis, etc.). Also, italic or bold Unicode characters can't be spoken or brailled by default.

Description of user facing changes: None by default. If Unicode normalization is enabled for speech, speech output for objects and text navigation is normalized. For braille, normalization is applied to all braille output. Speech applies normalization only for objects and text navigation on purpose: for individual character navigation or text selection, we really want to pass the original character to the synthesizer. If we didn't, Unicode bold and italic characters would be read as their normalized counterparts, which makes it impossible to distinguish them. This problem is less relevant when working with braille.

Description of development approach: Added UnicodeNormalizationOffsetConverter to textUtils, with tests. It stores the normalized version of a given string and, based on diffing, calculates offset mappings between the original and normalized strings. Braille translation processes its output using UnicodeNormalizationOffsetConverter when normalization is on; the several mappings (braille to raw position, raw to braille position) are adjusted to account for normalization. Added normalization to getTextInfoSpeech and getPropertiesSpeech.
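The offset-mapping idea can be sketched with `difflib` from the Python standard library. This is a simplified illustration only, not NVDA's actual UnicodeNormalizationOffsetConverter; the function name and the mapping policy for changed runs are my own assumptions:

```python
import unicodedata
from difflib import SequenceMatcher

def normalized_with_offsets(original: str, form: str = "NFKC"):
    """Return the normalized string plus a map from each offset in the
    normalized string back to an offset in the original string."""
    normalized = unicodedata.normalize(form, original)
    to_original = {}
    matcher = SequenceMatcher(None, original, normalized, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Unchanged run: offsets map one to one.
            for offset in range(j2 - j1):
                to_original[j1 + offset] = i1 + offset
        else:
            # Changed run (e.g. "ĳ" expanded to "ij"): map every
            # normalized offset back to the start of the original run.
            for j in range(j1, j2):
                to_original[j] = i1
    return normalized, to_original

normalized, mapping = normalized_with_offsets("a\u0133sbeer")  # "aĳsbeer"
print(normalized)  # aijsbeer
# Both "i" and "j" in the output map back to the ligature at offset 1.
print(mapping[1], mapping[2])
```

Such a mapping is what lets routing keys and cursor positions in braille output point back to the correct character in the raw text even after normalization has changed string lengths.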
@LeonarddeR you just made all equations in MS Word accessible. All alphanumeric characters of Unicode seem to be read properly now by synthesizers, no matter where they appear. This is really great work! Is it possible to make this also work when using the left and right arrows to move character by character? Otherwise it is really difficult to explore equations character by character. Example document with an equation and example table with all Unicode alphanumeric characters:

@LeonarddeR I really advise having this enabled by default; this will let users read mathematical content right away in documents such as PDFs or MS Word files where alphanumeric characters are used to build equations. Can you please also adjust the user guide to mention that alphanumeric characters are also included in this normalization? cc: @michaelDCurran what do you think?
Also this does not have a feature flag. Why did you add the additional standard (disabled) value? Could this not be just a checkbox? |
I also think that the normalized character pronunciation is OK; if people want detailed character information, they can use the character information add-on written by @CyrilleB79, for example.
I think phishing emails are recognizable even without turning normalization off. I don't see a real use case for people wanting this setting off, unless they want more information about the character, which they can retrieve via NVDA+dot or by using the character information add-on.
That's a nice side effect really!
This can be done, but it introduces a drawback where you can no longer identify foreign characters with speech when reading character by character. We can make the character by character movement an additional option, but then it gets really messy in the end.
This definitely is a feature flag internally. The behavior is the same as the "interrupt speech while scrolling" option, for example.
However, with normalization the pronunciation is really different from the current language even when you read foreign characters, so it should be comfortable enough to tell that this is a foreign character.
Do you mean cancellable speech? |
Let's give an example: |
Maybe delayed character description could report that a normalization has happened. Just an idea... |
That's already possible by pressing NVDA+dot once, twice, or three times. The formatting of the letters is bold; this is the only thing that might matter in some cases, but for this the Unicode name of the character needs to be reported, which is done by the character info add-on. So I still think it is OK to apply normalization also when moving character by character. Retrieving the whole typographic detail of a character can already be done via other methods, as already said.
Ah, this is a good point, thanks. Yes, in this case we should document it properly in the user guide. However, when normalization is on, we should be able to apply it for character by character navigation as well. Maybe it should not apply to character by character navigation when using the review cursor.
Actually, it is really not of interest to the user whether these characters are normal or not. The P from "please" looks like a P on the screen; it is indeed a MATHEMATICAL BOLD CAPITAL P, and whether its code point is U+1D40F or whatever really doesn't matter when reading content on the go. These details about a character are of a technical nature only.
That's not true at all. It matters because the user needs to know that these are not normal characters, e.g. in the following cases:
Such characters cannot be found like normal ones with NVDA's search (NVDA+F3), nor with other searches, such as in Notepad.
If these characters are used in a file name: for example, if you copy/paste these characters (e.g. "𝐏𝐥𝐞𝐚𝐬𝐞", i.e. "Please" written with math bold characters) into a file name, the file will not appear with the files beginning with a normal "P" but will be sorted in Windows Explorer after the files beginning with "Z".
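The mismatch behind the search problem is easy to demonstrate: NFKC folds the mathematical alphanumerics back to plain ASCII, so a literal search for the normalized text can never match the original characters. A quick sketch:

```python
import unicodedata

# "Please" written with MATHEMATICAL BOLD letters (U+1D40F etc.)
fancy = "\U0001D40F\U0001D425\U0001D41E\U0001D41A\U0001D42C\U0001D41E"

# NFKC folds the mathematical bold letters back to plain ASCII.
print(unicodedata.normalize("NFKC", fancy))  # Please

# The official name of the first character shows what it really is.
print(unicodedata.name(fancy[0]))            # MATHEMATICAL BOLD CAPITAL P

# A literal substring search for the normalized text fails on the original.
print("Please" in fancy)                     # False
```

This is exactly the case where a user who only ever heard the normalized form would type "Please" into a find dialog and get no match.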
IMO, it's really important to keep the option configurable.
And I would even say that the option should not be enabled by default, at least due to the search use case exposed above.
But actually, when I copy mathematical alphanumeric characters into the NVDA find dialog, I can search for them without problems. Maybe this needs to be specified in the user guide as well.
To be clear, the points raised by @CyrilleB79 are valid but are not a problem, so this should not hold back normalization from becoming the default behavior.
For this use case, having the real characters spoken when normalization is off, or when using the review cursor to navigate the file name, is comfortable enough in my opinion. In Windows Explorer, file names are grouped by letter anyway, so you can collapse or expand groups. File names with such mathematical alphanumeric characters are very, very uncommon, and they are all grouped under "others".
I would also be inclined to have normalization on by default, for the reasons @Adriani90 gives. It is weird seeing this as a Default/Enabled/Disabled choice in Speech settings, instead of a simple checkbox like most of the other options there. As to what @CyrilleB79 said about searching: the problem comes up when the user hears "please" and doesn't know there's anything strange about it. So later he searches for it, only to have it not found, and is confused. There has to be some way for the user to know this text was normalized, unless the user intentionally turns that notification off. But having it on by default solves the problem of users who don't even know that this kind of text exists, and so have no idea they might want to turn it on. Read @Adriani90's example with the feature turned off: a user who doesn't know that people write with unusual characters that look different but mean the same as normal characters may think these are just weird graphics or symbols, and have no idea that there is supposed to be meaning there.
I agree with option 1 or 2, but I disagree with option 3, because it would make exploring such texts or equations really inefficient and too verbose from a UX perspective.
Some further thoughts:
1. We could have, in Document Formatting settings, an option to have "Normalized"/"Out of normalized" announced around strings of such text, when reading it.
2. For characters, when moving character by character, we could have an option to either play a short tone or announce "normalized" when reading such characters. For example, it could say "normalized P", and then for the delayed character description, "Mathematical bold capital P".
3. Alternatively, it could just announce the descriptive name (half-normalized?): "Mathematical bold capital P", when reviewing character by character. That would tell the user that this is a P, but would also indicate that it is an unusual character, without the user having to do anything extra to find out, just reading by character.
As a lot of discussion is happening in this issue, let's reopen it for now. #16584 can close it again. |
> @Adriani90
> But actually, when I copy mathematical alphanumeric characters into the NVDA find dialog, I can search for them without problems.

In what application are you searching? Note that find is a TextInfo feature; UIA and the Word object model support native find, whereas the other find functionality uses regex.

> @XLTechie wrote:
> 1. We could have, in Document Formatting settings, an option to have "Normalized"/"Out of normalized" announced around strings of such text, when reading it.

That requires doing the normalization at the TextInfo level, adding extra fields as appropriate. Then the objects are not covered and we still need to handle them separately. I'd rather keep these as speech and braille features.

> 2. For characters, when moving character by character, we could have an option to either play a short tone, or announce "normalized", when reading such characters. For example, it could say "normalized P", and then for the delayed character description, "Mathematical bold capital P".

I like this approach, but the character description should be revisited separately. I believe that we should rely much more on CLDR data for character descriptions than we do now. Furthermore, I'm not sure about the word "normalized" here; it feels a bit too much like a technical term.

I searched in browsers.
@LeonarddeR I did some more tests and I am now even more convinced this should not be optional at all; it should always be enabled:

You can test with this PDF document:

So I think the best way forward would be
The Unicode entity can still be read by pressing NVDA+dot multiple times.
Looks like that URL is incorrect.
I tried this with
Does JAWS read
I'm sorry, but I'm not going to do that for braille by any means. Imagine a braille table that has special definitions for characters that get normalized: a braille reader would lose the ability to distinguish normalized from non-normalized characters.
As said above, I cannot share your conclusion about the behavior of other screen readers. Even then, it's just as possible that you're confusing the behavior of the screen reader with that of the speech synthesizer.
Unfortunately, the CLDR doesn't contain translations of these characters. |
JAWS reads "bold small h, bold small e, bold small l, bold small l" and so on; very verbose. Regarding Narrator, I tested with the Newton equation in MS Word, and it worked. However, there was a strange effect: I think Narrator immediately manipulated the characters in MS Word and displayed the normal characters, because after I turned off Narrator and started NVDA immediately, the equation was written in normal characters and switched back to mathematical alphanumeric characters after some seconds.
Agree.
Not really. JAWS, for example, reports the Unicode name even when navigating character by character, and Narrator reports the normalized characters in the MS Word equation when navigating character by character. It doesn't have anything to do with the synthesizer. I understand your use case for braille, but for speech at least, this definitely should be the default behavior, as @XLTechie also seems to support.
However, reporting the Unicode name during character by character navigation is too verbose, and reporting the Unicode entity makes it impossible to explore the equations, as you can easily reproduce in the PDF document.
I fully support this statement. Given all that was written, it's clear that some people want normalization of text and others don't. Some want normalization when reading by character and others don't. And some may even want either option depending on the use case.
For whatever it's worth, I do agree with @Adriani90 that this should be enabled by default for speech. I have no comment about default for braille.
I also think it is a bad idea to bother with a feature flag. It should be a checkbox in the speech (and braille) settings, like everything else there. Delayed character descriptions were not given a feature flag when they were introduced, though I think that, too, should have been enabled by default, like in other screen readers.
Really, the current behavior in stable is highly undesirable, so I am definitely jumping at this opportunity to put an end to it. Users will welcome this, I believe.
But, for reasons I already mentioned, I don't think users will necessarily know that they want it, which is why it should be enabled by default, IMO.
Though I agree with others that there must be an ability to turn it off, at least at this stage.
The last thing I will say, is that while I do want something to be done for character navigation eventually, the feature as it stands is already a laudable improvement, and a great start to build upon in the future.
I wonder if a 3-choice option could be better:
Option 1 is useful if you hear "please" while "𝐏𝐥𝐞𝐚𝐬𝐞" (i.e. "please" in math characters) is actually written, to avoid typing "please" in the find window and hoping to find it. It is also useful if you want to easily detect misuse of such characters, such as in phishing e-mails.
I agree; after all, this can remain an option.
I would prefer to hear a delayed description with an untranslated Unicode name rather than these very technical Unicode entities. You can always retrieve the Unicode entities by pressing numpad 2. By the way, there seems to be an API which translates Unicode names based on a locale identifier, and it is compliant with BCP 47 and CLDR:
So, in my view, we would still do really well with options 1 and 3, while option 3 could be combined with delayed character description using the translated Unicode names from the API.
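As a point of comparison, the untranslated Unicode names discussed here are already available in Python's standard `unicodedata` module; only translated names would need CLDR-based data. A quick sketch:

```python
import unicodedata

# Official (English-only) Unicode character names from the
# Unicode Character Database.
for char in "\U0001D40F\u0133\u0153":
    print(f"U+{ord(char):04X}  {unicodedata.name(char)}")
# U+1D40F  MATHEMATICAL BOLD CAPITAL P
# U+0133  LATIN SMALL LIGATURE IJ
# U+0153  LATIN SMALL LIGATURE OE
```

These are the names a "half-normalized" delayed character description could announce without any extra data; localizing them is the part that would require CLDR.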
@Adriani90 The module you shared is in Perl, that's unusable for us in this form. |
Now #16584 is merged, I'm going to close this again and open new issues as proposed in #16584 (comment) |
Is your feature request related to a problem? Please describe.
In some cases, text can contain ligature characters that are not provided in a braille table. Alternatively, a speech synthesizer can really struggle with these.
An example is the Dutch ligature ij (the single character ĳ, U+0133), as in ijsbeer (polar bear). The Dutch version of eSpeak is unable to pronounce this word correctly.
An exactly opposite example is á, which is composed of two characters, namely the letter a and the combining acute accent.
Describe the solution you'd like
For both speech and braille, I propose adding the ability to enable Unicode normalization with the NFKC algorithm (Normalization Form Compatibility Composition). This algorithm ensures that most ligatures are properly decomposed before being passed to the synthesizer, while composing decomposed characters like á (the letter a followed by a combining acute accent) into the precomposed form, which is much more common.
Note that while this sounds utterly complex, it is basically adding one line of code:

```python
import unicodedata

processed = unicodedata.normalize("NFKC", original)
```