Add optional unicode normalization before passing strings to speech or braille #16466
Some answers/reactions:
I'm very sorry. I actually performed tests for ij and somehow assumed it also worked for œ, but it turns out it doesn't. For ij it definitely works:
I can reproduce the issue with ijsbeer with eSpeak, OneCore and Vocalizer Expressive.
With the Dutch eSpeak, bœr is pronounced as bor (i.e. it behaves as if only the o were present). Vocalizer and OneCore seem to ignore the ligature completely.
Same applies to Dutch.
The NFC form doesn't apply compatibility decompositions and therefore doesn't touch ligatures. While NFKD does decompose the ligature, it also decomposes á. Therefore, NFKC seems to be the only method that really makes sense. That said, I'd really like to know why œ is left alone while ij is decomposed correctly.
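For anyone who wants to verify this, the behavior of the different normalization forms can be checked with Python's standard `unicodedata` module (a quick sketch; the ligature ij is U+0133 and œ is U+0153):

```python
import unicodedata

lig_ij = "\u0133"    # LATIN SMALL LIGATURE IJ
lig_oe = "\u0153"    # LATIN SMALL LIGATURE OE
a_acute = "a\u0301"  # decomposed: "a" + COMBINING ACUTE ACCENT

# NFC composes but never applies compatibility mappings: ligatures survive.
print(unicodedata.normalize("NFC", lig_ij))         # ĳ (unchanged)

# NFKD applies compatibility mappings, but leaves á decomposed (two code points).
print(len(unicodedata.normalize("NFKD", a_acute)))  # 2

# NFKC decomposes ĳ to "ij" and composes a + combining acute into a single á.
print(unicodedata.normalize("NFKC", lig_ij))        # ij
print(len(unicodedata.normalize("NFKC", a_acute)))  # 1

# œ has no compatibility decomposition in Unicode at all,
# so even NFKC leaves it alone. This explains the observation above.
print(unicodedata.normalize("NFKC", lig_oe))        # œ
```

This confirms the observed asymmetry: ĳ carries a compatibility mapping to "ij" in the Unicode Character Database, while œ deliberately has none.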
This sounds like a good idea. Would it be possible to normalize only once if the user has both speech and braille enabled? While the overhead seems small to me, it wouldn't hurt to try to minimize it.
I think that even when we normalize on the TextInfo level, normalization would happen once for braille and once for speech. |
A question: could the normalization solve the problem of italic or bold Unicode characters that synthesizers can't read?
Do you have examples of this? I'm certainly willing to investigate. In the end, I'm searching for a normalization strategy that works best for everyone.
Hi @LeonarddeR, |
Yes, |
Thus, such a PR is fundamental, in my opinion ⚡
I created an offset converter that seems to do this reliably now. We can add this as an optional feature to speech and braille output. |
Why should this be optional? |
It should be configurable because there are use cases both with the feature on and with it off:
Closes #16466

Summary of the issue: Several speech synthesizers and braille tables are unable to speak or braille some characters, such as ligatures ("ij") or decomposed characters (Latin letters with a modifier to add an acute, diaeresis, etc.). Also, italic or bold Unicode characters can't be spoken or brailled by default.

Description of user facing changes: None by default. If Unicode normalization is enabled for speech, speech output for objects and text navigation is normalized. For braille, normalization is applied to all braille output. Speech applies normalization only for objects and text navigation on purpose: for individual character navigation or text selection, we really want to pass the original character to the synthesizer. If we didn't, Unicode bold and italic characters would be read as their normalized counterparts, which makes it impossible to distinguish them. This problem is less relevant when working with braille.

Description of development approach: Added UnicodeNormalizationOffsetConverter to textUtils, with tests. It stores the normalized version of a given string and, based on diffing, calculates offset mappings between the original and normalized strings. Braille translation processes its output using UnicodeNormalizationOffsetConverter when normalization is on; the several mappings (braille to raw position, raw to braille position) are adjusted to account for normalization. Added normalization to getTextInfoSpeech and getPropertiesSpeech.
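The offset-mapping idea can be sketched with `difflib` from the Python standard library. This is a simplified illustration only, not NVDA's actual UnicodeNormalizationOffsetConverter; the function name and the mapping policy for changed runs are my own assumptions:

```python
import unicodedata
from difflib import SequenceMatcher

def normalized_with_offsets(original: str, form: str = "NFKC"):
    """Return the normalized string plus a map from each offset in the
    normalized string back to an offset in the original string."""
    normalized = unicodedata.normalize(form, original)
    to_original = {}
    matcher = SequenceMatcher(None, original, normalized, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Unchanged run: offsets map one to one.
            for offset in range(j2 - j1):
                to_original[j1 + offset] = i1 + offset
        else:
            # Changed run (e.g. "ĳ" expanded to "ij"): map every
            # normalized offset back to the start of the original run.
            for j in range(j1, j2):
                to_original[j] = i1
    return normalized, to_original

normalized, mapping = normalized_with_offsets("a\u0133sbeer")  # "aĳsbeer"
print(normalized)  # aijsbeer
# Both "i" and "j" in the output map back to the ligature at offset 1.
print(mapping[1], mapping[2])
```

Such a mapping is what lets routing keys and cursor positions in braille output point back to the correct character in the raw text even after normalization has changed string lengths.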
@LeonarddeR you just made all equations in MS Word accessible. All alphanumeric characters of Unicode seem to be read properly now by synthesizers, no matter where they appear. This is really great work! Is it possible to make this also work when using the left and right arrows to move character by character? Otherwise it is really difficult to explore equations character by character. Example document with an equation and example table with all Unicode alphanumeric characters:

@LeonarddeR I really advise having this enabled by default; this will let users read mathematical content right away in documents such as PDFs or MS Word files where alphanumeric characters are used to build equations. Can you please also adjust the user guide to mention that alphanumeric characters are also included in this normalization? cc: @michaelDCurran what do you think?
Also this does not have a feature flag. Why did you add the additional standard (disabled) value? Could this not be just a checkbox? |
I also think that the normalized character pronunciation is OK; if people want detailed character information, they can use the character information add-on written by @CyrilleB79, for example.
I think phishing emails are recognizable even without turning normalization off. I don't see a real use case for people wanting this setting off, unless they want more information about the character, which they can retrieve via NVDA+dot or by using the character information add-on.
That's a nice side effect really!
This can be done, but it introduces a drawback where you can no longer identify foreign characters with speech when reading character by character. We can make the character by character movement an additional option, but then it gets really messy in the end.
This definitely is a feature flag internally. The behavior is the same as the "interrupt speech while scrolling" option, for example.
However, with normalization the pronunciation is really different from the current language even when you read foreign characters, so it should be comfortable enough to tell that this is a foreign character.
Do you mean cancellable speech? |
Let's give an example: |
Maybe delayed character description could report that a normalization has happened. Just an idea... |
That's already possible by pressing NVDA+dot once, twice, or three times. The formatting of the letters is bold; this is the only thing that might matter in some cases, but for this the Unicode name of the character needs to be reported, which is done by the character info add-on. So I still think it is OK to apply normalization also when moving character by character. Retrieving the whole typographic detail of a character can already be done via other methods, as already said.
Ah, this is a good point, thanks. Yes, in this case we should document it properly in the user guide. However, when normalization is on, we should be able to apply it for character by character navigation as well. Maybe it should not apply to character by character navigation when using the review cursor.
Actually, it is really not of interest to the user whether these characters are normal or not. The P from "please" looks like a P on the screen; it is indeed a MATHEMATICAL BOLD CAPITAL P, and whether its code point is U+1D40F or whatever really doesn't matter when reading content on the go. These details about a character are of a technical nature only.
That's not true at all. It matters because the user needs to know that these are not normal characters, e.g. in the following cases:
Such characters cannot be found like normal ones with NVDA's search (NVDA+F3), nor with other searches, such as in Notepad.
If these characters are used in a file name: for example, if you copy/paste these characters (e.g. "𝐏𝐥𝐞𝐚𝐬𝐞", i.e. "Please" written with math bold characters) into a file name, the file will not appear with the files beginning with a normal "P" but will be sorted in Windows Explorer after the files beginning with "Z".
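The mismatch behind the search problem is easy to demonstrate: NFKC folds the mathematical alphanumerics back to plain ASCII, so a literal search for the normalized text can never match the original characters. A quick sketch:

```python
import unicodedata

# "Please" written with MATHEMATICAL BOLD letters (U+1D40F etc.)
fancy = "\U0001D40F\U0001D425\U0001D41E\U0001D41A\U0001D42C\U0001D41E"

# NFKC folds the mathematical bold letters back to plain ASCII.
print(unicodedata.normalize("NFKC", fancy))  # Please

# The official name of the first character shows what it really is.
print(unicodedata.name(fancy[0]))            # MATHEMATICAL BOLD CAPITAL P

# A literal substring search for the normalized text fails on the original.
print("Please" in fancy)                     # False
```

This is exactly the case where a user who only ever heard the normalized form would type "Please" into a find dialog and get no match.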
IMO, it's really important to keep the option configurable.
And I would even say that the option should not be enabled by default, at least due to the search use case exposed above.
But actually, when I copy mathematical alphanumeric characters into the NVDA find dialog, I can search for them without problems. Maybe this needs to be specified in the user guide as well.
To be clear, the points raised by @CyrilleB79 are valid but are not a problem, so this should not hold back normalization from becoming the default behavior.
For this use case, having the real characters spoken when normalization is off, or when using the review cursor to navigate the file name, is comfortable enough in my opinion. In Windows Explorer, file names are grouped by letter anyway, so you can collapse or expand groups. File names with such mathematical alphanumeric characters are very, very uncommon, and they are all grouped under "others".
I would also be inclined to have normalization on by default, for the reasons @Adriani90 gives. It is weird seeing this as a Default/Enabled/Disabled choice in Speech settings, instead of a simple checkbox like most of the other options there. As to what @CyrilleB79 said about searching: the problem comes up when the user hears "please" and doesn't know there's anything strange about it. So later he searches for it, only to have it not found, and is confused. There has to be some way for the user to know this text was normalized, unless the user intentionally turns that notification off. But having it on by default solves the problem of users who don't even know that this kind of text exists, and so have no idea they might want to turn it on. Read @Adriani90's example with the feature turned off: a user who doesn't know that people write with unusual characters that look different but mean the same as normal characters may think these are just weird graphics or symbols, and have no idea that there is supposed to be meaning there.
I agree with option 1 or 2, but I disagree with option 3, because it would make exploring such texts or equations really inefficient and too verbose from a UX perspective.
Some further thoughts:
1. We could have, in Document Formatting settings, an option to have "Normalized"/"Out of normalized" announced around strings of such text, when reading it.
2. For characters, when moving character by character, we could have an option to either play a short tone or announce "normalized" when reading such characters. For example, it could say "normalized P", and then for the delayed character description, "Mathematical bold capital P".
3. Alternatively, it could just announce the descriptive name (half-normalized?): "Mathematical bold capital P", when reviewing character by character. That would tell the user that this is a P, but would also indicate that it is an unusual character, without the user having to do anything extra to find out, just reading by character.
As a lot of discussion is happening in this issue, let's reopen it for now. #16584 can close it again. |
> @Adriani90
> But actually, when I copy mathematical alphanumeric characters into the NVDA find dialog, I can search for them without problems.

In what application are you searching? Note that find is a TextInfo feature; UIA and the Word object model support native find, whereas the other find functionality uses regex.

> @XLTechie wrote:
> 1. We could have, in Document Formatting settings, an option to have "Normalized"/"Out of normalized" announced around strings of such text, when reading it.

That requires doing the normalization at the TextInfo level, adding extra fields as appropriate. Then the objects are not covered and we still need to handle them separately. I'd rather keep these as speech and braille features.

> 2. For characters, when moving character by character, we could have an option to either play a short tone, or announce "normalized", when reading such characters. For example, it could say "normalized P", and then for the delayed character description, "Mathematical bold capital P".

I like this approach, but the character description should be revisited separately. I believe that we should rely much more on CLDR data for character descriptions than we do now. Furthermore, I'm not sure about the word "normalized" here; it feels a bit too much like a technical term.

I searched in browsers.
@LeonarddeR I did some more tests and I am now even more convinced this should not be optional at all; it should always be enabled:

You can test with this PDF document:

So I think the best way forward would be
The Unicode entity can still be read by pressing NVDA+dot multiple times.
Looks like that URL is incorrect.
I tried this with
Does JAWS read
I'm sorry, but I'm not going to do that for braille by any means. Imagine a braille table that has special definitions for characters that get normalized: a braille reader would lose the ability to distinguish normalized from non-normalized characters.
As said above, I cannot share your conclusion about the behavior of other screen readers. Even then, it's just as possible that you're confusing the behavior of the screen reader with that of the speech synthesizer.
Unfortunately, the CLDR doesn't contain translations of these characters. |
JAWS reads "bold small h, bold small e, bold small l, bold small l" and so on; very verbose. Regarding Narrator, I tested with the Newton equation in MS Word, and it worked. However, there was a strange effect: I think Narrator immediately manipulated the characters in MS Word and displayed the normal characters, because after I turned off Narrator and started NVDA immediately, the equation was written in normal characters and switched back to mathematical alphanumeric characters after some seconds.
Agree.
Not really. JAWS, for example, reports the Unicode name even when navigating character by character, and Narrator reports the normalized characters in the MS Word equation when navigating character by character. It doesn't have anything to do with the synthesizer. I understand your use case for braille, but for speech at least, this definitely should be the default behavior, as @XLTechie also seems to support.
However, reporting the Unicode name during character by character navigation is too verbose, and reporting the Unicode entity makes it impossible to explore the equations, as you can easily reproduce in the PDF document.
I fully support this statement. Given all that was written, it's clear that some people want normalization of text and others don't. Some want normalization when reading by character and others don't. And some may even want either option depending on the use case.
For whatever it's worth, I do agree with @Adriani90 that this should be enabled by default for speech. I have no comment about default for braille.
I also think it is a bad idea to bother with a feature flag. It should be a checkbox in the speech (and braille) settings, like everything else there. Delayed character descriptions were not given a feature flag when they were introduced, though I think that, too, should have been enabled by default, like in other screen readers.
Really, the current behavior in stable is highly undesirable, so I am definitely jumping at this opportunity to put an end to it. Users will welcome this, I believe.
But, for reasons I already mentioned, I don't think users will necessarily know that they want it, which is why it should be enabled by default, IMO.
Though I agree with others that there must be an ability to turn it off, at least at this stage.
The last thing I will say, is that while I do want something to be done for character navigation eventually, the feature as it stands is already a laudable improvement, and a great start to build upon in the future.
I wonder if a 3-choice option could be better:
Option 1 is useful if you hear "please" while "𝐏𝐥𝐞𝐚𝐬𝐞" (i.e. "please" in math characters) is actually written, to avoid typing "please" in the find window and hoping to find it. It is also useful if you want to easily detect misuse of such characters, such as in phishing e-mails.
I agree; after all, this can remain an option.
I would prefer to hear a delayed description with an untranslated Unicode name rather than these very technical Unicode entities. You can always retrieve the Unicode entities by pressing numpad 2. By the way, there seems to be an API which translates Unicode names based on a locale identifier, and it is compliant with BCP 47 and CLDR:
So, in my view, we would still do really well with options 1 and 3, while option 3 could be combined with delayed character description using the translated Unicode names from the API.
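As a point of comparison, the untranslated Unicode names discussed here are already available in Python's standard `unicodedata` module; only translated names would need CLDR-based data. A quick sketch:

```python
import unicodedata

# Official (English-only) Unicode character names from the
# Unicode Character Database.
for char in "\U0001D40F\u0133\u0153":
    print(f"U+{ord(char):04X}  {unicodedata.name(char)}")
# U+1D40F  MATHEMATICAL BOLD CAPITAL P
# U+0133  LATIN SMALL LIGATURE IJ
# U+0153  LATIN SMALL LIGATURE OE
```

These are the names a "half-normalized" delayed character description could announce without any extra data; localizing them is the part that would require CLDR.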
@Adriani90 The module you shared is in Perl, that's unusable for us in this form. |
Now #16584 is merged, I'm going to close this again and open new issues as proposed in #16584 (comment) |
Is your feature request related to a problem? Please describe.
In some cases, text can contain ligature characters that are not provided in a braille table. Alternatively, a speech synthesizer can really struggle with these.
An example is the Dutch ligature ij (the single character ĳ, U+0133), as in ijsbeer (polar bear). The Dutch version of eSpeak is unable to pronounce this word correctly.
An exactly opposite example is á, which is composed of two characters, namely the letter a and the combining acute accent.
Describe the solution you'd like
For both speech and braille, I propose adding the ability to enable Unicode normalization with the NFKC algorithm (Normalization Form Compatibility Composition). This algorithm ensures that most ligatures are properly decomposed before being passed to the synthesizer, while composing decomposed characters like á (the letter a followed by a combining acute accent) into the precomposed form, which is much more common.
Note that while this sounds utterly complex, it is basically adding one line of code:

```python
import unicodedata

processed = unicodedata.normalize("NFKC", original)
```