TextInfo move by codepoint characters function #16219

mltony · 2024-02-24T01:44:58Z

Link to issue number:

This function is needed for both #8518 and #16050.

Summary of the issue:

Suppose we have a TextInfo that represents a paragraph of text:

> s = paragraphInfo.text
> s
'Hello, world!\r'

Suppose that we would like to put the cursor at the first letter of the word 'world'.
That means jumping to index 7:

> s[7:]
'world!\r'

The problem is that calling paragraphInfo.move(UNIT_CHARACTER, 7, "start") is not guaranteed to achieve the desired effect. There are two main reasons for that:

- In wide character encoding, some 4-byte Unicode characters are represented as two surrogate characters, whereas in a Python string they are represented by a single character.
- In non-offset TextInfos (e.g. UIATextInfo) there is no guarantee that TextInfo.move(UNIT_CHARACTER, 1) actually moves by exactly 1 character. A good illustration is Microsoft Word with UIA always enabled, where the first character of a bullet list item is represented by three pythonic characters:
  - Bullet character "•"
  - Tab character \t
  - The first character of the list item per se.

The third problem with TextInfo.move(UNIT_CHARACTER) is its performance in some implementations. In particular, moving by 10000 characters in Notepad++ takes over a second on a reasonably modern PC. I might not need to move by 10000 characters in my upcoming PRs, but I will certainly need to move by a few thousand: for sentence navigation I need to move within a paragraph, and large paragraphs in typical texts can easily be a few thousand characters long. I need to find both the beginning and the end textInfos, and if each operation takes, say, 200ms, then we'd be wasting almost half a second just moving by characters. Since there were previous concerns about sentence navigation not being fast enough, I would like to introduce this efficient implementation.
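To make the first reason concrete, here is a small standalone sketch (plain Python, no NVDA imports) showing how a code point offset in a Python string differs from the corresponding UTF-16 code unit offset once an emoji precedes the target. The helper name utf16_offset is made up for this illustration and is not part of the PR.

```python
def utf16_offset(text: str, codepoint_offset: int) -> int:
    """Count the UTF-16 code units that precede the given code point offset.

    Hypothetical helper for illustration only; not an NVDA function.
    """
    # Each UTF-16 code unit is 2 bytes; code points above U+FFFF
    # are encoded as a surrogate pair, i.e. two code units.
    return len(text[:codepoint_offset].encode("utf-16-le")) // 2


s = "😂 Hello, world!\r"
index = s.index("world")  # offset counted in code points (Python str indexing)

# The same position expressed in UTF-16 code units is one larger,
# because the emoji occupies two code units.
print(index, utf16_offset(s, index))
```

An offset taken from the Python string therefore cannot be passed unchanged to an API that counts UTF-16 code units.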

Here is how this can be done efficiently using this PR:

> info = paragraphInfo.moveToPythonicOffset(7)
> info.setEndPoint(paragraphInfo, "endToEnd")
> info.text
'world!\r'

Description of user facing changes

N/A

Description of development approach

For the general case, I implemented a binary-search-like algorithm, which is explained in great detail in the code. Please see the moveToPythonicOffset function in textInfos\__init__.py.
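As a rough intuition only, and not the algorithm from this PR, the idea of binary searching for a code point offset can be sketched on a toy model. The assumption here is that a range can be moved by whole "units" and that we can cheaply read how many code points the first k units cover. The names units, prefix_len and find_unit_count are all hypothetical.

```python
from typing import Callable


def find_unit_count(
    prefix_len: Callable[[int], int], total_units: int, target: int
) -> int:
    """Return the smallest number of unit moves covering at least `target` code points.

    `prefix_len(k)` must be non-decreasing in k; that is what makes
    binary search applicable.
    """
    lo, hi = 0, total_units
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_len(mid) < target:
            lo = mid + 1  # too few code points covered, move further
        else:
            hi = mid  # enough covered, try fewer moves
    return lo


# Toy data: the first "character" of a bullet item spans three code points,
# similar to the Microsoft Word example in the summary.
units = ["•\tF", "i", "r", "s", "t"]


def prefix_len(k: int) -> int:
    # Number of code points covered by the first k units.
    return len("".join(units[:k]))
```

With this toy data, reaching code point offset 4 needs only 2 unit moves rather than 4, which is why counting moves and counting code points cannot be used interchangeably.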
I provided optimized implementations for OffsetsTextInfo and CompoundTextInfo.
I refactored textUtils.py to follow an OOP style. I implemented UTF8OffsetConverter and a dummy IdentityOffsetConverter, as well as their abstract base class and a function getOffsetConverter that selects the correct converter based on the encoding. I renamed a couple of methods of WideStringOffsetConverter to remove the word wide, since I would now like to use similar methods for the UTF8 converter, which has nothing to do with wide strings.
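The shape of such a converter hierarchy could look roughly like the sketch below. This is a simplified illustration under my own assumptions, not the code in textUtils.py: the class names carry a Sketch suffix, and the method str_to_encoded and the function get_converter_class are invented for this example.

```python
import codecs
from abc import ABC, abstractmethod


class OffsetConverterSketch(ABC):
    """Hypothetical base class: maps str offsets to encoded offsets."""

    def __init__(self, text: str):
        self.text = text

    @abstractmethod
    def str_to_encoded(self, offset: int) -> int:
        """Convert a code point offset into an offset in the encoded form."""


class Utf8Sketch(OffsetConverterSketch):
    def str_to_encoded(self, offset: int) -> int:
        # UTF-8 offsets are counted in bytes.
        return len(self.text[:offset].encode("utf-8"))


class Utf16Sketch(OffsetConverterSketch):
    def str_to_encoded(self, offset: int) -> int:
        # UTF-16 offsets are counted in 2-byte code units.
        return len(self.text[:offset].encode("utf-16-le")) // 2


class IdentitySketch(OffsetConverterSketch):
    def str_to_encoded(self, offset: int) -> int:
        # UTF-32 uses one unit per code point, so offsets are unchanged.
        return offset


def get_converter_class(encoding: str) -> type:
    """Pick a converter class from an encoding name (hypothetical selector)."""
    name = codecs.lookup(encoding).name  # normalizes e.g. "utf_16_le" to "utf-16-le"
    table = {"utf-8": Utf8Sketch, "utf-16-le": Utf16Sketch, "utf-32-le": IdentitySketch}
    return table[name]
```

For the string "a😂b", code point offset 2 maps to 5 in UTF-8, 3 in UTF-16 and 2 in UTF-32.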

Testing strategy:

Unit tests
Tested in Notepad in presence of tricky 😂 characters (UIATextInfo).
Tested in Notepad++ in presence of 😂 characters (OffsetsTextInfo, UTF-8 encoding).
Tested in VSCode in presence of 😂 characters (CompoundTextInfo, containing OffsetsTextInfo with UTF-16 encoding).
Tested in Microsoft Word with bullet lists (UIATextInfo).
Tested in Chrome browse mode in presence of 😂 characters (OffsetsTextInfo, UTF-16 encoding).
Repeated above tests with Chinese, Arabic and Cyrillic characters.
Tested navigation by word using review cursor in browsers - this code path uses WideStringOffsetConverter - just to make sure my refactoring didn't break this class.

Known issues with pull request:

N/A

Code Review Checklist:

Documentation:
- Change log entry
- User Documentation
- Developer / Technical Documentation
- Context sensitive help for GUI changes
Testing:
- Unit tests
- System (end to end) tests
- Manual testing
UX of all users considered:
- Speech
- Braille
- Low Vision
- Different web browsers
- Localization in other languages / culture than English
API is compatible with existing add-ons.
Security precautions taken.

mltony · 2024-02-24T01:55:44Z

Also:

I haven't implemented unit tests yet. Given the complexity of the binary search algorithm, I want to implement them before merging. But first I would like to see whether NVAccess is generally aligned with this approach before spending more time on it.
I anticipate being asked "why add a new function, why can't we add a new unit to the existing move function?". Let me try to address this proactively. Here are a few reasons why I didn't go this way:
- All use cases I have in mind need to find a certain character within an already existing paragraph textInfo. If we want to jump to a certain character of a paragraph, that implies we already have the paragraph TextInfo and we're sure that the desired character lies somewhere inside it. It makes sense to use this readily available information instead of trying to perform a move on an arbitrary and possibly collapsed textInfo.
- As I mentioned before, performance is a concern. If we wanted to extend the existing find() function, we'd first need some sort of auto-expanding loop that tries to figure out the enclosing textInfo containing the desired offset, and only then call the existing algorithm to actually find that offset. That seems wasteful to me, and as I mentioned before, in any plausible use case we already have the enclosing TextInfo available.

AppVeyorBot · 2024-02-24T02:20:27Z

PASS: Translation comments check.
FAIL: Unit tests. See test results for more information.
PASS: Lint check.
PASS: System tests (tags: installer NVDA).
Build (for testing PR): https://ci.appveyor.com/api/buildjobs/5dwtfce8f7kafajo/artifacts/output/nvda_snapshot_pr16219-31160,18fc288e.exe
CI timing (mins):
INIT 0.0,
INSTALL_START 1.0,
INSTALL_END 1.3,
BUILD_START 0.0,
BUILD_END 29.2,
TESTSETUP_START 0.0,
TESTSETUP_END 0.3,
TEST_START 0.0,
TEST_END 2.2,
FINISH_END 0.2

See test results for failed build of commit 18fc288efd

Adriani90 · 2024-02-24T11:30:21Z

@mltony thanks for this. Very interesting. Can this be used also for character by character or word by word navigation? If yes, then this would probably open up a good way to solve also #11908, #2649, #13712, or #4431.

Also, does this work well in your testing with compound characters in Asian or Arabic languages?

mltony · 2024-02-24T22:42:45Z

@Adriani90,
This won't help with double reading bullet points in MSWord. This is primarily for my PRs in progress: sentence navigation and style navigation.
Not sure what you mean by compound characters, but just checked with Chinese, Arabic and Cyrillic characters - works as expected.

AppVeyorBot · 2024-02-24T23:20:09Z

PASS: Translation comments check.
PASS: Unit tests.
PASS: Lint check.
FAIL: System tests (tags: installer NVDA). See test results for more information.
Build (for testing PR): https://ci.appveyor.com/api/buildjobs/ja2abuje3l6xge6v/artifacts/output/nvda_snapshot_pr16219-31181,5b3ed7e7.exe
CI timing (mins):
INIT 0.0,
INSTALL_START 1.0,
INSTALL_END 0.9,
BUILD_START 0.0,
BUILD_END 28.7,
TESTSETUP_START 0.0,
TESTSETUP_END 0.4,
TEST_START 0.0,
TEST_END 2.3,
FINISH_END 0.2

See test results for failed build of commit 5b3ed7e7d6

LeonarddeR · 2024-02-26T12:14:59Z

I really like this approach, but I'm not sure about the name pythonic.
According to the Python docs:

Strings are immutable sequences of Unicode code points.

So maybe code point offset is better.

mltony · 2024-02-26T22:49:29Z

@LeonarddeR, I would be happy to rename it to something more intuitive. But I'm not sure that code point offset is more intuitive: I bet the term code point will require people to look up its meaning, whereas the term Pythonic is at least familiar to those devs who are aware of the difference between the offset schemes.

mltony · 2024-03-01T20:40:31Z

@seanbudd, could you take a look at this one?
This is blocking #16050 and my WIP PR for #8518.

AppVeyorBot · 2024-03-05T22:45:55Z

PASS: Translation comments check.
PASS: Unit tests.
FAIL: Lint check. See test results for more information.
PASS: System tests (tags: installer NVDA).
Build (for testing PR): https://ci.appveyor.com/api/buildjobs/u11udwpgh8yiey91/artifacts/output/nvda_snapshot_pr16219-31294,432c1678.exe
CI timing (mins):
INIT 0.0,
INSTALL_START 1.0,
INSTALL_END 0.9,
BUILD_START 0.0,
BUILD_END 30.2,
TESTSETUP_START 0.0,
TESTSETUP_END 0.3,
TEST_START 0.0,
TEST_END 2.4,
FINISH_END 0.2

See test results for failed build of commit 432c167854

mltony · 2024-03-11T18:42:11Z

@michaelDCurran, wondering if you can take a look at this PR? You gave the green light for sentence navigation, and this one is apparently required to make sentence navigation work properly.

michaelDCurran · 2024-03-11T21:35:58Z

@mltony I agree with @LeonarddeR re the name. Pythonic is way too general... this is all "Python". Please rename it to moveToCodepointOffset.
I have looked at all code except textUtils.py so far. Looks pretty good to me. Please go ahead and start writing unit tests where you can.

mltony · 2024-03-19T18:51:38Z

@michaelDCurran, I addressed your comments, please have another look.

Renamed to moveToCodepointOffset
Added unit tests

CyrilleB79 · 2024-03-19T22:14:37Z

@mltony could you also update the title and the initial description of this PR?
Pythonic offset -> codepoint offset
Thanks.

mltony · 2024-04-07T22:58:16Z

@michaelDCurran, @seanbudd, kindly ping - can either of you review this? This PR is blocking sentence navigation PR and style navigation PR. My leave is coming to an end soon and I really wanted to contribute these two PRs to NVDA before I go back to full time work.

michaelDCurran · 2024-04-07T23:46:09Z

This PR removes / renames public symbols in textUtils.py? Can you state how you have handled add-on compatibility? Assuming we can guarantee compatibility with 2024.1, I'm happy to approve.

mltony · 2024-04-08T05:25:17Z

In textUtils.py lines 236..238 I declare old function names and just assign new function names to them. Just like aliases. So any add-on calling these functions won't be affected. That only applies to WideStringOffsetConverter class - since that's the only class in this file before this PR.
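The aliasing pattern described here can be illustrated with a minimal sketch. The class and method names below are invented for the example and are not the actual names in textUtils.py. The point is only that the old name is bound to the same function object as the new one.

```python
class ConverterWithAliases:
    """Hypothetical converter showing how old method names can be kept as aliases."""

    def __init__(self, text: str):
        self.text = text

    def str_to_encoded_offset(self, offset: int) -> int:
        # New, encoding-neutral name; counts UTF-16 code units here.
        return len(self.text[:offset].encode("utf-16-le")) // 2

    # Old "wide"-flavoured name kept so existing callers keep working.
    # This is a plain assignment in the class body, not a wrapper.
    str_to_wide_offset = str_to_encoded_offset


conv = ConverterWithAliases("a😂b")
# Both names dispatch to the same implementation.
print(conv.str_to_wide_offset(2) == conv.str_to_encoded_offset(2))
```

Because the alias is the same function object, any later fix to the new method automatically applies to callers that still use the old name.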

TextInfo move by pythonic characters function

d89dc3f

lukaszgo1 reviewed Feb 24, 2024

mltony and others added 6 commits February 24, 2024 13:42

Adding back old function names for compatibility

d0e4ff8

Fix unit test

c15d741

doc

8c98a88

Update source/textInfos/__init__.py

860e02c

Co-authored-by: Łukasz Golonka <[email protected]>

Update source/textInfos/__init__.py

fa35715

Co-authored-by: Łukasz Golonka <[email protected]>

Update source/textInfos/__init__.py

fc80dec

Co-authored-by: Łukasz Golonka <[email protected]>

mltony marked this pull request as ready for review February 24, 2024 22:24

mltony requested a review from a team as a code owner February 24, 2024 22:24

mltony requested a review from seanbudd February 24, 2024 22:24

minor fix

3fdcfe0

seanbudd requested a review from michaelDCurran March 5, 2024 01:22

Modernizing legacy code

68880c5

lint

1a088de

seanbudd marked this pull request as draft March 12, 2024 06:28

seanbudd added the conceptApproved label Mar 12, 2024

mltony added 2 commits March 16, 2024 10:50

renaming

4ad7678

Unit tests

4f6a131

Merge branch 'master' into offsets

3ccaeab

mltony marked this pull request as ready for review March 19, 2024 18:54

mltony changed the title ~~TextInfo move by pythonic characters function~~ TextInfo move by codepoint characters function Mar 19, 2024

michaelDCurran approved these changes Apr 8, 2024

michaelDCurran merged commit 2238cd9 into nvaccess:master Apr 8, 2024
1 check passed

nvaccessAuto added this to the 2024.2 milestone Apr 8, 2024

This was referenced May 2, 2024

Add optional unicode normalization before passing strings to speech or braille #16466

Closed

Braille: Variation Selectors break cursor positions #10960

Open

Adriani90 mentioned this pull request May 17, 2024

Speak typed words based on TextInfo if possible #8110

Closed
