Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TextInfo move by codepoint characters function #16219

Merged
merged 13 commits into from
Apr 8, 2024

Conversation

mltony
Copy link
Contributor

@mltony mltony commented Feb 24, 2024

Link to issue number:

This function is needed for both #8518 and #16050.

Summary of the issue:

Suppose we have TextInfo that represents a paragraph of text:

> s = paragraphInfo.text
> s
'Hello, world!\r'

Suppose that we would like to put the cursor at the first letter of the word 'world'.
That means jumping to index 7:

> s[7:]
'world!\r'

The problem is that calling paragraphInfo.move(UNIT_CHARACTER, 7, "start") is not guaranteed to achieve desired effect. There are two main reasons for that:

  1. In Wide character encoding, some 4-byte unicode characters are represented as two surrogate characters, whereas in pythonic string they would be represented by a single character.
  2. In non-offset TextInfos (e.g. UIATextInfo) there is no guarantee on the fact that TextInfos.move(UNIT_CHARACTER, 1)would actually move by exactly 1 character. A good illustration of this is in Microsoft Word with UIA enabled always, the first character of a bullet list item would be represented by three pythonic characters:
    • Bullet character "•"
    • Tab character \t
    • And the first character of of list item per se.
  3. The third problem of TextInfo.move(UNIT_CHARACTER) function is its performance in some implementations. In particular, moving by 10000 characters in Notepad++ takes over a second on a reasonably modern PC. I might not need to move by 10000 characters in my upcoming PRs, but I will need to move by a few thousands for sure since for sentence navigation I would need to move within a paragraph and some large paragraphs in typical texts can easily be few thousands characters. I need to find both beginning and end textInfos, and if each operation takes say 200ms, then we'd be wasting almost half a second on just moving by characters. Since there were previous concerns about sentence navigation being not fast enough, II would like to introduce this efficient implementation.

Here is how this can be done efficiently using this PR:

> info = paragraphInfo.moveToPythonicOffset(7)
> info.setEndPoint(paragraphInfo, "endToEnd")
> info.text
'world!\r'

Description of user facing changes

N/A

Description of development approach

  1. For general case, I implemented binary-search-like algorithm. I explained it in great detail in the code. Please see def moveToPythonicOffset function in textInfos\__init__.py.
  2. I provided optimized implementations for OffsetsTextInfo and CompoundTextInfo.
  3. I refactored textUtils.py making it conformant to OOP style. I implemented UTF8OffsetConverter and dummy IdentityOffsetConverter as well as their abstract base class and a function getOffsetConverter that selects correct converter based on encoding. I renamed a couple of methods of WideStringOffsetConverter in order to remove the word wide - as now I would like to use similar methods for UTF8 converter, and it has nothing to do with wide strings.

Testing strategy:

  1. Unit tests
  2. Tested in Notepad in presence of tricky 😂 characters (UIATextInfo).
  3. Tested in Notepad++ in presence of 😂 characters (OffsetsTextInfo, UTF-8 encoding).
  4. Tested in VSCode, in presence of 😂 characters (CompoundTextInfo, containing OffsetTextInfo with UTF-16 encoding).
  5. Tested in Microsoft Word with bullet lists (UIATextInfo).
  6. Tested in Chrome browse mode in presence of 😂 characters (OffsetsTextInfo,UTF-16 encoding).
  7. Repeated above tests with Chinese, Arabic and Cyrillic characters.
  8. Tested navigation by word using review cursor in browsers - this code path uses WideStringOffsetConverter - just to make sure my refactoring didn't break this class.

Known issues with pull request:

N/A

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.

@mltony
Copy link
Contributor Author

mltony commented Feb 24, 2024

Also:

  1. I haven't implemented unit tests yet. Given the complexity of binary search algorithm, I would want to implement them before merging. But I just want to see first if NVAccess is in general aligned with this approach before spending more time on this.
  2. I anticipate to be asked "why add a new function - why we can't add a new unit to existing move function". Let me try to address this proactively - here are a few reasons why I didn't go this way:
    • All use cases I have in mind would need to find a certain character within already existing paragraph textInfo. If we want to jump to a certain character of a paragraph, that implies we already have paragraph TextInfo and we're sure that the desired character lies somewhere inside. It would make sense to use this information since it's readily available instead of trying to perform move on an arbitrary and possibly collapsed textInfo.
    • As I mentioned before - performance is a concern. If we want to extend existing find() function, we'd have to have some sort of autoexpanding loop first, that first tries to figure out what would be enclosing textInfo that contains desired offset, and then call existing algorithm to actually find that offset. That seems wasteful to me and as I mentioned before, in any plausible use case we already have enclosing TextInfo available.

@AppVeyorBot
Copy link

See test results for failed build of commit 18fc288efd

@Adriani90
Copy link
Collaborator

@mltony thanks for this. Very interesting. Can this be used also for character by character or word by word navigation? If yes, then this would probably open up a good way to solve also #11908, #2649, #13712, or #4431.

Also does this in your testing workwwell with compund characters in Asian or Arabic languages?

source/textInfos/__init__.py Outdated Show resolved Hide resolved
source/textInfos/__init__.py Outdated Show resolved Hide resolved
source/textInfos/__init__.py Outdated Show resolved Hide resolved
source/textInfos/__init__.py Outdated Show resolved Hide resolved
source/textUtils.py Outdated Show resolved Hide resolved
source/textUtils.py Outdated Show resolved Hide resolved
source/textUtils.py Outdated Show resolved Hide resolved
source/textUtils.py Show resolved Hide resolved
@mltony mltony marked this pull request as ready for review February 24, 2024 22:24
@mltony mltony requested a review from a team as a code owner February 24, 2024 22:24
@mltony mltony requested a review from seanbudd February 24, 2024 22:24
@mltony
Copy link
Contributor Author

mltony commented Feb 24, 2024

@Adriani90,
This won't help with double reading bullet points in MSWord. This is primarily for my PRs in progress: sentence navigation and style navigation.
Not sure what you mean by compound characters, but just checked with Chinese, Arabic and Cyrillic characters - works as expected.

@AppVeyorBot
Copy link

See test results for failed build of commit 5b3ed7e7d6

@LeonarddeR
Copy link
Collaborator

I really like this approach, but I'm not sure about the name pythonic.
According to the Python docs:

Strings are immutable sequences of Unicode code points.

So may be code point offset is better.

@mltony
Copy link
Contributor Author

mltony commented Feb 26, 2024

@LeonarddeR, Would be happy to rename it to something else more intuitive. But not sure that code point offset is more intuitive - I bet the term code point will definitely require people to look up the meaning of - whereas the term Pythonic is at least familiar to those devs who are aware of the difference between different offset schemes.

@mltony
Copy link
Contributor Author

mltony commented Mar 1, 2024

@seanbudd, could you take a look at this one?
This is blocking #16050 and and my WIP PR for #8518.

@AppVeyorBot
Copy link

See test results for failed build of commit 432c167854

@mltony
Copy link
Contributor Author

mltony commented Mar 11, 2024

@michaelDCurran, wondering if you can take a look at this PR? You gave green light for sentence navigation - this one is apparently required to make sentence navigation to work properly.

@michaelDCurran
Copy link
Member

@mltony I agree with @LeonarddeR re the name. Pythonic is way too general... this is all "Python". Please rename it to moveToCodepointOffset.
I have looked at all code except textUtils.py so far. Looks pretty good to me. Please go ahead and start writing unit tests where you can.

@seanbudd seanbudd marked this pull request as draft March 12, 2024 06:28
@seanbudd seanbudd added the conceptApproved Similar 'triaged' for issues, PR accepted in theory, implementation needs review. label Mar 12, 2024
@mltony
Copy link
Contributor Author

mltony commented Mar 19, 2024

@michaelDCurran, I addressed your comments, please have another look.

  • Renamed to moveToCodepointOffset
  • Added unit tests

@mltony mltony marked this pull request as ready for review March 19, 2024 18:54
@CyrilleB79
Copy link
Collaborator

@mltony could you also update the title and the initial description of this PR?
Pythonic ofsset -> codepoint offset
Thanks.

@mltony mltony changed the title TextInfo move by pythonic characters function TextInfo move by codepoint characters function Mar 19, 2024
@mltony
Copy link
Contributor Author

mltony commented Apr 7, 2024

@michaelDCurran, @seanbudd, kindly ping - can either of you review this? This PR is blocking sentence navigation PR and style navigation PR. My leave is coming to an end soon and I really wanted to contribute these two PRs to NVDA before I go back to full time work.

@michaelDCurran
Copy link
Member

This pr removes / renames public symbols in textUtils.py? Can you state how you have handled add-on compatibility? Assuming we can guarantee compatibility for with 2024.1, then I'm happy to approve.

@mltony
Copy link
Contributor Author

mltony commented Apr 8, 2024

In textUtils.py lines 236..238 I declare old function names and just assign new function names to them. Just like aliases. So any add-on calling these functions won't be affected. That only applies to WideStringOffsetConverter class - since that's the only class in this file before this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
conceptApproved Similar 'triaged' for issues, PR accepted in theory, implementation needs review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants