-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lib/fs: Add ASCII fastpath for normalization #9365
Draft
bt90
wants to merge
18
commits into
main
Choose a base branch
from
ascii_fastpath
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from 16 commits
Commits
Show all changes
18 commits
Select commit
Hold shift + click to select a range
8f744d8
Add ascii fastpath
bt90 18ca1f6
Use WriteByte for single byte runes
bt90 d358a61
Use strings.Map
bt90 79a74db
lower ascii fastpath
bt90 97a209c
Test ascii detection
bt90 3008c6f
Extract test cases
bt90 28002a7
Add ascii vs unicode benchmark
bt90 b2a1dd2
lower ascii noop
bt90 3b8f86b
Rename testcases
bt90 8a9806e
upper ascii fastpath
bt90 90e76b1
WIP singlepass
bt90 de792df
wip extract functions
bt90 946cdbb
early continue
bt90 80a7236
skip ascii aware ToX
bt90 ff0c88c
Add Latin1 fastpath
bt90 ad5c4ae
avoid utf8 import
bt90 dcdf29d
Add more benchmark cases
bt90 37cb464
Rename testcases
bt90 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given unicode_lowercase is the slow bench result with the new implementatin, maybe it would be worth testing keeping that behaviour from the old code. I.e. when encountering non-ascii, keep checking if there's any upper-case chars (and then exit). So you can then take the lower-case-unicode shortcut from before. This will make the unicode-mixed-case slower, but I wouldn't be surprised if not by a lot compared to the gains for unicode-lower.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the point I don't really understand. golangs
strings.Map()
has that shortcut built into it:https://cs.opensource.google/go/go/+/refs/tags/go1.21.6:src/strings/strings.go;l=489-517
My previous (flawed?) mixedcase unicode benchmark case should have stopped at the very first byte in the ASCII detection. The rest of the code looks pretty much the same to me? Map the runes and if a case change is detected, assign a string builder with the unchanged part and write mapped runes after that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completely glossed over the usage of strings.Map xD
That's still one or two more loops. If you did the unicode lower detection in our initial loop that does ascii detection, and then not use
strings.Map
, we get rid of part of a loop in both the unicode cases.Now that also makes me wonder if we could wrap
isASCII
andtoLowerAscii
into a mapping function forstrings.Map
for less code branches/complexity, and equal, maybe even slightly better performance :)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also we are now mixing two classes of optimisations: Improving things about the unicode path, and adding an ascii shortcut. Maybe worth doing that separately for optimal results. Just an idle, optional suggestion though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ASCII path iterates over single bytes which is fast but doesn't work as soon as any unicode characters are involved. Iterating over runes is a tad slower but necessary in that case.
So you either end up with two loops or one rune based loop which is slower for ASCII inputs.
I also tried to get rid of
lower(upper(r))
as far as possible as there are only 23(!) characters in unicode that satisfylower(r) != lower(upper(r))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found it, but apparently the type isn't exported by the unicode package which renders the SpecialCase unusable for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naive code generation attempt:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the exceptions are that few it certainly seems valid to pre-generate the table and then use that plus regular ToLower in order to save on the more complex operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, the unexported
d
is a bummer. Wonder what the thinking there is.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I played a bit with
unicode.RangeTable
but couldn't improve the performance: