
Tokenize Quotes #8918

Open · NickWingate wants to merge 2 commits into master from private/nickwingate/tokenize-quotes
Conversation

NickWingate
Contributor

Summary

When ignoreQuotes = false, delimiters within quotes (' or ") are not treated as token boundaries.
This fixes problems that arise especially because space is the default delimiter: if the input string contains any key="string" pattern token whose quoted value includes a space, the token gets split incorrectly.

For #8103
The uno command was being incorrectly tokenized because space is the default delimiter:

uno .uno:ExecuteSearch {"SearchItem.SearchString":{"type":"string","value":"  "}, ...
                                                                           ^^^^
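To make the failure mode concrete, here is a minimal, self-contained stand-in (illustrative only; this naive splitter is not the project's actual tokenize implementation):

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Naive whitespace splitter standing in for a quote-unaware tokenizer.
static std::vector<std::string> naiveTokenize(const std::string& s)
{
    std::vector<std::string> tokens;
    std::istringstream iss(s);
    std::string token;
    while (iss >> token) // splits on spaces, even inside quoted values
        tokens.push_back(token);
    return tokens;
}

int main()
{
    const std::string cmd =
        R"(uno .uno:ExecuteSearch {"SearchItem.SearchString":{"type":"string","value":"  "}})";
    // The quoted two-space search value is torn apart, corrupting the JSON payload.
    for (const auto& t : naiveTokenize(cmd))
        std::cout << '[' << t << "]\n";
}
```

The quoted two-space value ends one token and starts another, so the JSON argument can no longer be reassembled reliably.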

TODO

  • ...

Checklist

  • Code is properly formatted
  • All commits have Change-Id
  • I have run tests with make check
  • I have issued make run and manually verified that everything looks okay
  • Documentation (manuals or wiki) has been updated or is not required

@pedropintosilva pedropintosilva requested review from gokaysatir and Ashod and removed request for gokaysatir April 29, 2024 08:15
Contributor

@Ashod Ashod left a comment


Thanks, @NickWingate. This is a good bug to fix.

I worry about the efficiency of the code, though. Given how frequently the tokenizers are called, we really can't afford to have an algorithm that's not linear. I left some notes and tips, but please feel free to reach out if you need further help. Re-writing these functions to be optimal might not be trivial, so I suggest splitting them, since it seems we only care about supporting quotes in exactly one case. That single case can be less efficient until we improve the implementation. I just don't want to make all the command-processing code slower because of that single case.

tokenize_outside_quotes tokenises a given string,
ignoring delimiters that are in quotes.

Previously tokenize() caused problems, especially as space
is the default delimiter: if the input string has any
key="str ing" token, the embedded space can cause errors.

Signed-off-by: NickWingate <[email protected]>
Change-Id: I54578646a19a1cfea18c9f73c6b25d46c61ed20f
@NickWingate NickWingate force-pushed the private/nickwingate/tokenize-quotes branch from 24130fe to c48adf3 on May 2, 2024 at 12:23
Unocommand was being incorrectly tokenized as
space is the default delimiter:

uno .uno:ExecuteSearch {"SearchItem.SearchString":{"type":"string","value":"  "}, ...
                                                                           ^^^^
was being incorrectly parsed.

Signed-off-by: NickWingate <[email protected]>
Change-Id: I572f4eff0f23bf9d5c275fe0acad74e9633a9b9d
@NickWingate NickWingate force-pushed the private/nickwingate/tokenize-quotes branch from c48adf3 to eda6720 on May 2, 2024 at 12:27
@NickWingate
Contributor Author

Made a new function tokenize_outside_quotes for handling cases where the data may have quotes.
Also changed to a new algorithm as suggested: go character by character and, when a quote is found, stop looking for the delimiter until the quote is closed. When the delimiter is a single character this is simple; when the delimiter is a string we use s.substr(i, len) == delimiter to compare.

Additionally, tokenize_outside_quotes should handle escaped quotes, e.g. "apost\\'phe don\\'t" has a space between escaped single quotes and should be tokenized.
Currently this is tokenized into apost\'phe and don\'t, but perhaps we should remove the backslash? -> (apost'phe, don't)
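For illustration, a minimal sketch of the algorithm described above, assuming a single-character delimiter and treating a backslash as escaping the character after it; the details are illustrative rather than the exact code in this PR:

```cpp
#include <string>
#include <vector>

// Sketch: split s on delimiter, but not inside ' or " quotes.
// A backslash escapes the next character, so \' and \" do not
// open or close a quoted region. Single pass, linear time.
std::vector<std::string> tokenize_outside_quotes(const std::string& s,
                                                 const char delimiter = ' ')
{
    std::vector<std::string> tokens;
    std::string current;
    char openQuote = '\0'; // '\0' means we are outside any quotes

    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char c = s[i];
        if (c == '\\' && i + 1 < s.size())
        {
            // Keep the escape and the escaped character verbatim.
            current += c;
            current += s[++i];
        }
        else if (openQuote == '\0' && (c == '"' || c == '\''))
        {
            openQuote = c; // entering a quoted region
            current += c;
        }
        else if (c == openQuote)
        {
            openQuote = '\0'; // leaving the quoted region
            current += c;
        }
        else if (openQuote == '\0' && c == delimiter)
        {
            if (!current.empty())
            {
                tokens.push_back(current);
                current.clear();
            }
        }
        else
        {
            current += c;
        }
    }

    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}
```

For the string-delimiter overloads, the delimiter test becomes a substring comparison, e.g. s.compare(i, delimiter.size(), delimiter) == 0 (equivalent to the s.substr(i, len) == delimiter check above but without the temporary copy), advancing i past the delimiter on a match.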

@Ashod
Contributor

Ashod commented May 25, 2024

Made a new function tokenize_outside_quotes for handling cases where the data may have quotes. Also changed to a new algorithm as suggested: go character by character and, when a quote is found, stop looking for the delimiter until the quote is closed. When the delimiter is a single character this is simple; when the delimiter is a string we use s.substr(i, len) == delimiter to compare.

Thank you, @NickWingate, for the update.

I went through the code, but before commenting on it, I'm wondering why there is no test with the reported issue of searching for "  " (two spaces). Shouldn't one expect that, if this resolves the reported issue (assuming the issue is related to a bug in the tokenizer), a test would include a reproducer of said issue?

Also, I don't understand why we have implemented so many variations of the tokenizer when it would seem that we only use one. Perhaps I missed something, but I don't see the code using all these variations.

Finally, if we have to implement a tokenizer that takes quotes into account, why do we care to implement a special version for when there are no quotes? Is that an optimization? Have we measured the three versions?

Thank you.

@NickWingate
Contributor Author

I went through the code, but before commenting on it, I'm wondering why there is no test with the reported issue of searching for "  " (two spaces). Shouldn't one expect that, if this resolves the reported issue (assuming the issue is related to a bug in the tokenizer), a test would include a reproducer of said issue?

Yes, good point. I will add the test.
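For what it's worth, a reproducer along these lines might look like the following (hypothetical test, reusing the tokenize_outside_quotes sketch above; the real test would use the project's test harness rather than assert):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Declaration of the quote-aware tokenizer sketched earlier.
std::vector<std::string> tokenize_outside_quotes(const std::string& s,
                                                 char delimiter = ' ');

int main()
{
    // Reproducer for the reported issue: searching for "  " (two spaces).
    const std::string cmd =
        R"(uno .uno:ExecuteSearch {"SearchItem.SearchString":{"type":"string","value":"  "}})";

    const auto tokens = tokenize_outside_quotes(cmd, ' ');

    // The quoted two-space value must survive inside a single token.
    assert(tokens.size() == 3);
    assert(tokens[2].find(R"("value":"  ")") != std::string::npos);
    return 0;
}
```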

Also, I don't understand why we have implemented so many variations of the tokenizer when it would seem that we only use one. Perhaps I missed something, but I don't see the code using all these variations.

No, I just created further overloads of tokenize_outside_quotes mirroring the overloads the original tokenize already has, so that someone who hits a similar issue while using a string delimiter instead of a char delimiter can use this function without rewriting tokenize. It seemed sensible at the time, but if you think this is just bloat I can remove it.

Finally, if we have to implement a tokenizer that takes quotes into account, why do we care to implement a special version for when there are no quotes? Is that an optimization? Have we measured the three versions?

I did this as suggested:

Re-writing these functions to be optimal might not be trivial, so I suggest splitting them, since it seems we only care about supporting quotes in exactly one case

In each tokenize_outside_quotes overload we check whether the delimiter is, or contains, a quote, and in that case we fall back to the normal tokenize. I thought this was the easiest solution for now, but if you have a better idea I can implement that instead?
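Roughly, the guard in question could take this shape (illustrative; tokenize below is only a stand-in declaration for the project's existing plain tokenizer):

```cpp
#include <string>
#include <vector>

// Stand-in declaration for the project's existing plain tokenizer.
std::vector<std::string> tokenize(const std::string& s, const std::string& delimiter);

std::vector<std::string> tokenize_outside_quotes(const std::string& s,
                                                 const std::string& delimiter)
{
    // If the delimiter itself is, or contains, a quote, quote-awareness
    // is ambiguous, so defer to the plain tokenizer.
    if (delimiter.find('"') != std::string::npos ||
        delimiter.find('\'') != std::string::npos)
        return tokenize(s, delimiter);

    // ... otherwise run the quote-aware scan sketched earlier ...
    return tokenize(s, delimiter); // placeholder for the quote-aware path
}
```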

Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues:

Double Space in "Find and Replace" = Searches for Single Space (#8103)