
Tokenize Quotes #8918

Open · NickWingate wants to merge 2 commits into master from private/nickwingate/tokenize-quotes
Conversation

NickWingate
Contributor

Summary

When ignoreQuotes = false, delimiters within quotes (' or ") are not treated as token boundaries.
This fixes problems that arise especially because space is the default delimiter: if the input string contains any key="string" pattern token whose quoted value includes a space, the token gets split incorrectly.

For #8103
The uno command was being incorrectly tokenized because space is the default delimiter:

uno .uno:ExecuteSearch {"SearchItem.SearchString":{"type":"string","value":"  "}, ...
                                                                           ^^^^
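To make the failure mode concrete, here is a minimal, self-contained stand-in (illustrative only; this naive splitter is not the project's actual tokenize implementation):

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Naive whitespace splitter standing in for a quote-unaware tokenizer.
static std::vector<std::string> naiveTokenize(const std::string& s)
{
    std::vector<std::string> tokens;
    std::istringstream iss(s);
    std::string token;
    while (iss >> token) // splits on spaces, even inside quoted values
        tokens.push_back(token);
    return tokens;
}

int main()
{
    const std::string cmd =
        R"(uno .uno:ExecuteSearch {"SearchItem.SearchString":{"type":"string","value":"  "}})";
    // The quoted two-space search value is torn apart, corrupting the JSON payload.
    for (const auto& t : naiveTokenize(cmd))
        std::cout << '[' << t << "]\n";
}
```

The quoted two-space value ends one token and starts another, so the JSON argument can no longer be reassembled reliably.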

TODO

  • ...

Checklist

  • Code is properly formatted
  • All commits have Change-Id
  • I have run tests with make check
  • I have issued make run and manually verified that everything looks okay
  • Documentation (manuals or wiki) has been updated or is not required

@pedropintosilva pedropintosilva requested review from gokaysatir and Ashod and removed request for gokaysatir April 29, 2024 08:15
Contributor

@Ashod Ashod left a comment


Thanks, @NickWingate. This is a good bug to fix.

I worry about the efficiency of the code, though. Given how frequently the tokenizers are called, we really can't afford to have an algorithm that's not linear. I left some notes and tips, but please feel free to reach out if you need further help. Re-writing these functions to be optimal might not be trivial, so I suggest splitting them, since it seems we only care about supporting quotes in exactly one case. That single case can be less efficient until we improve the implementation. I just don't want to make all the command-processing code slower because of that single case.

tokenize_outside_quotes tokenises a given string,
ignoring delimiters that are in quotes.

Previously tokenize() caused problems, especially as space
is the default delimiter: if the input string has any
key="str ing" token, the embedded space can cause errors.

Signed-off-by: NickWingate <[email protected]>
Change-Id: I54578646a19a1cfea18c9f73c6b25d46c61ed20f
@NickWingate NickWingate force-pushed the private/nickwingate/tokenize-quotes branch from 24130fe to c48adf3 on May 2, 2024 at 12:23
Unocommand was being incorrectly tokenized as
space is the default delimiter:

uno .uno:ExecuteSearch {"SearchItem.SearchString":{"type":"string","value":"  "}, ...
                                                                           ^^^^
was being incorrectly parsed.

Signed-off-by: NickWingate <[email protected]>
Change-Id: I572f4eff0f23bf9d5c275fe0acad74e9633a9b9d
@NickWingate NickWingate force-pushed the private/nickwingate/tokenize-quotes branch from c48adf3 to eda6720 on May 2, 2024 at 12:27
@NickWingate
Contributor Author

Made a new function tokenize_outside_quotes for handling cases where the data may have quotes.
Also changed to a new algorithm as suggested: go character by character and, when a quote is found, stop looking for the delimiter until the quote is closed. When the delimiter is a single character this is simple; when the delimiter is a string we use s.substr(i, len) == delimiter to compare.

Additionally, tokenize_outside_quotes should handle escaped quotes, e.g. "apost\\'phe don\\'t" has a space between escaped single quotes and should be tokenized.
Currently this is tokenized into apost\'phe and don\'t, but perhaps we should remove the backslash? -> (apost'phe, don't)
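For illustration, a minimal sketch of the algorithm described above, assuming a single-character delimiter and treating a backslash as escaping the character after it; the details are illustrative rather than the exact code in this PR:

```cpp
#include <string>
#include <vector>

// Sketch: split s on delimiter, but not inside ' or " quotes.
// A backslash escapes the next character, so \' and \" do not
// open or close a quoted region. Single pass, linear time.
std::vector<std::string> tokenize_outside_quotes(const std::string& s,
                                                 const char delimiter = ' ')
{
    std::vector<std::string> tokens;
    std::string current;
    char openQuote = '\0'; // '\0' means we are outside any quotes

    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char c = s[i];
        if (c == '\\' && i + 1 < s.size())
        {
            // Keep the escape and the escaped character verbatim.
            current += c;
            current += s[++i];
        }
        else if (openQuote == '\0' && (c == '"' || c == '\''))
        {
            openQuote = c; // entering a quoted region
            current += c;
        }
        else if (c == openQuote)
        {
            openQuote = '\0'; // leaving the quoted region
            current += c;
        }
        else if (openQuote == '\0' && c == delimiter)
        {
            if (!current.empty())
            {
                tokens.push_back(current);
                current.clear();
            }
        }
        else
        {
            current += c;
        }
    }

    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}
```

For the string-delimiter overloads, the delimiter test becomes a substring comparison, e.g. s.compare(i, delimiter.size(), delimiter) == 0 (equivalent to the s.substr(i, len) == delimiter check above but without the temporary copy), advancing i past the delimiter on a match.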

@Ashod
Contributor

Ashod commented May 25, 2024

Made a new function tokenize_outside_quotes for handling cases where the data may have quotes. Also changed to a new algorithm as suggested: go character by character and, when a quote is found, stop looking for the delimiter until the quote is closed. When the delimiter is a single character this is simple; when the delimiter is a string we use s.substr(i, len) == delimiter to compare.

Thank you, @NickWingate, for the update.

I went through the code, but before commenting on it, I'm wondering why there is no test with the reported issue of searching for "  " (two spaces). Shouldn't one expect that, if this resolves the reported issue (assuming the issue is related to a bug in the tokenizer), a test would include a reproducer of said issue?

Also, I don't understand why we have implemented so many variations of the tokenizer when it would seem that we only use one. Perhaps I missed something, but I don't see the code using all these variations.

Finally, if we have to implement a tokenizer that takes quotes into account, why do we care to implement a special version for when there are no quotes? Is that an optimization? Have we measured the three versions?

Thank you.

@NickWingate
Contributor Author

I went through the code, but before commenting on it, I'm wondering why there is no test with the reported issue of searching for "  " (two spaces). Shouldn't one expect that, if this resolves the reported issue (assuming the issue is related to a bug in the tokenizer), a test would include a reproducer of said issue?

Yes, good point. I will add the test.
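For what it's worth, a reproducer along these lines might look like the following (hypothetical test, reusing the tokenize_outside_quotes sketch above; the real test would use the project's test harness rather than assert):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Declaration of the quote-aware tokenizer sketched earlier.
std::vector<std::string> tokenize_outside_quotes(const std::string& s,
                                                 char delimiter = ' ');

int main()
{
    // Reproducer for the reported issue: searching for "  " (two spaces).
    const std::string cmd =
        R"(uno .uno:ExecuteSearch {"SearchItem.SearchString":{"type":"string","value":"  "}})";

    const auto tokens = tokenize_outside_quotes(cmd, ' ');

    // The quoted two-space value must survive inside a single token.
    assert(tokens.size() == 3);
    assert(tokens[2].find(R"("value":"  ")") != std::string::npos);
    return 0;
}
```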

Also, I don't understand why we have implemented so many variations of the tokenizer when it would seem that we only use one. Perhaps I missed something, but I don't see the code using all these variations.

No, I just created further overloads of tokenize_outside_quotes mirroring the overloads the original tokenize already has, so that someone who hits a similar issue while using a string delimiter instead of a char delimiter can use this function without rewriting tokenize. It seemed sensible at the time, but if you think this is just bloat I can remove it.

Finally, if we have to implement a tokenizer that takes quotes into account, why do we care to implement a special version for when there are no quotes? Is that an optimization? Have we measured the three versions?

I did this as suggested:

Re-writing these functions to be optimal might not be trivial, so I suggest splitting them, since it seems we only care about supporting quotes in exactly one case

In each tokenize_outside_quotes overload we check whether the delimiter is, or contains, a quote, and in that case we fall back to the normal tokenize. I thought this was the easiest solution for now, but if you have a better idea I can implement that instead?
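Roughly, the guard in question could take this shape (illustrative; tokenize below is only a stand-in declaration for the project's existing plain tokenizer):

```cpp
#include <string>
#include <vector>

// Stand-in declaration for the project's existing plain tokenizer.
std::vector<std::string> tokenize(const std::string& s, const std::string& delimiter);

std::vector<std::string> tokenize_outside_quotes(const std::string& s,
                                                 const std::string& delimiter)
{
    // If the delimiter itself is, or contains, a quote, quote-awareness
    // is ambiguous, so defer to the plain tokenizer.
    if (delimiter.find('"') != std::string::npos ||
        delimiter.find('\'') != std::string::npos)
        return tokenize(s, delimiter);

    // ... otherwise run the quote-aware scan sketched earlier ...
    return tokenize(s, delimiter); // placeholder for the quote-aware path
}
```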

Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues:

Double Space in "Find and Replace" = Searches for Single Space (#8103)