Test major functions with the Big List of Naughty Strings #726

Open
bact opened this issue Oct 11, 2022 · 4 comments
Labels
enhancement enhance functionalities Hacktoberfest for Hacktoberfest event help wanted no contributor yet

Comments

@bact
Member

bact commented Oct 11, 2022

Detailed description

Add tests that use strings from the Big List of Naughty Strings to check the robustness of the library.

Context

The Big List of Naughty Strings is "an evolving list of strings which have a high probability of causing issues when used as user-input data." One example is a string containing a zero-width space (U+200B).

As a text processing library that has to deal with strings of all sorts, both from user input and from data archives, the library is expected to be robust enough to handle a wide variety of character combinations.

Possible implementation

  • Run major functions through the string list
  • As a first step, aim for the library not to break and to return the correct output type in a reasonable time.
  • Correctness is not the goal for now (as there's no definition of correctness yet).
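
The steps above can be sketched as a small harness. This is only an illustration under stated assumptions: `run_robustness_check`, `toy_tokenize`, `SAMPLE_STRINGS` and the one-second `time_budget` are hypothetical names and values, not part of PyThaiNLP. In the real test suite the function under test would presumably be something like `pythainlp.tokenize.word_tokenize`, and the strings would come from blns.txt.

```python
import time


def run_robustness_check(func, strings, expected_type, time_budget=1.0):
    """Call func on every string and collect (string, reason) failures.

    A failure is an exception, a wrong return type, or a call that
    takes longer than time_budget seconds. Correctness of the output
    is deliberately not checked.
    """
    failures = []
    for s in strings:
        start = time.perf_counter()
        try:
            result = func(s)
        except Exception as exc:  # any crash counts as a failure
            failures.append((s, f"raised {type(exc).__name__}: {exc}"))
            continue
        elapsed = time.perf_counter() - start
        if not isinstance(result, expected_type):
            failures.append((s, f"returned {type(result).__name__}"))
        elif elapsed > time_budget:
            failures.append((s, f"took {elapsed:.2f}s"))
    return failures


def toy_tokenize(text):
    """Stand-in for a real tokenizer (assumption: returns a list of str)."""
    return text.split()


# A few hand-picked strings in the spirit of the list (not taken from blns.txt).
SAMPLE_STRINGS = ["", "undefined", "\u200b", "กขค\u200bงจ", "👩🏽", "\ufeff"]

failures = run_robustness_check(toy_tokenize, SAMPLE_STRINGS, list)
```

Returning the failures instead of asserting inside the loop means a single run reports every problematic string, which is convenient when the list has several hundred entries.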
@bact bact added the enhancement enhance functionalities label Oct 11, 2022
@bact bact added this to the Future milestone Oct 11, 2022
@bact bact added help wanted no contributor yet Hacktoberfest for Hacktoberfest event labels Oct 4, 2023
@bact bact added this to To do in PyThaiNLP Oct 4, 2023
@pavaris-pm
Contributor

pavaris-pm commented Oct 9, 2023

@bact I think making the library robust enough to handle a variety of character combinations is a great idea! Do you mean adding strings from the Big List of Naughty Strings as additional test cases in the pythainlp/tests/ directory, covering all related functions that may break when they encounter naughty strings, such as the tokenize or translate engines?

@bact
Member Author

bact commented Oct 9, 2023

@pavaris-pm Exactly. We can start with tokenize tests.

@pavaris-pm
Contributor

pavaris-pm commented Oct 9, 2023

> @pavaris-pm Exactly. We can start with tokenize tests.

Cool! I will try that first. Do we need to test with all the naughty strings in that repo, or is a sample enough? The Big List of Naughty Strings repo has a blns.txt file that is divided into categories such as reversed strings, special characters, etc. Given that, we could choose about 2-3 strings from each category, so that the library is tested against every category. What do you think? @bact
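
A rough sketch of the per-category sampling idea follows. It assumes that each category in blns.txt starts with a block of `#` comment lines whose first non-empty line is the category title, and that the test strings follow one per line. `group_by_category`, `sample_per_category` and the `SAMPLE` text are hypothetical and only illustrate the approach.

```python
import random

# Hypothetical excerpt written in the assumed blns.txt layout.
SAMPLE = "\n".join([
    "#\tReserved Strings",
    "#",
    "#\tStrings which may be used elsewhere in code",
    "",
    "undefined",
    "undef",
    "null",
    "",
    "#\tEmoji",
    "#",
    "#\tStrings which contain Emoji",
    "",
    "😍",
    "👩🏽",
])


def group_by_category(text):
    """Map each category title to the list of strings listed under it."""
    groups = {}
    current = None
    in_comment_block = False
    # Split on "\n" only: str.splitlines() would also split on characters
    # such as U+2028, which may appear inside a test string.
    for line in text.split("\n"):
        if line.startswith("#"):
            title = line[1:].strip()
            if not in_comment_block and title:
                # First non-empty comment line of a block is the title.
                current = title
                groups.setdefault(current, [])
                in_comment_block = True
        else:
            in_comment_block = False
            if line and current is not None:
                groups[current].append(line)
    return groups


def sample_per_category(groups, k=2, seed=0):
    """Pick up to k strings per category, reproducibly for a given seed."""
    rng = random.Random(seed)
    return {name: rng.sample(items, min(k, len(items)))
            for name, items in groups.items()}


groups = group_by_category(SAMPLE)
picked = sample_per_category(groups, k=2)
```

A fixed seed keeps the chosen strings stable between runs, so a failing test can be reproduced.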

@bact
Member Author

bact commented Oct 12, 2023

blns.txt contains 496 test strings (counted with egrep -cv '#|^\s*$' blns.txt). Maybe that is not too many to test them all?
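
For reference, here is a minimal Python sketch of that count, together with a slightly different loader. Note that the egrep pattern excludes every line that contains `#` anywhere, as well as whitespace-only lines, so it may also drop genuine test strings. The loader below keeps those lines, so its count may differ from the egrep figure. `count_like_egrep`, `load_strings` and `SAMPLE` are hypothetical names; the sample text only imitates the file layout.

```python
import re

# Hypothetical excerpt in the layout of blns.txt: "#" comment lines,
# blank separator lines, and one test string per remaining line.
SAMPLE = "\n".join([
    "#\tReserved Strings",
    "#",
    "",
    "undefined",
    "null",
    "",
    "#\tSpecial Characters",
    "",
    ",./;'[]\\-=",
    "!@#$%^&*()",
    "",
])

# Same pattern as in the egrep command above.
_SKIP = re.compile(r"#|^\s*$")


def count_like_egrep(text):
    """Count lines that do NOT match the pattern, like `egrep -cv`."""
    return sum(1 for line in text.split("\n") if not _SKIP.search(line))


def load_strings(text):
    """Keep every non-empty line that does not start with '#'."""
    # Split on "\n" only, so that unusual line-break characters inside
    # a test string are left untouched.
    return [line for line in text.split("\n")
            if line and not line.startswith("#")]


egrep_count = count_like_egrep(SAMPLE)
strings = load_strings(SAMPLE)
```

In this sample the string `!@#$%^&*()` is skipped by the egrep-style count but kept by the loader, which is the difference described above.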

But if the tests take too long to run (which may affect productivity), we can focus on the relevant categories.

I would say these categories are more relevant:

Group 1: (non-)whitespace and control characters - they occur a lot, and sometimes our regular expressions may not cover them well:

#       Special Characters
# ASCII punctuation. 
# Non-whitespace C0 controls:
# Non-whitespace C1 controls:
# Whitespace:
# Unicode additional control characters:
# "Byte order marks",
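
A small illustration of why this group matters: some characters that look like whitespace are not matched by `\s` in Python's `re` module, nor by `str.isspace()`. The sketch below uses only the standard library; `SAMPLES` and `describe` are hypothetical names introduced for this example.

```python
import re
import unicodedata

# Characters chosen for illustration, keyed by their Unicode names.
SAMPLES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "IDEOGRAPHIC SPACE": "\u3000",
    "ZERO WIDTH SPACE": "\u200b",
    "BYTE ORDER MARK": "\ufeff",
}


def describe(ch):
    """Report how Python classifies a single character."""
    return {
        "category": unicodedata.category(ch),
        "isspace": ch.isspace(),
        "regex_s": re.fullmatch(r"\s", ch) is not None,
    }


report = {name: describe(ch) for name, ch in SAMPLES.items()}

# A zero-width space does not split a string under str.split().
tokens = "a\u200bb".split()
```

The zero-width space and the byte order mark are format characters (category Cf), so code that relies on `\s` or `str.split()` to find word boundaries leaves them inside the text.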

Group 2: string-length related. Careless string manipulation may break some of these strings. For this group, I think the expected behavior in testing is that, for any function f(), len(str) should equal len(f(str)). But I may be wrong.

#       Unicode Symbols
#	Two-Byte Characters
#	Strings which contain two-byte letters: 
#	Special Unicode Characters Union
#	Changing length when lowercased
#	Japanese Emoticons
#	Emoji
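
The length expectation for Group 2 can be sketched as below. `preserves_length` and `tokens_cover_input` are hypothetical helpers. The second one is an assumed variant for functions that return a list of tokens, where the number of tokens is not expected to match the number of characters. The built-in `str.lower` and `str.upper` serve as examples of functions that do not always preserve length, which matches the "Changing length when lowercased" category.

```python
def preserves_length(func, text):
    """True if a string-to-string function keeps the length unchanged."""
    return len(func(text)) == len(text)


def tokens_cover_input(tokenize, text):
    """True if the tokens, joined back together, reproduce the input.

    Assumed variant of the length check for tokenizers: no character
    should be lost, added, or changed by tokenization.
    """
    return "".join(tokenize(text)) == text


# Plain ASCII keeps its length when lowercased.
ascii_ok = preserves_length(str.lower, "ABC")

# U+0130 (capital I with dot above) lowercases to two code points.
dotted_i_ok = preserves_length(str.lower, "\u0130")

# U+00DF (sharp s) uppercases to "SS".
sharp_s_ok = preserves_length(str.upper, "\u00df")

# list() acts as a toy character-level tokenizer here.
cover_ok = tokens_cover_input(list, "กข😍\u200b")
```

The two case-mapping examples suggest that the length rule is too strict for some functions, so a test suite may need to list which functions are expected to satisfy it.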
