chore: add word count regression test #777

Closed

LucasXu0 (Collaborator)

No description provided.

LucasXu0 (Collaborator, Author) commented Apr 23, 2024

@Xazin @richardshiue I added a regression test for the word count regex, but I have not yet found a regex that handles all of the test texts accurately. I have tried '\b\w+\b' and '\S+'.

With '\b\w+\b', every test passes except the multiple languages test. With '\S+', the 'Hello@world#today' test fails.
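To make the trade-off concrete, here is a minimal Python sketch that runs both candidate patterns over the 'Hello@world#today' text and over a mixed-language sample. The mixed-language sample and the helper name `count_matches` are hypothetical, and `re.ASCII` is an assumption used to mimic engines where `\w` only matches ASCII letters, digits, and underscore; the actual test texts in the PR may differ.

```python
import re

# Assumption: \w and \b are ASCII-only, as in ECMAScript-style regex engines.
# Python's \w is Unicode-aware by default, so re.ASCII is needed to mimic that.
ASCII_WORDS = re.compile(r"\b\w+\b", re.ASCII)
NON_SPACE_RUNS = re.compile(r"\S+")


def count_matches(pattern, text):
    """Return how many non-overlapping matches the pattern finds in text."""
    return len(pattern.findall(text))


# Hypothetical samples: the second one is not taken from the PR's test file.
samples = ["Hello@world#today", "Hello 你好 مرحبا"]

for text in samples:
    print(
        repr(text),
        "ascii words:", count_matches(ASCII_WORDS, text),
        "non-space runs:", count_matches(NON_SPACE_RUNS, text),
    )
```

Under these assumptions, the ASCII word pattern splits 'Hello@world#today' into three words but sees only one word in the mixed-language sample, while '\S+' counts three runs in the mixed-language sample and treats 'Hello@world#today' as a single word.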

richardshiue (Contributor) commented Apr 23, 2024

I think it would be better to revert to the regex that guarantees accurate behavior for Latin glyphs, or to use the one you wrote: \b\w+\b. For other glyphs, it would be better to use a more powerful tool than regex.

Xazin (Collaborator) commented Apr 23, 2024

What is the best behavior?

Content creators don't write words like something!something. If they ever did write something similar, it would be a link, an email address, and so on, and in those cases it would be more reliable to count it as one word.

In my opinion, the current regex is the most accurate one at the moment.

I intend to build a library that will detect the language, or be given a language, and then use the most appropriate pattern. But I am not sure it is a priority.

It depends heavily on the percentage of our users who need word count to behave in an unorthodox manner, and I don't believe many do.

If we can find a simple regex to satisfy the basic needs of the user base, then that would be the best.

But I also wasn't able to find a reliable one that suits English/Latin, Arabic, and Chinese/Japanese/Korean.

Xazin (Collaborator) commented Apr 23, 2024

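For Arabic (these ranges cover the Arabic block, U+0600 to U+06FF, minus its digits; Cyrillic, U+0400 to U+04FF, is not included):
/[A-Za-z\u0600-\u065F\u066A-\u06EF\u06FA-\u06FF]+/g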

For Latin:
/\b\w+\b/ug should do it. It counts a-b as two words, and likewise we're.
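As a rough check of the two patterns above, the following Python sketch applies them to a few short strings. The helper `words` and the sample strings are hypothetical. Two assumptions are made: `re.ASCII` stands in for an engine where `\w` is ASCII-only, and the Latin range is written as `A-Za-z`, because a literal `A-z` range would also match the punctuation characters that sit between `Z` and `a` in ASCII.

```python
import re

# Assumption: ASCII-only \w, mimicking an ECMAScript-style engine.
LATIN_WORDS = re.compile(r"\b\w+\b", re.ASCII)

# Arabic letters (digits U+0660-0669 and U+06F0-06F9 excluded) plus Latin letters.
# Assumption: A-Za-z is used instead of A-z to avoid matching [ \ ] ^ _ `.
ARABIC_OR_LATIN = re.compile(r"[A-Za-z\u0600-\u065F\u066A-\u06EF\u06FA-\u06FF]+")


def words(pattern, text):
    """Return the list of matches so the split points are visible."""
    return pattern.findall(text)


print(words(LATIN_WORDS, "a-b"))        # hyphen acts as a word boundary
print(words(LATIN_WORDS, "we're"))      # apostrophe acts as a word boundary
print(words(ARABIC_OR_LATIN, "مرحبا بالعالم"))  # two Arabic words
```

With these inputs the Latin pattern yields two matches for both "a-b" and "we're", and the Arabic pattern yields two matches for the two-word Arabic sample.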

For Chinese, I assume one way to do it is to count the characters and multiply by a factor. The same goes for Korean and Thai.

Factor CH: 0.7
Factor KO: 0.5
Factor Thai: 0.25
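The factor approach could be sketched in Python roughly as follows. The factors are the ones listed above; everything else is an assumption made for illustration: the function name `estimate_words`, the language keys, and the Unicode ranges chosen to decide which characters count (CJK Unified Ideographs, Hangul syllables, and the Thai block). The result is left as a float so the caller can decide how to round.

```python
import re

# Factors as proposed in the comment above.
FACTORS = {"zh": 0.7, "ko": 0.5, "th": 0.25}

# Assumption: these Unicode ranges decide which characters are counted.
SCRIPT_CHARS = {
    "zh": re.compile(r"[\u4e00-\u9fff]"),  # CJK Unified Ideographs
    "ko": re.compile(r"[\uac00-\ud7a3]"),  # Hangul syllables
    "th": re.compile(r"[\u0e00-\u0e7f]"),  # Thai block, including vowel marks
}


def estimate_words(text, lang):
    """Estimate a word count as (characters in the script) * (factor)."""
    char_count = len(SCRIPT_CHARS[lang].findall(text))
    return char_count * FACTORS[lang]


print(estimate_words("你好世界", "zh"))   # 4 characters * 0.7
print(estimate_words("안녕하세요", "ko"))  # 5 characters * 0.5
```

Spaces, punctuation, and characters from other scripts are ignored by this sketch, so mixed-language text would still need one of the regex patterns for its Latin or Arabic parts.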
