chore: add word count regression test #777

LucasXu0 · 2024-04-23T01:56:53Z

No description provided.

LucasXu0 · 2024-04-23T01:58:58Z

@Xazin @richardshiue I added a regression test for the word count regex, but I have not yet found a regex that handles all of the test texts accurately. I have tried '\b\w+\b' and '\S+'.

With '\b\w+\b', every test passes except the multiple languages test; with '\S+', the 'Hello@world#today' test fails.
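
To make the trade-off concrete, here is a minimal Python sketch of the two patterns. It is not the project's test code. `re.ASCII` is an assumption used to approximate an engine whose `\w` is ASCII-only, and the mixed-script sample `Hello 世界` is a hypothetical stand-in for the multiple languages test, whose actual text is not quoted in this thread.

```python
import re

# Assumption: re.ASCII restricts \w (and \b) to ASCII letters, digits and underscore.
WORD_BOUNDARY = re.compile(r"\b\w+\b", re.ASCII)
NON_SPACE = re.compile(r"\S+")


def count(pattern: re.Pattern, text: str) -> int:
    """Return the number of non-overlapping matches of pattern in text."""
    return len(pattern.findall(text))


# "Hello@world#today" is from the thread; "Hello 世界" is a hypothetical sample.
for sample in ["Hello@world#today", "Hello 世界"]:
    print(sample, count(WORD_BOUNDARY, sample), count(NON_SPACE, sample))
```

Under these assumptions, `\b\w+\b` splits `Hello@world#today` into three matches and misses the non-Latin run entirely, while `\S+` treats each whitespace-delimited run as one match.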

richardshiue · 2024-04-23T02:34:12Z

I think it would be better to revert to the one that guarantees accurate behavior for Latin glyphs, or to use the one you wrote: \b\w+\b. For other glyphs, it would be better to use a more powerful tool than regex.

Xazin · 2024-04-23T09:26:46Z

What is the best behavior?

Content creators don't use words like something!something. If they ever did write something similar, it would be a link, an email address, and so on, and in those cases it would be more reliable to count it as one word.

The current regex is the most accurate in my opinion at the moment.

I intend to build a library that will detect the language, or be given a language, and then use the most appropriate pattern. But I am not sure it is a priority.

It highly depends on the percentage of our users that need word count in an unorthodox manner, which I don't believe many do.

If we can find a simple regex to satisfy the basic needs of the user base, then that would be the best.

But I also wasn't able to find a reliable one that suits English/Latin, Arabic, and Chinese/Japanese/Korean.

Xazin · 2024-04-23T16:19:15Z

For Arabic (the ranges below fall within the Arabic block, U+0600-U+06FF; Cyrillic, U+0400-U+04FF, is not covered):
/[A-Za-z\u0600-\u065F\u066A-\u06EF\u06FA-\u06FF]+/g

For Latin:
/\b\w+\b/ug - should do it, with support for counting a-b as two words, same with we're.
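
A rough Python sketch of how these two patterns behave, not the editor's actual implementation. Two assumptions: the character class is written with `A-Za-z`, because `A-z` also spans the six punctuation characters between `Z` and `a`, and `re.ASCII` stands in for an engine whose `\w` is ASCII-only. The Arabic sample sentence is hypothetical.

```python
import re

# Assumption: A-Za-z instead of A-z, which would also match [ \ ] ^ _ and `.
ARABIC_OR_LATIN = re.compile(r"[A-Za-z\u0600-\u065F\u066A-\u06EF\u06FA-\u06FF]+")

# Assumption: re.ASCII approximates an ASCII-only \w.
LATIN = re.compile(r"\b\w+\b", re.ASCII)

# Hypothetical sample: two Arabic words separated by a space.
print(len(ARABIC_OR_LATIN.findall("مرحبا بالعالم")))

# The hyphen and the apostrophe are not word characters, so each splits the token.
print(LATIN.findall("a-b"))
print(LATIN.findall("we're"))
```

The excluded code points U+0660-U+0669 and U+06F0-U+06F9 are the Arabic-Indic digit ranges, so digit runs in those scripts are not counted as words by this class.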

For Chinese, I assume one way to do it is to count the characters and multiply by a factor; the same goes for Korean and Thai.

Factor CH: 0.7
Factor KO: 0.5
Factor Thai: 0.25
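
The factor idea could be sketched roughly as follows, in Python. The three factors are the ones quoted above. Everything else is an assumption for illustration: the `estimate_words` helper, the script keys, and the choice of Unicode ranges (CJK Unified Ideographs, Hangul Syllables, and the Thai block) are hypothetical and not taken from the project.

```python
import re

# Factors come from the thread; the character ranges are illustrative assumptions.
SCRIPTS = {
    "zh": (re.compile(r"[\u4E00-\u9FFF]"), 0.7),   # CJK Unified Ideographs
    "ko": (re.compile(r"[\uAC00-\uD7A3]"), 0.5),   # Hangul Syllables
    "th": (re.compile(r"[\u0E00-\u0E7F]"), 0.25),  # Thai block
}


def estimate_words(text: str, script: str) -> float:
    """Estimate a word count as (characters of the given script) * factor."""
    pattern, factor = SCRIPTS[script]
    return len(pattern.findall(text)) * factor


print(estimate_words("你好世界", "zh"))    # 4 characters * 0.7
print(estimate_words("안녕하세요", "ko"))  # 5 syllables * 0.5
```

The result is left as a float here; whether to round, floor, or ceil it for display is a separate decision that the thread does not settle.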

chore: add word count regression test

cecdd20

Xazin approved these changes Apr 23, 2024

Xazin self-assigned this Apr 27, 2024

LucasXu0 force-pushed the main branch from b3874cd to f2eee69 Compare May 22, 2024 01:50

LucasXu0 closed this Jul 4, 2024
