Add support for Chinese and Japanese stop words #507

Draft · sarahyurick wants to merge 7 commits into main

Conversation

sarahyurick (Collaborator)
Comment on lines +415 to +418
stop_lists: A dictionary of stop lists, where the keys are languages (e.g., "ENGLISH")
and the values are Python frozensets denoting the list of stop words for that language.
If None, it defaults to jusText's stop lists: https://github.com/miso-belica/jusText/tree/main/justext/stoplists,
with added Thai, Chinese, and Japanese support.
sarahyurick (Collaborator, Author):
While I think it's important for NeMo Curator to support Thai, Chinese, and Japanese out of the box, I also think it would be a good idea to allow users to pass in their own stop lists as a workaround.

This way, if a language is not already supported, the user can supply a stop list for it themselves. Additionally, a user might not like the stop lists provided by jusText and may want to pass in custom stop lists for that reason, too (see the sketch below).
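For illustration, here is a minimal sketch of the shape such a user-supplied stop_lists argument would take, per the docstring above. The CHINESE entries are a tiny illustrative sample, and the entry point that would accept the argument is an assumption based on this PR, not a confirmed API:

    custom_stop_lists = {
        # Keys are language names, values are Python frozensets of stop words,
        # matching the shape described in the stop_lists docstring above.
        "ENGLISH": frozenset({"the", "a", "an", "and", "or", "of"}),
        "CHINESE": frozenset({"的", "了", "是", "在", "我"}),  # tiny illustrative sample
    }

    # Assumed usage, pending this PR's final API: pass the dictionary to
    # whichever download/extract entry point exposes stop_lists, e.g.:
    # download_common_crawl(..., stop_lists=custom_stop_lists)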

@@ -128,6 +128,7 @@ def extract_text(self, html, stop_words):
paragraphs = handler.paragraphs

# Context free classification
# TODO: Check Thai, Chinese, Japanese, and Korean
sarahyurick (Collaborator, Author):

Words in Thai, Chinese, Japanese, and Korean are not separated by spaces. I need to make sure to either (1) raise a warning about the stop word logic used by our jusText/Resiliparse extractors (for example, suggesting that users modify the stopwords_low/stopwords_high/required_stopword_density parameters?) or (2) add word splitting logic like I did for #320.
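As a rough illustration of option (2), the sketch below segments Chinese text with the third-party jieba segmenter before computing a stop word density. The segmenter choice, the tiny stop list, and the density helper are all assumptions for illustration, not NeMo Curator's actual implementation:

    import jieba  # third-party Chinese word segmenter, used here as an example

    # Tiny illustrative stop list; a real list would be much larger.
    CHINESE_STOP_WORDS = frozenset({"的", "了", "是", "我", "在"})

    def stop_word_density(text: str, stop_words: frozenset) -> float:
        # Split the unspaced text into words, then measure what fraction
        # of the resulting tokens are stop words.
        tokens = jieba.lcut(text)
        if not tokens:
            return 0.0
        return sum(token in stop_words for token in tokens) / len(tokens)

    # Example: density for a short Chinese sentence.
    print(stop_word_density("我在图书馆看了一本书", CHINESE_STOP_WORDS))

A density-based check like this is what the stopwords_low/stopwords_high parameters already gate on for space-separated languages; segmenting first would let the same thresholds apply to unspaced text.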

Successfully merging this pull request may close the following issue: "jusText not work with Chinese webpage"