Add support for Chinese and Japanese stop words #507

sarahyurick · 2025-01-31T21:02:42Z

Closes #459.

Related PRs:

Signed-off-by: Sarah Yurick <[email protected]>

sarahyurick · 2025-01-31T21:04:58Z

nemo_curator/download/commoncrawl.py

+      stop_lists: A dictionary of stop lists, where the keys are languages (e.g., "ENGLISH")
+        and the values are Python frozensets denoting the list of stop words for that language.
+        If None, it defaults to jusText's stop lists: https://github.com/miso-belica/jusText/tree/main/justext/stoplists,
+        with added Thai, Chinese, and Japanese support.


While I think it's important for NeMo Curator to support Thai, Chinese, and Japanese, I also think it would be a good idea for us to allow users to pass in their own stop lists as a workaround.

This way, if a language is not already supported, the user can do it themselves. Additionally, a user might not like the stop lists provided by jusText and want to pass in their own custom stop lists for that reason, too.
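As a rough illustration of the behavior described in the docstring (use the caller's stop lists if given, otherwise fall back to the defaults), here is a minimal sketch. It is not the code from this PR: `resolve_stop_lists` and the tiny `DEFAULT_STOP_LISTS` table are hypothetical stand-ins, and normalizing keys to upper case is an assumption made for the example.

```python
from typing import Dict, FrozenSet, Iterable, Optional

# Hypothetical stand-in for the defaults. In the PR these come from jusText
# plus the added Thai, Chinese, and Japanese lists.
DEFAULT_STOP_LISTS: Dict[str, FrozenSet[str]] = {
    "ENGLISH": frozenset({"the", "and", "of"}),
    "JAPANESE": frozenset({"の", "に", "は"}),
}


def resolve_stop_lists(
    stop_lists: Optional[Dict[str, Iterable[str]]] = None,
) -> Dict[str, FrozenSet[str]]:
    """Return the caller's stop lists if given, otherwise the defaults."""
    if stop_lists is None:
        return dict(DEFAULT_STOP_LISTS)
    # Assumption for this sketch: keys are normalized to upper case and values
    # to frozensets so lookups behave the same for both kinds of input.
    return {lang.upper(): frozenset(words) for lang, words in stop_lists.items()}


# A user supplies a stop list for a language that is not covered by default.
custom = {"korean": {"은", "는", "이", "가"}}
resolved = resolve_stop_lists(custom)
```

In this sketch the custom dictionary replaces the defaults rather than merging with them, which matches the "If None, it defaults to ..." wording above. Whether the real implementation merges the two is not stated in this thread.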

sarahyurick · 2025-01-31T21:08:24Z

nemo_curator/download/commoncrawl.py

@@ -128,6 +128,7 @@ def extract_text(self, html, stop_words):
        paragraphs = handler.paragraphs

        # Context free classification
+        # TODO: Check Thai, Chinese, Japanese, and Korean


Words in Thai, Chinese, Japanese, and Korean are not separated by spaces. I need to either (1) raise a warning about the stop word logic used by our jusText/Resiliparse extractors (perhaps suggesting that users adjust the stopwords_low/stopwords_high/required_stopword_density parameters) or (2) add word-splitting logic like I did for #320.
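To make the concern concrete, the following sketch shows why a whitespace-based stop word density breaks down for text without spaces. It is a simplified illustration, not jusText's or Resiliparse's actual code, and `whitespace_stopword_density` is a hypothetical helper.

```python
def whitespace_stopword_density(text: str, stop_words: frozenset) -> float:
    """Fraction of whitespace-separated tokens that are stop words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(token.lower() in stop_words for token in tokens) / len(tokens)


english = "the cat sat on the mat"
japanese = "猫はマットの上に座った"  # no spaces between words

english_density = whitespace_stopword_density(english, frozenset({"the", "on"}))
japanese_density = whitespace_stopword_density(japanese, frozenset({"は", "の", "に"}))
```

The Japanese sentence contains all three particles from its stop list, but because the whole sentence is a single whitespace token, its density comes out as zero. A density threshold tuned for space-separated languages would therefore tend to treat such paragraphs as boilerplate.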

Signed-off-by: Sarah Yurick <[email protected]>

ryantwolf

A few minor points.

ryantwolf · 2025-03-06T18:07:59Z

nemo_curator/download/commoncrawl.py


-    stop_list_dict["THAI"] = thai_stopwords
+        if lang_key in ["THAI", "CHINESE", "JAPANESE"]:


Nit: Make this conditional: if lang_key in custom_stopwords

ryantwolf · 2025-03-06T18:09:01Z

nemo_curator/download/commoncrawl.py

@@ -152,7 +158,19 @@ def extract_text(self, html, stop_words):
            self.max_heading_distance,
        )

-        return [p.text for p in paragraphs if not p.is_boilerplate]
+        if self.is_boilerplate is None:
+            if language in ["THAI", "CHINESE", "JAPANESE", "KOREAN"]:


Can you abstract away this list of 4 languages to a global var that is shared between all of the extractors?

Updated, thanks!
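One way the shared constant could look is sketched below. The name `NON_SPACED_LANGUAGES` appears later in this PR, but where it is defined, the use of a frozenset, and the `uses_spaces` helper are assumptions made for illustration.

```python
# Hypothetical shared module imported by every extractor.
NON_SPACED_LANGUAGES = frozenset({"THAI", "CHINESE", "JAPANESE", "KOREAN"})


def uses_spaces(language: str) -> bool:
    """Return True if words in this language are expected to be space-separated."""
    return language.upper() not in NON_SPACED_LANGUAGES
```

Keeping the list in one place means that adding a language later requires a single change instead of one per extractor.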

Signed-off-by: Sarah Yurick <[email protected]>

ryantwolf · 2025-03-06T19:02:10Z

docs/user-guide/download.rst

- 1. Decode the HTML within the record from binary to text.
- 2. If the HTML can be properly decoded, then with `pyCLD2 <https://github.com/aboSamoor/pycld2>`_, perform language detection on the input HTML.
- 3. Finally, the extract the relevant text with `jusText <https://github.com/miso-belica/jusText>`_ or `Resiliparse <https://github.com/chatnoir-eu/chatnoir-resiliparse>`_ from the HTML and write it out as a single string within the 'text' field of a json entry within a `.jsonl` file.
+  1. Decode the HTML within the record from binary to text.


I just took a peek at the rendered version of this page and it looks a little messed up. Can you fix it?

Updated, let me know what you think.

Signed-off-by: Sarah Yurick <[email protected]>

sarahyurick · 2025-03-06T19:40:57Z

nemo_curator/download/commoncrawl.py

+            if language in NON_SPACED_LANGUAGES:
+                warnings.warn(
+                    "stopword_density is ignored for non-space-separated languages."
+                )
+                result = paragraphs


Since #431 was just merged, I added this logic to the Trafilatura extractor. It is the same as the Resiliparse logic.
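The warn-and-fall-back behavior in the hunk above can be sketched as a standalone function. This is an assumption-laden illustration rather than the extractor code: `select_paragraphs` and the `is_kept` callback are hypothetical, and the real filtering conditions are not shown in this thread.

```python
import warnings

NON_SPACED_LANGUAGES = frozenset({"THAI", "CHINESE", "JAPANESE", "KOREAN"})


def select_paragraphs(paragraphs, language, is_kept):
    """Filter paragraphs with `is_kept`, unless the language is non-spaced."""
    if language in NON_SPACED_LANGUAGES:
        # Stop word density is unreliable without spaces, so warn the caller
        # and return every paragraph unfiltered.
        warnings.warn(
            "stopword_density is ignored for non-space-separated languages."
        )
        return list(paragraphs)
    return [p for p in paragraphs if is_kept(p)]
```

For a space-separated language the callback decides which paragraphs survive. For the four listed languages all paragraphs are returned and a single warning is emitted.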

add zh and ja stopwords

c0278dd

Signed-off-by: Sarah Yurick <[email protected]>

sarahyurick commented Jan 31, 2025

sarahyurick and others added 8 commits January 31, 2025 13:11

run isort

f135f3f

Signed-off-by: Sarah Yurick <[email protected]>

edit doc

9789d08

Signed-off-by: Sarah Yurick <[email protected]>

indent?

1977940

Signed-off-by: Sarah Yurick <[email protected]>

rst file

3b94ba7

Signed-off-by: Sarah Yurick <[email protected]>

rst?

f2675bf

Signed-off-by: Sarah Yurick <[email protected]>

more indents?

83045d0

Signed-off-by: Sarah Yurick <[email protected]>

Merge branch 'main' into custom_stopwords

ac38e6e

Signed-off-by: Sarah Yurick <[email protected]>

fix todos and add pytests

fabecf6

Signed-off-by: Sarah Yurick <[email protected]>

sarahyurick marked this pull request as ready for review February 19, 2025 20:47

run black

a1011fe

Signed-off-by: Sarah Yurick <[email protected]>

ryantwolf reviewed Mar 6, 2025

sarahyurick and others added 3 commits March 6, 2025 10:25

Merge branch 'main' into custom_stopwords

9bf4063

Signed-off-by: Sarah Yurick <[email protected]>

add Ryan's suggestions

ebd2b81

Signed-off-by: Sarah Yurick <[email protected]>

run isort

37fd964

Signed-off-by: Sarah Yurick <[email protected]>

sarahyurick requested a review from ryantwolf March 6, 2025 18:56

ryantwolf reviewed Mar 6, 2025

sarahyurick and others added 3 commits March 6, 2025 11:13

Merge branch 'main' into custom_stopwords

3928475

Signed-off-by: Sarah Yurick <[email protected]>

edit rst file

d104bef

Signed-off-by: Sarah Yurick <[email protected]>

add trafilatura support

cd2e8d0

Signed-off-by: Sarah Yurick <[email protected]>

sarahyurick commented Mar 6, 2025

sarahyurick requested a review from ryantwolf March 6, 2025 19:41

ryantwolf approved these changes Mar 7, 2025

sarahyurick merged commit 9a2bd42 into NVIDIA:main Mar 7, 2025
4 checks passed
