
docs: readme refresher #758


Open
lbliii wants to merge 5 commits into main from llane/readme-rewrite

Conversation

@lbliii (Contributor) commented Jun 30, 2025

Updates based on the GitHub README SEO exercise.

Signed-off-by: Lawrence Lane <[email protected]>

copy-pr-bot bot commented Jun 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Lawrence Lane <[email protected]>
@lbliii marked this pull request as draft July 1, 2025 15:21
@nirmal-kumar left a comment

The changes look good. Approving.

Signed-off-by: Lawrence Lane <[email protected]>
@lbliii marked this pull request as ready for review July 1, 2025 16:33
@arhamm1 assigned arhamm1 and unassigned arhamm1 on Jul 11, 2025
@arhamm1 requested a review from abhinavg4 July 11, 2025 20:07
@abhinavg4 (Contributor) left a comment

Looks good. Minor change

@@ -39,65 +39,6 @@
# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv


def _is_safe_path(path: str, base_path: str) -> bool:
Contributor

Why has this changed?

Contributor Author

i think this messed up due to a rebase, sorry. it's just the readme...

Contributor Author

@praateekmahajan okay i fixed the branch

@lbliii force-pushed the llane/readme-rewrite branch from c507f0f to e710707 on July 14, 2025 15:45
@ayushdg added the ray-api label (Pick this label for auto-cherry-picking into the ray-api branch) on Jul 14, 2025
With NeMo Curator, you can process raw data and curate high-quality data for training and customizing generative AI models such as LLMs, VLMs and WFMs. NeMo Curator provides a collection of scalable data processing modules for text and image curation.

### Text Data Processing
All of our text pipelines have great multilingual support. With NeMo Curator, you can pick and choose the features you want and build your data processing pipelines. There may be several stages of data processing, depending on your use case. A typical data processing pipeline for text consists of downloading the raw data from public resources or extracting it from internal documents. It's then followed by performing cleaning steps such as fixing Unicode character, removing low-quality documents.
Contributor

Is this true? All text pipelines support multilingual?

With NeMo Curator, you can process raw data and curate high-quality data for training and customizing generative AI models such as LLMs, VLMs and WFMs. NeMo Curator provides a collection of scalable data processing modules for text and image curation.

### Text Data Processing
All of our text pipelines have great multilingual support. With NeMo Curator, you can pick and choose the features you want and build your data processing pipelines. There may be several stages of data processing, depending on your use case. A typical data processing pipeline for text consists of downloading the raw data from public resources or extracting it from internal documents. It's then followed by performing cleaning steps such as fixing Unicode character, removing low-quality documents.
Contributor

Suggested change
All of our text pipelines have great multilingual support. With NeMo Curator, you can pick and choose the features you want and build your data processing pipelines. There may be several stages of data processing, depending on your use case. A typical data processing pipeline for text consists of downloading the raw data from public resources or extracting it from internal documents. It's then followed by performing cleaning steps such as fixing Unicode character, removing low-quality documents.
All of our text pipelines have great multilingual support. With NeMo Curator, you can pick and choose the features you want and build your data processing pipelines. There may be several stages of data processing, depending on your use case. A typical data processing pipeline for text consists of downloading the raw data from public resources or extracting it from internal documents. It's then followed by performing cleaning steps such as fixing Unicode characters and removing low-quality documents.

```bash
pip install --extra-index-url https://pypi.nvidia.com nemo-curator[all]
```
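As a rough illustration of the pipeline described in the excerpt above (load raw documents, fix Unicode, drop low-quality documents), here is a minimal sketch against NeMo Curator's documented Python API. The class names (`DocumentDataset`, `Modify`, `UnicodeReformatter`, `ScoreFilter`, `WordCountFilter`) come from the library's docs, but treat the exact signatures and the 50-word cutoff as illustrative assumptions rather than a recommended configuration.

```python
# Minimal sketch of a text curation pipeline: read JSONL, normalize Unicode,
# and drop very short documents as a crude quality gate.
# Names follow NeMo Curator's documented API; verify against the version you install.
from nemo_curator import Modify, ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter

# Raw documents previously downloaded/extracted into JSONL files.
dataset = DocumentDataset.read_json("raw_docs/")

pipeline = Sequential([
    Modify(UnicodeReformatter()),                # fix mojibake and broken Unicode
    ScoreFilter(WordCountFilter(min_words=50)),  # assumed threshold, tune per corpus
])

curated = pipeline(dataset)
curated.to_json("curated_docs/")  # write the cleaned corpus back out as JSONL
```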
- [Exact Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html) - refers to removing identical documents (i.e., document strings that are equal) from the dataset. As exact deduplication requires significantly less compute, we typically will run exact deduplication before fuzzy deduplication. Also, from our experience in deduplicating Common Crawl snapshots, a significant portion (as high as ~40%) of the duplicates can be exact duplicates.
Contributor

Suggested change
- [Exact Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html) - refers to removing identical documents (i.e., document strings that are equal) from the dataset. As exact deduplication requires significantly less compute, we typically will run exact deduplication before fuzzy deduplication. Also, from our experience in deduplicating Common Crawl snapshots, a significant portion (as high as ~40%) of the duplicates can be exact duplicates.
- [Exact Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html) refers to removing identical documents (i.e., document strings that are equal) from the dataset. As exact deduplication requires significantly less compute, we typically will run exact deduplication before fuzzy deduplication. Also, from our experience in deduplicating Common Crawl snapshots, a significant portion (as high as ~40%) of the duplicates can be exact duplicates.

- [Exact Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html) - refers to removing identical documents (i.e., document strings that are equal) from the dataset. As exact deduplication requires significantly less compute, we typically will run exact deduplication before fuzzy deduplication. Also, from our experience in deduplicating Common Crawl snapshots, a significant portion (as high as ~40%) of the duplicates can be exact duplicates.
- [Fuzzy Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html) via MinHash Locality Sensitive Hashing with optional False Positive Check
- [Semantic Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/semdedup.html) - NeMo Curator provides scalable and GPU accelerated semantic deduplication functionality using RAPIDS cuML, cuDF, crossfit and PyTorch.
Contributor

Suggested change
- [Semantic Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/semdedup.html) - NeMo Curator provides scalable and GPU accelerated semantic deduplication functionality using RAPIDS cuML, cuDF, crossfit and PyTorch.
- [Semantic Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/semdedup.html) - NeMo Curator provides scalable and GPU-accelerated semantic deduplication functionality using RAPIDS cuML, cuDF, CrossFit and PyTorch.
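For intuition about how the exact-deduplication stage listed above differs from the fuzzy and semantic stages: exact deduplication only needs a hash of each document string, so byte-identical copies collapse to one representative. A framework-agnostic sketch of that idea follows; NeMo Curator's GPU-accelerated module does the same thing at scale, and the column names here are made up.

```python
# Exact deduplication in miniature: hash each document's text and keep one
# representative per hash. Fuzzy deduplication would replace the single hash
# with MinHash signatures grouped via locality-sensitive hashing.
import hashlib
import pandas as pd

docs = pd.DataFrame({
    "id": ["a", "b", "c"],
    "text": ["the quick brown fox", "the quick brown fox", "a different document"],
})

docs["doc_hash"] = docs["text"].map(
    lambda t: hashlib.md5(t.encode("utf-8")).hexdigest()
)

deduped = docs.drop_duplicates(subset="doc_hash", keep="first")
print(f"kept {len(deduped)} of {len(docs)} documents")  # kept 2 of 3 documents
```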

- Once you download the content, you can process it with NeMo Curator and convert it into JSONL or Parquet format for easier data processing.
- **[Language Identification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentification.html)**
- NeMo Curator provides utilities to identify languages using fastText. Even though a preliminary language identification may have been performed on the unextracted text (as is the case in our Common Crawl pipeline using pyCLD2), fastText is more accurate so it can be used for a second pass.
- **[Text Cleaning](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/textcleaning.html)**
Contributor

Page not found.
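The fastText-based language identification mentioned in the excerpt above can be exercised directly with fastText's publicly released `lid.176.bin` model. A small sketch; the model path is wherever you downloaded the file, and the sample text is arbitrary:

```python
# Second-pass language identification with fastText's released LID model.
# Download lid.176.bin from the fastText website first; the path is illustrative.
import fasttext

model = fasttext.load_model("lid.176.bin")

text = "NeMo Curator provides scalable data processing modules."
labels, probs = model.predict(text, k=1)   # e.g. (('__label__en',), array([0.97...]))
language = labels[0].replace("__label__", "")
print(language, float(probs[0]))
```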

### Access Python Modules

The NeMo Curator section of the [NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/index.html) provides in-depth information about how the Python modules work. The [examples](examples/) directory in the GitHub repository provides scripts that showcase these modules.
- **[Embedding Creation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/image/classifiers/embedders.html)**
Contributor

Page not found.

Comment on lines +71 to +80
- [**Generate Synthetic Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-synthetic-prompts)
- [**Generate Open Q&A Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-open-q-a-prompts)
- [**Generate Writing Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-writing-prompts)
- [**Generate Closed Q&A Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-closed-q-a-prompts)
- [**Generate Math Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-math-prompts)
- [**Generate Coding Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-coding-prompts)
- [**Generate Dialogue**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-dialogue)
- [**Generate Synthetic Two-Turn Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-synthetic-two-turn-prompts)
- [**Nemotron CC pipeline - Rewrite to Wikipedia Style**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#rewrite-to-wikipedia-style)
- [**Nemotron CC pipeline - Knowledge Distillation**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#rewrite-to-wikipedia-style)
Contributor

Is it overkill to link to all of these individually? They are all on the same page.


This section explains how to install NeMo Curator and use the Python library, Python modules, and CLI scripts. It also includes a list of tutorials to help you get started right away. Finally, this section explains how to use the NeMo Framework Launcher as an alternative method for interfacing with NeMo Curator.

### Install NeMo Curator
Contributor

Would prefer keeping this section in the README. We regularly point users here for instructions on various ways to install...

Labels
ray-api: Pick this label for auto-cherry-picking into the ray-api branch

7 participants