
docs: readme refresher #758


Open
lbliii wants to merge 5 commits into main from llane/readme-rewrite

Conversation

@lbliii (Contributor) commented Jun 30, 2025

Updates based on the GitHub README SEO exercise.

Signed-off-by: Lawrence Lane <[email protected]>

copy-pr-bot bot commented Jun 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Lawrence Lane <[email protected]>
@lbliii marked this pull request as draft July 1, 2025 15:21
@nirmal-kumar left a comment

The changes look good. Approving.

Signed-off-by: Lawrence Lane <[email protected]>
@lbliii marked this pull request as ready for review July 1, 2025 16:33
@arhamm1 assigned arhamm1 and unassigned arhamm1 on Jul 11, 2025
@arhamm1 requested a review from abhinavg4 July 11, 2025 20:07
@abhinavg4 (Contributor) left a comment

Looks good. Minor change

@@ -39,65 +39,6 @@
# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv


def _is_safe_path(path: str, base_path: str) -> bool:
Contributor

Why has this changed?

Contributor Author

i think this messed up due to a rebase, sorry. it's just the readme...

Contributor Author

@praateekmahajan okay i fixed the branch

@lbliii force-pushed the llane/readme-rewrite branch from c507f0f to e710707 on July 14, 2025 15:45
@ayushdg added the ray-api label (Pick this label for auto-cherry-picking into the ray-api branch) on Jul 14, 2025
With NeMo Curator, you can process raw data and curate high-quality data for training and customizing generative AI models such as LLMs, VLMs and WFMs. NeMo Curator provides a collection of scalable data processing modules for text and image curation.

### Text Data Processing
All of our text pipelines have great multilingual support. With NeMo Curator, you can pick and choose the features you want and build your data processing pipelines. There may be several stages of data processing, depending on your use case. A typical data processing pipeline for text consists of downloading the raw data from public resources or extracting it from internal documents. It's then followed by performing cleaning steps such as fixing Unicode character, removing low-quality documents.
Contributor

Is this true? All text pipelines support multilingual?

With NeMo Curator, you can process raw data and curate high-quality data for training and customizing generative AI models such as LLMs, VLMs and WFMs. NeMo Curator provides a collection of scalable data processing modules for text and image curation.

### Text Data Processing
All of our text pipelines have great multilingual support. With NeMo Curator, you can pick and choose the features you want and build your data processing pipelines. There may be several stages of data processing, depending on your use case. A typical data processing pipeline for text consists of downloading the raw data from public resources or extracting it from internal documents. It's then followed by performing cleaning steps such as fixing Unicode character, removing low-quality documents.
Contributor

Suggested change
All of our text pipelines have great multilingual support. With NeMo Curator, you can pick and choose the features you want and build your data processing pipelines. There may be several stages of data processing, depending on your use case. A typical data processing pipeline for text consists of downloading the raw data from public resources or extracting it from internal documents. It's then followed by performing cleaning steps such as fixing Unicode character, removing low-quality documents.
All of our text pipelines have great multilingual support. With NeMo Curator, you can pick and choose the features you want and build your data processing pipelines. There may be several stages of data processing, depending on your use case. A typical data processing pipeline for text consists of downloading the raw data from public resources or extracting it from internal documents. It's then followed by performing cleaning steps such as fixing Unicode characters and removing low-quality documents.

```bash
pip install --extra-index-url https://pypi.nvidia.com nemo-curator[all]
```
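As a rough illustration of the pipeline described in the excerpt above (load raw documents, fix Unicode, drop low-quality documents), here is a minimal sketch against NeMo Curator's documented Python API. The class names (`DocumentDataset`, `Modify`, `UnicodeReformatter`, `ScoreFilter`, `WordCountFilter`) come from the library's docs, but treat the exact signatures and the 50-word cutoff as illustrative assumptions rather than a recommended configuration.

```python
# Minimal sketch of a text curation pipeline: read JSONL, normalize Unicode,
# and drop very short documents as a crude quality gate.
# Names follow NeMo Curator's documented API; verify against the version you install.
from nemo_curator import Modify, ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter

# Raw documents previously downloaded/extracted into JSONL files.
dataset = DocumentDataset.read_json("raw_docs/")

pipeline = Sequential([
    Modify(UnicodeReformatter()),                # fix mojibake and broken Unicode
    ScoreFilter(WordCountFilter(min_words=50)),  # assumed threshold, tune per corpus
])

curated = pipeline(dataset)
curated.to_json("curated_docs/")  # write the cleaned corpus back out as JSONL
```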
- [Exact Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html) - refers to removing identical documents (i.e., document strings that are equal) from the dataset. As exact deduplication requires significantly less compute, we typically will run exact deduplication before fuzzy deduplication. Also, from our experience in deduplicating Common Crawl snapshots, a significant portion (as high as ~40%) of the duplicates can be exact duplicates.
Contributor

Suggested change
- [Exact Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html) - refers to removing identical documents (i.e., document strings that are equal) from the dataset. As exact deduplication requires significantly less compute, we typically will run exact deduplication before fuzzy deduplication. Also, from our experience in deduplicating Common Crawl snapshots, a significant portion (as high as ~40%) of the duplicates can be exact duplicates.
- [Exact Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html) refers to removing identical documents (i.e., document strings that are equal) from the dataset. As exact deduplication requires significantly less compute, we typically will run exact deduplication before fuzzy deduplication. Also, from our experience in deduplicating Common Crawl snapshots, a significant portion (as high as ~40%) of the duplicates can be exact duplicates.

- [Exact Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html) - refers to removing identical documents (i.e., document strings that are equal) from the dataset. As exact deduplication requires significantly less compute, we typically will run exact deduplication before fuzzy deduplication. Also, from our experience in deduplicating Common Crawl snapshots, a significant portion (as high as ~40%) of the duplicates can be exact duplicates.
- [Fuzzy Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html) via MinHash Locality Sensitive Hashing with optional False Positive Check
- [Semantic Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/semdedup.html) - NeMo Curator provides scalable and GPU accelerated semantic deduplication functionality using RAPIDS cuML, cuDF, crossfit and PyTorch.
Contributor

Suggested change
- [Semantic Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/semdedup.html) - NeMo Curator provides scalable and GPU accelerated semantic deduplication functionality using RAPIDS cuML, cuDF, crossfit and PyTorch.
- [Semantic Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/semdedup.html) - NeMo Curator provides scalable and GPU-accelerated semantic deduplication functionality using RAPIDS cuML, cuDF, CrossFit and PyTorch.
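For intuition about how the exact-deduplication stage listed above differs from the fuzzy and semantic stages: exact deduplication only needs a hash of each document string, so byte-identical copies collapse to one representative. A framework-agnostic sketch of that idea follows; NeMo Curator's GPU-accelerated module does the same thing at scale, and the column names here are made up.

```python
# Exact deduplication in miniature: hash each document's text and keep one
# representative per hash. Fuzzy deduplication would replace the single hash
# with MinHash signatures grouped via locality-sensitive hashing.
import hashlib
import pandas as pd

docs = pd.DataFrame({
    "id": ["a", "b", "c"],
    "text": ["the quick brown fox", "the quick brown fox", "a different document"],
})

docs["doc_hash"] = docs["text"].map(
    lambda t: hashlib.md5(t.encode("utf-8")).hexdigest()
)

deduped = docs.drop_duplicates(subset="doc_hash", keep="first")
print(f"kept {len(deduped)} of {len(docs)} documents")  # kept 2 of 3 documents
```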

- Once you download the content, you can process it with NeMo Curator and convert it into JSONL or Parquet format for easier data processing.
- **[Language Identification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentification.html)**
- NeMo Curator provides utilities to identify languages using fastText. Even though a preliminary language identification may have been performed on the unextracted text (as is the case in our Common Crawl pipeline using pyCLD2), fastText is more accurate so it can be used for a second pass.
- **[Text Cleaning](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/textcleaning.html)**
Contributor

Page not found.
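The fastText-based language identification mentioned in the excerpt above can be exercised directly with fastText's publicly released `lid.176.bin` model. A small sketch; the model path is wherever you downloaded the file, and the sample text is arbitrary:

```python
# Second-pass language identification with fastText's released LID model.
# Download lid.176.bin from the fastText website first; the path is illustrative.
import fasttext

model = fasttext.load_model("lid.176.bin")

text = "NeMo Curator provides scalable data processing modules."
labels, probs = model.predict(text, k=1)   # e.g. (('__label__en',), array([0.97...]))
language = labels[0].replace("__label__", "")
print(language, float(probs[0]))
```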

### Access Python Modules

The NeMo Curator section of the [NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/index.html) provides in-depth information about how the Python modules work. The [examples](examples/) directory in the GitHub repository provides scripts that showcase these modules.
- **[Embedding Creation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/image/classifiers/embedders.html)**
Contributor

Page not found.

Comment on lines +71 to +80
- [**Generate Synthetic Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-synthetic-prompts)
- [**Generate Open Q&A Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-open-q-a-prompts)
- [**Generate Writing Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-writing-prompts)
- [**Generate Closed Q&A Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-closed-q-a-prompts)
- [**Generate Math Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-math-prompts)
- [**Generate Coding Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-coding-prompts)
- [**Generate Dialogue**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-dialogue)
- [**Generate Synthetic Two-Turn Prompts**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#generate-synthetic-two-turn-prompts)
- [**Nemotron CC pipeline - Rewrite to Wikipedia Style**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#rewrite-to-wikipedia-style)
- [**Nemotron CC pipeline - Knowledge Distillation**](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/syntheticdata.html#rewrite-to-wikipedia-style)
Contributor

Is it overkill to link to all of these individually? They are all on the same page.


This section explains how to install NeMo Curator and use the Python library, Python modules, and CLI scripts. It also includes a list of tutorials to help you get started right away. Finally, this section explains how to use the NeMo Framework Launcher as an alternative method for interfacing with NeMo Curator.

### Install NeMo Curator
Contributor

Would prefer keeping this section in the README. We regularly point users here for instructions on various ways to install...

Labels
ray-api: Pick this label for auto-cherry-picking into the ray-api branch

7 participants