Commit 640546c

Update README.md (#126)

* Improve and shorten.
* Include links to blogs and tutorials.
* Remove incorrect info about non-existent branches.
* Add a header and a diagram.
* Add a note about incremental deduplication.

Signed-off-by: Mehran Maghoumi <[email protected]>

1 parent 3d57926 · commit 640546c

File tree

4 files changed: +99 −50 lines

README.md

Lines changed: 62 additions & 50 deletions
````diff
@@ -1,49 +1,56 @@
-# NeMo Curator
+<div align="center">
 
-NeMo Curator is a Python library specifically designed for scalable and efficient dataset preparation. It greatly accelerates data curation by leveraging GPUs with [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids), resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens.
+<a href="https://github.com/NVIDIA/NeMo-Curator/blob/main/LICENSE">![https://pypi.org/project/nemo-curator](https://img.shields.io/github/license/NVIDIA/NeMo-Curator)</a>
+<a href="https://pypi.org/project/nemo-curator/">![https://pypi.org/project/nemo-curator/](https://img.shields.io/pypi/pyversions/nemo-curator.svg)</a>
+<a href="https://github.com/NVIDIA/NeMo-Curator/graphs/contributors">![NVIDIA/NeMo-Curator](https://img.shields.io/github/contributors/NVIDIA/NeMo-Curator)</a>
+<a href="https://github.com/NVIDIA/NeMo-Curator/releases">![https://github.com/NVIDIA/NeMo-Curator/releases](https://img.shields.io/github/release/NVIDIA/NeMo-Curator)</a>
+<a href="https://pypi.org/project/nemo-curator/">![https://github.com/Naereen/badges/](https://badgen.net/badge/open%20source/❤/blue?icon=github)</a>
 
-At the core of NeMo Curator is the `DocumentDataset`, which serves as the main dataset class. It acts as a straightforward wrapper around a Dask `DataFrame`. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.
+</div>
 
-## Key Features
+# NeMo Curator
+🚀 **The GPU-Accelerated Open Source Framework for Efficient Large Language Model Data Curation** 🚀
 
-NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include:
+<p align="center">
+  <img src="./docs/user-guide/images/diagram.png" alt="diagram"/>
+</p>
 
-[Data download and text extraction](docs/user-guide/download.rst)
+NeMo Curator is a Python library specifically designed for fast and scalable dataset preparation and curation for [large language model (LLM)](https://www.nvidia.com/en-us/glossary/large-language-models/) use cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids), resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens.
 
-- Default implementations for downloading and extracting Common Crawl, Wikipedia, and ArXiv data
-- Easily customize the download and extraction and extend to other datasets
+At the core of NeMo Curator is the `DocumentDataset`, which serves as the main dataset class. It acts as a straightforward wrapper around a Dask `DataFrame`. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.
 
-[Language identification and separation](docs/user-guide/languageidentificationunicodeformatting.rst)
+## Key Features
 
-- Language identification with [fastText](https://fasttext.cc/docs/en/language-identification.html) and [pycld2](https://pypi.org/project/pycld2/)
+NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include:
 
-[Text reformatting and cleaning](docs/user-guide/languageidentificationunicodeformatting.rst)
+- [Data download and text extraction](docs/user-guide/download.rst)
 
-- Fix unicode decoding errors via [ftfy](https://ftfy.readthedocs.io/en/latest/)
+  - Default implementations for downloading and extracting Common Crawl, Wikipedia, and ArXiv data
+  - Easily customize the download and extraction and extend to other datasets
 
-[Quality filtering](docs/user-guide/qualityfiltering.rst)
+- [Language identification and separation](docs/user-guide/languageidentificationunicodeformatting.rst) with [fastText](https://fasttext.cc/docs/en/language-identification.html) and [pycld2](https://pypi.org/project/pycld2/)
 
-- Multilingual heuristic-based filtering
-- Classifier-based filtering via [fastText](https://fasttext.cc/)
+- [Text reformatting and cleaning](docs/user-guide/languageidentificationunicodeformatting.rst) to fix unicode decoding errors via [ftfy](https://ftfy.readthedocs.io/en/latest/)
 
-[Document-level deduplication](docs/user-guide/gpudeduplication.rst)
+- [Quality filtering](docs/user-guide/qualityfiltering.rst)
 
-- Both exact and fuzzy deduplication are accelerated using cuDF and Dask
-- For fuzzy deduplication, our implementation follows the method described in [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990)
+  - Multilingual heuristic-based filtering
+  - Classifier-based filtering via [fastText](https://fasttext.cc/)
 
-[Multilingual downstream-task decontamination](docs/user-guide/taskdecontamination.rst)
+- [Document-level deduplication](docs/user-guide/gpudeduplication.rst)
 
-- Our implementation follows the approach of [OpenAI GPT3](https://arxiv.org/pdf/2005.14165.pdf) and [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990)
+  - Both exact and fuzzy (near-identical) deduplication are accelerated using cuDF and Dask
+  - For fuzzy deduplication, our implementation follows the method described in [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990)
 
-[Distributed data classification](docs/user-guide/distributeddataclassification.rst)
+- [Multilingual downstream-task decontamination](docs/user-guide/taskdecontamination.rst) following the approach of [OpenAI GPT3](https://arxiv.org/pdf/2005.14165.pdf) and [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990)
 
-- Multi-node, multi-GPU classifier inference
-- Provides sophisticated domain and quality classification
-- Flexible interface for extending to your own classifier network
+- [Distributed data classification](docs/user-guide/distributeddataclassification.rst)
 
-[Personal identifiable information (PII) redaction](docs/user-guide/personalidentifiableinformationidentificationandremoval.rst)
+  - Multi-node, multi-GPU classifier inference
+  - Provides sophisticated domain and quality classification
+  - Flexible interface for extending to your own classifier network
 
-- Identification tools for removing addresses, credit card numbers, social security numbers, and more
+- [Personal identifiable information (PII) redaction](docs/user-guide/personalidentifiableinformationidentificationandremoval.rst) for removing addresses, credit card numbers, social security numbers, and more
 
 These modules offer flexibility and permit reordering, with only a few exceptions. In addition, the [NeMo Framework Launcher](https://github.com/NVIDIA/NeMo-Megatron-Launcher) provides pre-built pipelines that can serve as a foundation for your customization use cases.
````
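To make the exact-deduplication idea listed above concrete (hash each document's normalized text and keep only the first occurrence), here is a minimal pure-Python sketch. It is an illustration only, not the NeMo Curator implementation, which performs this at scale with cuDF and Dask.

```python
import hashlib

def exact_dedup(documents):
    """Keep the first occurrence of each exact duplicate.

    Documents are keyed by a hash of their normalized text, mirroring
    the idea behind exact document-level deduplication.
    """
    seen = set()
    unique = []
    for doc in documents:
        # Normalize lightly before hashing so trivial variants collide.
        key = hashlib.md5(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello world", "hello world  ", "A different document"]
print(exact_dedup(docs))  # the second doc is an exact duplicate after normalization
```

Fuzzy deduplication replaces the exact hash with MinHash signatures so that near-identical documents also collide.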

````diff
@@ -52,12 +59,18 @@ These modules offer flexibility and permit reordering, with only a few exception
 - [Documentation](docs/)
 - [Examples](examples/)
 - [Tutorials](tutorials/)
+- Blog posts
+  - [Curating Trillion-Token Datasets: Introducing NVIDIA NeMo Data Curator](https://developer.nvidia.com/blog/curating-trillion-token-datasets-introducing-nemo-data-curator/)
+  - [Scale and Curate High-Quality Datasets for LLM Training with NVIDIA NeMo Curator](https://developer.nvidia.com/blog/scale-and-curate-high-quality-datasets-for-llm-training-with-nemo-curator/)
+  - [Curating Custom Datasets for LLM Training with NVIDIA NeMo Curator](https://developer.nvidia.com/blog/curating-custom-datasets-for-llm-training-with-nvidia-nemo-curator/)
 
 ## Get Started
 
 This section explains how to install NeMo Curator and use the Python library, Python modules, and CLI scripts. It also includes a list of tutorials to help you get started right away. Finally, this section explains how to use the NeMo Framework Launcher as an alternative method for interfacing with NeMo Curator.
 
-## Requirements
+### Install NeMo Curator
+
+#### Requirements
 
 Before installing NeMo Curator, ensure that the following requirements are met:
````

````diff
@@ -67,13 +80,9 @@ Before installing NeMo Curator, ensure that the following requirements are met:
 - Volta™ or higher ([compute capability 7.0+](https://developer.nvidia.com/cuda-gpus))
 - CUDA 12 (or above)
 
-## Install NeMo Curator
-
 You can install NeMo-Curator from PyPi, from source or get it through the NeMo Framework container.
 
-### PyPi
-
-NeMo Curator can be installed via PyPi as follows -
+#### From PyPi
 
 To install the CPU-only modules:
````

````diff
@@ -87,7 +96,7 @@ To install the CPU and CUDA-accelerated modules:
 pip install --extra-index-url https://pypi.nvidia.com nemo-curator[cuda12x]
 ```
 
-### From Source
+#### From Source
 
 1. Clone the NeMo Curator repository in GitHub.
````

````diff
@@ -110,18 +119,17 @@ pip install --extra-index-url https://pypi.nvidia.com nemo-curator[cuda12x]
 pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"
 ```
 
-### Install from the NeMo Framework Container
-
-NeMo Curator is available in the [NeMo Framework Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags). The latest release of NeMo Curator comes preinstalled in the container.
+#### From the NeMo Framework Container
 
-If you want the latest commit inside the container, uninstall the existing version using:
+The latest release of NeMo Curator comes preinstalled in the [NeMo Framework Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags). If you want the latest commit inside the container, uninstall the existing version using:
 
 ```bash
 pip uninstall nemo-curator
 ```
 And follow the instructions for installing from source from [above](#from-source).
 
-## Use the Python Library
+## Use NeMo Curator
+### Python API Quick Example
 
 The following snippet demonstrates how to create a small data curation pipeline that downloads and curates a small subset of the Common Crawl dataset.
````

````diff
@@ -130,32 +138,38 @@ The following snippet demonstrates how to create a small data curation pipeline
 dataset = download_common_crawl("/datasets/common_crawl/", "2021-04", "2021-10", url_limit=10)
 # Build your pipeline
 curation_pipeline = Sequential([
+    # Fix unicode
     Modify(UnicodeReformatter()),
+    # Discard short records
     ScoreFilter(WordCountFilter(min_words=80)),
+    # Discard low-quality records
     ScoreFilter(FastTextQualityFilter(model_path="model.bin")),
+    # Discard records from the evaluation metrics to prevent test set leakage.
     TaskDecontamination([Winogrande(), Squad(), TriviaQA()])
 ])
-# Curate your dataset
+# Execute the pipeline on your dataset
 curated_dataset = curation_pipeline(dataset)
 ```
 
-## Explore NeMo Curator Tutorials
+### Explore NeMo Curator Tutorials
 
-To get started with NeMo Curator, you can follow the tutorials available here: [Tutorials]
-(https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials). These tutorials include:
+To get started with NeMo Curator, you can follow the tutorials [available here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials). These tutorials include:
 
-- `tinystories` which focuses on data curation for training from scratch.
-- `peft-curation` which focuses on data curation for parameter-efficient fine-tuning use-cases.
+- [`tinystories`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/tinystories) which focuses on data curation for training LLMs from scratch.
+- [`peft-curation`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation) which focuses on data curation for LLM parameter-efficient fine-tuning (PEFT) use-cases.
+- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the quality and domain classifiers to help with data annotation.
+- [`single_node_tutorial`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial) which demonstrates an end-to-end data curation pipeline for curating Wikipedia data in Thai.
 
-## Access Python Modules
 
-The Data Curation section of the [NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/index.html) provides in-depth information about how the Python modules work. The [examples](examples/) directory in the GitHub repository provides scripts that showcase these modules.
+### Access Python Modules
 
-## Use CLI Scripts
+The NeMo Curator section of the [NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/index.html) provides in-depth information about how the Python modules work. The [examples](examples/) directory in the GitHub repository provides scripts that showcase these modules.
+
+### Use CLI Scripts
 
 NeMo Curator also offers CLI scripts for you to use. The scripts in `nemo_curator/scripts` map closely to the supplied Python modules. Refer to the [NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/index.html) for more information about the Python modules and scripts.
 
-## Use NeMo Framework Launcher
+### Use NeMo Framework Launcher
 
 As an alternative method for interfacing with NeMo Curator, you can use the [NeMo Framework Launcher](https://github.com/NVIDIA/NeMo-Megatron-Launcher). The launcher enables you to easily configure the parameters and cluster. It can also automatically generate the SLURM batch scripts that wrap around the CLI scripts required to run your pipeline.
````
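The quick-example snippet in the diff is abbreviated and omits its imports, which come from `nemo_curator`. As a self-contained illustration of the same `Sequential` composition pattern that runs without the library, here is a toy version; the stage functions below are hypothetical stand-ins, not the NeMo Curator API.

```python
class Sequential:
    """Apply a list of callable stages to a dataset in order."""
    def __init__(self, stages):
        self.stages = stages

    def __call__(self, dataset):
        for stage in self.stages:
            dataset = stage(dataset)
        return dataset

# Stand-in stages over a plain list of strings; a real pipeline would
# use Modify, ScoreFilter, TaskDecontamination, etc. on a DocumentDataset.
def fix_unicode(docs):
    # Drop replacement characters left by bad decoding.
    return [d.replace("\ufffd", "") for d in docs]

def min_words(n):
    # Discard records shorter than n words.
    return lambda docs: [d for d in docs if len(d.split()) >= n]

pipeline = Sequential([fix_unicode, min_words(3)])
print(pipeline(["too short", "this one is long enough"]))  # → ['this one is long enough']
```

Because each stage takes and returns a dataset, stages can be reordered or swapped out, which is the flexibility the README describes.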

````diff
@@ -211,5 +225,3 @@ Additionally, using the CPU-based modules, the following table shows the time re
 ## Contribute to NeMo Curator
 
 We welcome community contributions! Please refer to [CONTRIBUTING.md](https://github.com/NVIDIA/NeMo/blob/stable/CONTRIBUTING.md) for the process.
-
-To contribute an article to the collection, please submit a pull request to the ``gh-pages-src`` branch of this repository. For detailed information, please consult the README located at the [gh-pages-src branch](https://github.com/NVIDIA/NeMo/tree/gh-pages-src#readme).
````

docs/user-guide/gpudeduplication.rst

Lines changed: 33 additions & 0 deletions
````diff
@@ -160,6 +160,39 @@ steps (all scripts are included in the :code:`nemo_curator/scripts/` subdirector
       --jaccard-threshold 0.8
       # --scheduler-file /path/to/file.json
 
+* Incremental Fuzzy Dedup
+   To incrementally perform fuzzy dedup, organize your incremental dataset snapshots into separate directories and pass a list of all your directories to :code:`gpu_compute_minhashes`. All other subsequent steps can be done as described above without modification.
+
+   - Input (assuming incremental snapshots are all under :code:`/input/`):
+
+     .. code-block:: bash
+
+        /input/cc-2020-40
+        /input/cc-2021-42
+        /input/cc-2022-60
+
+   - Output (assuming :code:`--output-minhash-dir=/output`):
+
+     .. code-block:: bash
+
+        /output/cc-2020-40/minhashes.parquet
+        /output/cc-2021-42/minhashes.parquet
+        /output/cc-2022-60/minhashes.parquet
+
+   - Example call:
+
+     .. code-block:: bash
+
+        # same as `python compute_minhashes.py`; --hash-bytes may be 4 or 8
+        gpu_compute_minhashes \
+          --input-data-dirs /input/cc-2020-40 /input/cc-2021-42 /input/cc-2022-60 \
+          --output-minhash-dir /output/ \
+          --input-json-text-field text_column_name \
+          --input-json-id-field id_column_name \
+          --minhash-length number_of_hashes \
+          --char-ngram char_ngram_size \
+          --hash-bytes 4 \
+          --seed 42 \
+          --log-dir ./
+        # --scheduler-file /path/to/file.json
 
 In addition to the scripts, there are examples in the `examples` directory that showcase using the python module
 directly in your own code. It also has examples on how to remove documents from the corpus using the list of duplicate IDs generated from exact or fuzzy
````
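The incremental workflow above works because MinHash signatures can be computed once per snapshot and reused by later deduplication steps. As a rough, pure-Python sketch of what a MinHash over character n-grams looks like (illustration only; the real `gpu_compute_minhashes` is GPU-accelerated and differs in detail):

```python
import hashlib

def char_ngrams(text, n=5):
    """The set of character n-grams of a document (text must be at least n chars)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_hashes=8, n=5):
    """Tiny MinHash: for each of num_hashes seeded hash functions, keep the
    minimum hash value over the document's character n-grams."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{g}".encode()).digest()[:4], "big")
            for g in char_ngrams(text, n)
        ))
    return sig

# Identical documents always produce identical signatures; similar documents
# agree on a fraction of positions roughly equal to their Jaccard similarity.
a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox jumps over the lazy cat")
print(f"{sum(x == y for x, y in zip(a, b))}/8 signature positions agree")
```

Since the signature of an old snapshot never changes, only new snapshot directories need fresh minhash computation, which is the point of the incremental setup.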

docs/user-guide/images/diagram.png

223 KB

tutorials/tinystories/README.md

Lines changed: 4 additions & 0 deletions
````diff
@@ -4,6 +4,10 @@ This tutorial demonstrates the usage of NeMo Curator's Python API to curate the
 
 For simplicity, this tutorial uses the validation split of this dataset, which contains around 22,000 samples.
 
+## Walkthrough
+For a detailed walkthrough of this tutorial, please see this [blog post](https://developer.nvidia.com/blog/curating-custom-datasets-for-llm-training-with-nvidia-nemo-curator/).
+
 ## Usage
 After installing the NeMo Curator package, you can simply run the following command:
 ```
````
