NVIDIA-NeMo
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
Lines changed: 0 additions & 2 deletions
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 47 additions & 0 deletions
diff --git a/.style.yapf b/.style.yapf
Lines changed: 0 additions & 3 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 4 additions & 4 deletions
diff --git a/SECURITY.md b/SECURITY.md
Lines changed: 1 addition & 1 deletion
diff --git a/config/arxiv_builder.yaml b/config/arxiv_builder.yaml
Lines changed: 2 additions & 2 deletions
diff --git a/config/cc_warc_builder.yaml b/config/cc_warc_builder.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/config/heuristic_filter_code.yaml b/config/heuristic_filter_code.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/config/heuristic_filter_en.yaml b/config/heuristic_filter_en.yaml
Lines changed: 9 additions & 9 deletions
@@ -40,5 +40,3 @@ jobs:
         # TODO: Remove env variable when gpu dependencies are optional
         run: |
           RAPIDS_NO_INITIALIZE=1 python -m pytest -v --cpu
-
-
@@ -0,0 +1,47 @@
+# Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+default_language_version:
+  python: python3
+
+ci:
+  autofix_prs: true
+  autoupdate_commit_msg: '[pre-commit.ci] pre-commit suggestions'
+  autoupdate_schedule: quarterly
+
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.5.0
+    hooks:
+      - id: check-added-large-files
+        args: ['--maxkb=1000']
+      - id: check-case-conflict
+      - id: check-yaml
+      - id: detect-private-key
+      - id: end-of-file-fixer
+      - id: requirements-txt-fixer
+      - id: trailing-whitespace
+
+  - repo: https://github.com/psf/black
+    rev: 24.3.0
+    hooks:
+      - id: black
+        name: Format code
+
+  - repo: https://github.com/PyCQA/isort
+    rev: 5.13.2
+    hooks:
+      - id: isort
+        name: Format imports
+        exclude: docs/
@@ -52,7 +52,7 @@ We use ``black`` as our style guide. To fix your format run `pip install pre-com
 1. Minimize the use of ``**kwargs``.
 1. ``RaiseError`` is preferred to ``assert``. Write: ```if X: raise Error``` instead of ```assert X```.
 1. Classes are preferred to standalone methods.
-1. Methods should be atomic. A method shouldn't be longer than 75 lines, e.g. can be fit into the computer screen without scrolling.
+1. Methods should be atomic. A method shouldn't be longer than 88 lines, e.g. can be fit into the computer screen without scrolling.
 1. If a method has arguments that don't fit into one line, each argument should be in its own line for readability.
 1. Add ``__init__.py`` for every folder.
 1. F-strings are prefered to formatted strings.
 
@@ -14,7 +14,7 @@ We currently support the following data-curation modules. For more details on ea
  - [Text reformatting and cleaning](docs/user-guide/LanguageIdentificationUnicodeFormatting.rst)
    - Fix unicode decoding errors via [ftfy](https://ftfy.readthedocs.io/en/latest/)
  - [Quality filtering](docs/user-guide/QualityFiltering.rst)
-   - Multilingual heuristic-based filtering 
+   - Multilingual heuristic-based filtering
    - Classifier-based filtering via [fastText](https://fasttext.cc/)
  - [Document-level deduplication](docs/user-guide/GpuDeduplication.rst)
    - Both exact and fuzzy deduplication are accelerated using cuDF and Dask.
@@ -79,7 +79,7 @@ Note: This is not the only way to run NeMo Curator on SLURM. There are example s
 
 ## Module Ablation and Compute Performance
 
-The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so 
+The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
 in a scalable manner. In order to assess the quality of the Common Crawl documents curated by the modules in NeMo Curator, we performed a series
 of ablation experiments in which we trained a 357M-parameter GPT-style model on the datasets resulting from the different stages of our data curation
 pipeline implemented in NeMo Curator. The figure below demonstrates that the different data curation modules implemented within NeMo Curator
@@ -89,7 +89,7 @@ lead to improved model zero-shot downstream task performance.
   <img src="./docs/user-guide/images/zeroshot_ablations.png" alt="drawing" width="700"/>
 </p>
 
-In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s. 
+In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.
 
 Additionally, using the CPU-based modules the table below shows the time required and resulting data size reduction of each step of processing the [Common Crawl snapshot from November/December of 2020](https://commoncrawl.org/2020/12/nov-dec-2020-crawl-archive-now-available/) using 30 CPU nodes (with hardware similar to the `c5.24xlarge` [Amazon AWS C5 instance](https://aws.amazon.com/ec2/instance-types/c5/)):
 
@@ -128,4 +128,4 @@ Additionally, using the CPU-based modules the table below shows the time require
 
 As mentioned above, the modules within NeMo Curator enable users to scale data-mining and NLP processing tasks to many nodes within a compute cluster.
 The modules accomplish this using [Dask](https://www.dask.org/) with [cuDF](https://docs.rapids.ai/api/cudf/nightly/user_guide/10min/) (for the GPU-accelerated modules).
-At the core of the NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-acclerated exact and fuzzy deduplication.
+At the core of the NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-acclerated exact and fuzzy deduplication.
@@ -21,4 +21,4 @@ While NVIDIA currently does not have a bug bounty program, we do offer acknowled
 
 ## NVIDIA Product Security
 
-For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
+For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
@@ -1,11 +1,11 @@
 download_module: nemo_curator.download.arxiv.ArxivDownloader
 download_params: {}
 iterator_module: nemo_curator.download.arxiv.ArxivIterator
-iterator_params: 
+iterator_params:
   log_frequency: 1000
 extract_module: nemo_curator.download.arxiv.ArxivExtractor
 extract_params: {}
 format:
   text: str
   id: str
-  source_id: str
+  source_id: str
@@ -9,4 +9,4 @@ format:
   language: str
   url: str
   warc_id: str
-  source_id: str
+  source_id: str
@@ -1,7 +1,7 @@
 input_field: text
 filters:
   # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
-  # This particular cascade of filters is intended to filter Python code data. 
+  # This particular cascade of filters is intended to filter Python code data.
   # The filter listed at the top will be applied first, and the following filters will be applied in
   # the order they appear in this file. Each filter can be removed and re-ordered as desired.
   # Change this based on the language of the data
 
@@ -1,7 +1,7 @@
 input_field: text
 filters:
   # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
-  # This particular cascade of filters is intended to filter English language data. 
+  # This particular cascade of filters is intended to filter English language data.
   # The filter listed at the top will be applied first, and the following filters will be applied in
   # the order they appear in this file. Each filter can be removed and re-ordered as desired.
   - name: nemo_curator.filters.heuristic_filter.NonAlphaNumericFilter
@@ -14,16 +14,16 @@ filters:
     params:
       max_number_to_text_ratio: 0.15
   - name: nemo_curator.filters.heuristic_filter.UrlsFilter
-    params: 
+    params:
       max_url_to_text_ratio: 0.2
   - name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
-    params: 
+    params:
       max_white_space_ratio: 0.25
   - name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
-    params: 
+    params:
       max_parentheses_ratio: 0.1
   - name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
-    params: 
+    params:
       remove_if_at_top_or_bottom: True
       max_boilerplate_string_ratio: 0.4
   - name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter
@@ -46,18 +46,18 @@ filters:
     params:
       max_num_sentences_without_endmark_ratio: 0.85
   - name: nemo_curator.filters.heuristic_filter.WordsWithoutAlphabetsFilter
-    params: 
+    params:
       min_words_with_alphabets: 0.8
   - name: nemo_curator.filters.heuristic_filter.CommonEnglishWordsFilter
     params:
       min_num_common_words: 2
       stop_at_false: True
   - name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
     params:
-      max_mean_word_length: 10 
+      max_mean_word_length: 10
       min_mean_word_length: 3
   - name: nemo_curator.filters.heuristic_filter.LongWordFilter
-    params: 
+    params:
       max_word_length: 1000
   - name: nemo_curator.filters.heuristic_filter.EllipsisFilter
     params:
@@ -102,4 +102,4 @@ filters:
       max_repeating_duplicate_ngram_ratio: 0.10
   - name: nemo_curator.filters.heuristic_filter.BulletsFilter
     params:
-      max_bullet_lines_ratio: 0.9
+      max_bullet_lines_ratio: 0.9
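
The comments in `config/heuristic_filter_en.yaml` describe a chain of heuristic filters applied to each document in the order listed. A minimal sketch of that idea follows. It is not NeMo Curator's implementation: the scoring functions `url_ratio` and `white_space_ratio`, the `FILTER_CHAIN` list, and `keep_document` are hypothetical names introduced for illustration, and only the two threshold values (0.2 and 0.25) are taken from the config above.

```python
import re

# Hypothetical scoring functions; the real filters may compute these differently.
def url_ratio(text: str) -> float:
    """Fraction of characters that belong to URL-like substrings."""
    if not text:
        return 0.0
    url_chars = sum(len(match) for match in re.findall(r"https?://\S+", text))
    return url_chars / len(text)


def white_space_ratio(text: str) -> float:
    """Fraction of characters that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


# Ordered like the YAML: the first entry is applied first.
# Thresholds mirror max_url_to_text_ratio and max_white_space_ratio.
FILTER_CHAIN = [
    ("max_url_to_text_ratio", url_ratio, 0.2),
    ("max_white_space_ratio", white_space_ratio, 0.25),
]


def keep_document(text: str) -> bool:
    """Return False as soon as any filter in the chain rejects the document."""
    for _name, score_fn, max_ratio in FILTER_CHAIN:
        if score_fn(text) > max_ratio:
            return False
    return True
```

Because the chain stops at the first rejection, reordering or removing entries changes which filter rejects a document but not which documents survive the whole chain.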