Commit 1c73ade

Add pre-commit style checks (#14)
* Updates for pre-commit CI tests, add black, isort and other pre commit configs

Signed-off-by: Ayush Dattagupta <[email protected]>

* Fix circular imports

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Add copyright & update py_version to 310

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
1 parent 2cd02f3 commit 1c73ade

147 files changed: +7255 -5910 lines changed

.github/workflows/test.yml

Lines changed: 0 additions & 2 deletions
@@ -40,5 +40,3 @@ jobs:
 # TODO: Remove env variable when gpu dependencies are optional
 run: |
   RAPIDS_NO_INITIALIZE=1 python -m pytest -v --cpu
-
-

.pre-commit-config.yaml

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+default_language_version:
+  python: python3
+
+ci:
+  autofix_prs: true
+  autoupdate_commit_msg: '[pre-commit.ci] pre-commit suggestions'
+  autoupdate_schedule: quarterly
+
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.5.0
+    hooks:
+      - id: check-added-large-files
+        args: ['--maxkb=1000']
+      - id: check-case-conflict
+      - id: check-yaml
+      - id: detect-private-key
+      - id: end-of-file-fixer
+      - id: requirements-txt-fixer
+      - id: trailing-whitespace
+
+  - repo: https://github.com/psf/black
+    rev: 24.3.0
+    hooks:
+      - id: black
+        name: Format code
+
+  - repo: https://github.com/PyCQA/isort
+    rev: 5.13.2
+    hooks:
+      - id: isort
+        name: Format imports
+        exclude: docs/
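
Many hunks in this commit pair a removed line with an added line that looks identical. That pattern is consistent with the `trailing-whitespace` and `end-of-file-fixer` hooks configured above rewriting the files. The sketch below approximates the combined effect of those two hooks in plain Python. The function name `normalize_whitespace` is hypothetical, and this is not the hooks' actual implementation, only an illustration of the kind of change they make.

```python
def normalize_whitespace(text: str) -> str:
    """Approximate the effect of trailing-whitespace plus end-of-file-fixer."""
    # Strip trailing spaces and tabs from every line.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    # Drop blank lines at the end of the file...
    while lines and lines[-1] == "":
        lines.pop()
    # ...and terminate a non-empty file with exactly one newline.
    return "\n".join(lines) + "\n" if lines else ""


# A line with a trailing space and extra blank lines at the end of the file.
before = "iterator_params: \n  log_frequency: 1000\n\n\n"
after = normalize_whitespace(before)
print(repr(after))
```

In a rendered diff the `before` and `after` versions of the first line differ only by an invisible trailing space, which is why the removed and added lines can appear the same.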

.style.yapf

Lines changed: 0 additions & 3 deletions
This file was deleted.

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@ We use ``black`` as our style guide. To fix your format run `pip install pre-com
 1. Minimize the use of ``**kwargs``.
 1. ``RaiseError`` is preferred to ``assert``. Write: ```if X: raise Error``` instead of ```assert X```.
 1. Classes are preferred to standalone methods.
-1. Methods should be atomic. A method shouldn't be longer than 75 lines, e.g. can be fit into the computer screen without scrolling.
+1. Methods should be atomic. A method shouldn't be longer than 88 lines, e.g. can be fit into the computer screen without scrolling.
 1. If a method has arguments that don't fit into one line, each argument should be in its own line for readability.
 1. Add ``__init__.py`` for every folder.
 1. F-strings are prefered to formatted strings.
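
Two of the guidelines in this hunk, raising an explicit error instead of using `assert` and preferring f-strings, can be illustrated with a small sketch. The function `validate_ratio` and its parameter names are hypothetical and do not come from the NeMo Curator code base.

```python
def validate_ratio(name: str, value: float) -> float:
    """Return value if it lies in [0, 1], otherwise raise ValueError."""
    # Guideline: raise an explicit error instead of `assert 0 <= value <= 1`;
    # assert statements are skipped when Python runs with the -O flag.
    if not 0.0 <= value <= 1.0:
        # Guideline: build the message with an f-string.
        raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return value


print(validate_ratio("max_url_to_text_ratio", 0.2))
```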

README.md

Lines changed: 4 additions & 4 deletions
@@ -14,7 +14,7 @@ We currently support the following data-curation modules. For more details on ea
 - [Text reformatting and cleaning](docs/user-guide/LanguageIdentificationUnicodeFormatting.rst)
   - Fix unicode decoding errors via [ftfy](https://ftfy.readthedocs.io/en/latest/)
 - [Quality filtering](docs/user-guide/QualityFiltering.rst)
-  - Multilingual heuristic-based filtering
+  - Multilingual heuristic-based filtering
   - Classifier-based filtering via [fastText](https://fasttext.cc/)
 - [Document-level deduplication](docs/user-guide/GpuDeduplication.rst)
   - Both exact and fuzzy deduplication are accelerated using cuDF and Dask.
@@ -79,7 +79,7 @@ Note: This is not the only way to run NeMo Curator on SLURM. There are example s

 ## Module Ablation and Compute Performance

-The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
+The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
 in a scalable manner. In order to assess the quality of the Common Crawl documents curated by the modules in NeMo Curator, we performed a series
 of ablation experiments in which we trained a 357M-parameter GPT-style model on the datasets resulting from the different stages of our data curation
 pipeline implemented in NeMo Curator. The figure below demonstrates that the different data curation modules implemented within NeMo Curator
@@ -89,7 +89,7 @@ lead to improved model zero-shot downstream task performance.
 <img src="./docs/user-guide/images/zeroshot_ablations.png" alt="drawing" width="700"/>
 </p>

-In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.
+In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.

 Additionally, using the CPU-based modules the table below shows the time required and resulting data size reduction of each step of processing the [Common Crawl snapshot from November/December of 2020](https://commoncrawl.org/2020/12/nov-dec-2020-crawl-archive-now-available/) using 30 CPU nodes (with hardware similar to the `c5.24xlarge` [Amazon AWS C5 instance](https://aws.amazon.com/ec2/instance-types/c5/)):

@@ -128,4 +128,4 @@ Additionally, using the CPU-based modules the table below shows the time require

 As mentioned above, the modules within NeMo Curator enable users to scale data-mining and NLP processing tasks to many nodes within a compute cluster.
 The modules accomplish this using [Dask](https://www.dask.org/) with [cuDF](https://docs.rapids.ai/api/cudf/nightly/user_guide/10min/) (for the GPU-accelerated modules).
-At the core of the NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-acclerated exact and fuzzy deduplication.
+At the core of the NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-acclerated exact and fuzzy deduplication.

SECURITY.md

Lines changed: 1 addition & 1 deletion
@@ -21,4 +21,4 @@ While NVIDIA currently does not have a bug bounty program, we do offer acknowled

 ## NVIDIA Product Security

-For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
+For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security

config/arxiv_builder.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,11 +1,11 @@
 download_module: nemo_curator.download.arxiv.ArxivDownloader
 download_params: {}
 iterator_module: nemo_curator.download.arxiv.ArxivIterator
-iterator_params:
+iterator_params:
   log_frequency: 1000
 extract_module: nemo_curator.download.arxiv.ArxivExtractor
 extract_params: {}
 format:
   text: str
   id: str
-  source_id: str
+  source_id: str
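
The `download_module`, `iterator_module` and `extract_module` values in this config are dotted Python paths. One common way to turn such a string into a class is through `importlib`; the sketch below assumes that pattern and is not taken from NeMo Curator itself. The helper name `resolve_dotted_path` is hypothetical, and a standard-library class stands in for the NeMo Curator classes so the example runs without that package installed.

```python
import importlib


def resolve_dotted_path(path: str):
    """Import 'package.module.Name' and return the object called Name."""
    module_name, _, attr_name = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'package.module.Name', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Stand-in for a value such as `download_module: ...` in the YAML above.
cls = resolve_dotted_path("collections.OrderedDict")
params = {}  # mirrors an empty mapping such as `download_params: {}`
instance = cls(**params)
print(type(instance).__name__)
```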

config/cc_warc_builder.yaml

Lines changed: 1 addition & 1 deletion
@@ -9,4 +9,4 @@ format:
   language: str
   url: str
   warc_id: str
-  source_id: str
+  source_id: str

config/heuristic_filter_code.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 input_field: text
 filters:
   # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
-  # This particular cascade of filters is intended to filter Python code data.
+  # This particular cascade of filters is intended to filter Python code data.
   # The filter listed at the top will be applied first, and the following filters will be applied in
   # the order they appear in this file. Each filter can be removed and re-ordered as desired.
   # Change this based on the language of the data

config/heuristic_filter_en.yaml

Lines changed: 9 additions & 9 deletions
@@ -1,7 +1,7 @@
 input_field: text
 filters:
   # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
-  # This particular cascade of filters is intended to filter English language data.
+  # This particular cascade of filters is intended to filter English language data.
   # The filter listed at the top will be applied first, and the following filters will be applied in
   # the order they appear in this file. Each filter can be removed and re-ordered as desired.
   - name: nemo_curator.filters.heuristic_filter.NonAlphaNumericFilter
@@ -14,16 +14,16 @@ filters:
     params:
       max_number_to_text_ratio: 0.15
   - name: nemo_curator.filters.heuristic_filter.UrlsFilter
-    params:
+    params:
       max_url_to_text_ratio: 0.2
   - name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
-    params:
+    params:
       max_white_space_ratio: 0.25
   - name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
-    params:
+    params:
       max_parentheses_ratio: 0.1
   - name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
-    params:
+    params:
       remove_if_at_top_or_bottom: True
       max_boilerplate_string_ratio: 0.4
   - name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter
@@ -46,18 +46,18 @@ filters:
     params:
       max_num_sentences_without_endmark_ratio: 0.85
   - name: nemo_curator.filters.heuristic_filter.WordsWithoutAlphabetsFilter
-    params:
+    params:
       min_words_with_alphabets: 0.8
   - name: nemo_curator.filters.heuristic_filter.CommonEnglishWordsFilter
     params:
       min_num_common_words: 2
       stop_at_false: True
   - name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
     params:
-      max_mean_word_length: 10
+      max_mean_word_length: 10
       min_mean_word_length: 3
   - name: nemo_curator.filters.heuristic_filter.LongWordFilter
-    params:
+    params:
       max_word_length: 1000
   - name: nemo_curator.filters.heuristic_filter.EllipsisFilter
     params:
@@ -102,4 +102,4 @@ filters:
       max_repeating_duplicate_ngram_ratio: 0.10
   - name: nemo_curator.filters.heuristic_filter.BulletsFilter
     params:
-      max_bullet_lines_ratio: 0.9
+      max_bullet_lines_ratio: 0.9
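
The comments at the top of this config describe a cascade: filters are applied in the order listed, and each can be removed or re-ordered. A minimal sketch of that idea follows. It borrows two thresholds from the config (`max_white_space_ratio: 0.25` and the mean word length bounds of 3 and 10), but the function names and the way each ratio is computed are assumptions made for illustration, not NeMo Curator's implementation.

```python
from typing import Callable, Iterable, List, Tuple

# Each entry pairs an illustrative name with a predicate; a document is kept
# only if every predicate in the list returns True.
Filter = Tuple[str, Callable[[str], bool]]


def white_space_ok(text: str, max_ratio: float = 0.25) -> bool:
    """Keep text whose share of whitespace characters is at most max_ratio."""
    if not text:
        return False
    return sum(ch.isspace() for ch in text) / len(text) <= max_ratio


def mean_word_length_ok(text: str, min_len: float = 3, max_len: float = 10) -> bool:
    """Keep text whose mean whitespace-separated word length is within bounds."""
    words = text.split()
    if not words:
        return False
    mean = sum(len(word) for word in words) / len(words)
    return min_len <= mean <= max_len


def apply_cascade(docs: Iterable[str], filters: List[Filter]) -> List[str]:
    """Apply the filters in list order and return the documents that pass all."""
    kept = []
    for doc in docs:
        # all() short-circuits, so later filters are skipped once one fails.
        if all(predicate(doc) for _, predicate in filters):
            kept.append(doc)
    return kept


cascade: List[Filter] = [
    ("white_space", white_space_ok),
    ("mean_word_length", mean_word_length_ok),
]
docs = ["The quick brown fox jumps over the lazy dog.", "a b c d e", "x" * 50]
print(apply_cascade(docs, cascade))
```

Because the cascade is an ordered list, removing or re-ordering a filter is a one-line change, which mirrors how entries in the YAML file can be rearranged.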
