Commit 1c73ade

Add pre-commit style checks (NVIDIA#14)
* Updates for pre-commit CI tests; add black, isort, and other pre-commit configs
* Fix circular imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Add copyright header and update py_version to 310

Co-authored-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Parent: 2cd02f3

File tree: 147 files changed, +7255 −5910 lines

Most of the hunks below are mechanical fixes applied by the new hooks: trailing whitespace stripped and final newlines added. In those hunks the removed and added lines differ only in invisible whitespace, so they render identically; such files are marked "whitespace only".


.github/workflows/test.yml (−2, trailing blank lines removed)

@@ -40,5 +40,3 @@ jobs:
       # TODO: Remove env variable when gpu dependencies are optional
       run: |
         RAPIDS_NO_INITIALIZE=1 python -m pytest -v --cpu
-
-

.pre-commit-config.yaml (+47, new file)

@@ -0,0 +1,47 @@
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+default_language_version:
+  python: python3
+
+ci:
+  autofix_prs: true
+  autoupdate_commit_msg: '[pre-commit.ci] pre-commit suggestions'
+  autoupdate_schedule: quarterly
+
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.5.0
+    hooks:
+      - id: check-added-large-files
+        args: ['--maxkb=1000']
+      - id: check-case-conflict
+      - id: check-yaml
+      - id: detect-private-key
+      - id: end-of-file-fixer
+      - id: requirements-txt-fixer
+      - id: trailing-whitespace
+
+  - repo: https://github.com/psf/black
+    rev: 24.3.0
+    hooks:
+      - id: black
+        name: Format code
+
+  - repo: https://github.com/PyCQA/isort
+    rev: 5.13.2
+    hooks:
+      - id: isort
+        name: Format imports
+        exclude: docs/
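With this config in place, contributors can run the same checks locally using the standard pre-commit CLI (these are stock pre-commit commands, not specific to this repo):

    pip install pre-commit
    pre-commit install          # run the hooks automatically on each git commit
    pre-commit run --all-files  # check the entire tree once

This mirrors what pre-commit.ci does on pull requests, per the ci: block above.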

.style.yapf (−3, file deleted)

CONTRIBUTING.md (+1 −1)

@@ -52,7 +52,7 @@ We use ``black`` as our style guide. To fix your format run `pip install pre-com
 1. Minimize the use of ``**kwargs``.
 1. ``RaiseError`` is preferred to ``assert``. Write: ```if X: raise Error``` instead of ```assert X```.
 1. Classes are preferred to standalone methods.
-1. Methods should be atomic. A method shouldn't be longer than 75 lines, e.g. can be fit into the computer screen without scrolling.
+1. Methods should be atomic. A method shouldn't be longer than 88 lines, e.g. can be fit into the computer screen without scrolling.
 1. If a method has arguments that don't fit into one line, each argument should be in its own line for readability.
 1. Add ``__init__.py`` for every folder.
 1. F-strings are prefered to formatted strings.
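The raise-over-assert rule in the context lines above is easiest to see side by side; a minimal Python illustration (the variable name is arbitrary):

    # Preferred: an explicit check that always runs and carries a message
    if num_workers <= 0:
        raise ValueError(f"num_workers must be positive, got {num_workers}")

    # Avoided: assert statements are stripped entirely when Python runs with -O
    # assert num_workers > 0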

README.md (+4 −4, whitespace only)

@@ -14,7 +14,7 @@ We currently support the following data-curation modules. For more details on ea
 - [Text reformatting and cleaning](docs/user-guide/LanguageIdentificationUnicodeFormatting.rst)
   - Fix unicode decoding errors via [ftfy](https://ftfy.readthedocs.io/en/latest/)
 - [Quality filtering](docs/user-guide/QualityFiltering.rst)
-  - Multilingual heuristic-based filtering
+  - Multilingual heuristic-based filtering
   - Classifier-based filtering via [fastText](https://fasttext.cc/)
 - [Document-level deduplication](docs/user-guide/GpuDeduplication.rst)
   - Both exact and fuzzy deduplication are accelerated using cuDF and Dask.

@@ -79,7 +79,7 @@ Note: This is not the only way to run NeMo Curator on SLURM. There are example s

 ## Module Ablation and Compute Performance

-The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
+The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
 in a scalable manner. In order to assess the quality of the Common Crawl documents curated by the modules in NeMo Curator, we performed a series
 of ablation experiments in which we trained a 357M-parameter GPT-style model on the datasets resulting from the different stages of our data curation
 pipeline implemented in NeMo Curator. The figure below demonstrates that the different data curation modules implemented within NeMo Curator

@@ -89,7 +89,7 @@ lead to improved model zero-shot downstream task performance.
 <img src="./docs/user-guide/images/zeroshot_ablations.png" alt="drawing" width="700"/>
 </p>

-In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.
+In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.

 Additionally, using the CPU-based modules the table below shows the time required and resulting data size reduction of each step of processing the [Common Crawl snapshot from November/December of 2020](https://commoncrawl.org/2020/12/nov-dec-2020-crawl-archive-now-available/) using 30 CPU nodes (with hardware similar to the `c5.24xlarge` [Amazon AWS C5 instance](https://aws.amazon.com/ec2/instance-types/c5/)):

@@ -128,4 +128,4 @@ Additionally, using the CPU-based modules the table below shows the time require

 As mentioned above, the modules within NeMo Curator enable users to scale data-mining and NLP processing tasks to many nodes within a compute cluster.
 The modules accomplish this using [Dask](https://www.dask.org/) with [cuDF](https://docs.rapids.ai/api/cudf/nightly/user_guide/10min/) (for the GPU-accelerated modules).
-At the core of the NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-acclerated exact and fuzzy deduplication.
+At the core of the NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-acclerated exact and fuzzy deduplication.
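Since `DocumentDataset` is described in this README hunk as a simple wrapper around a Dask dataframe, ordinary Dask operations remain available on it. A minimal sketch (the `.df` attribute name is an assumption, not something this diff confirms):

    from nemo_curator.datasets import DocumentDataset

    books = DocumentDataset.read_json("books_dataset/")  # lazy, Dask-backed read
    num_docs = len(books.df)  # `.df` assumed to expose the underlying Dask dataframe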

SECURITY.md (+1 −1, whitespace only)

@@ -21,4 +21,4 @@ While NVIDIA currently does not have a bug bounty program, we do offer acknowled

 ## NVIDIA Product Security

-For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
+For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security

config/arxiv_builder.yaml (+2 −2, whitespace only)

@@ -1,11 +1,11 @@
 download_module: nemo_curator.download.arxiv.ArxivDownloader
 download_params: {}
 iterator_module: nemo_curator.download.arxiv.ArxivIterator
-iterator_params:
+iterator_params:
   log_frequency: 1000
 extract_module: nemo_curator.download.arxiv.ArxivExtractor
 extract_params: {}
 format:
   text: str
   id: str
-  source_id: str
+  source_id: str

config/cc_warc_builder.yaml (+1 −1, whitespace only)

@@ -9,4 +9,4 @@ format:
   language: str
   url: str
   warc_id: str
-  source_id: str
+  source_id: str

config/heuristic_filter_code.yaml (+1 −1, whitespace only)

@@ -1,7 +1,7 @@
 input_field: text
 filters:
   # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
-  # This particular cascade of filters is intended to filter Python code data.
+  # This particular cascade of filters is intended to filter Python code data.
   # The filter listed at the top will be applied first, and the following filters will be applied in
   # the order they appear in this file. Each filter can be removed and re-ordered as desired.
   # Change this based on the language of the data

config/heuristic_filter_en.yaml (+9 −9, whitespace only)

@@ -1,7 +1,7 @@
 input_field: text
 filters:
   # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
-  # This particular cascade of filters is intended to filter English language data.
+  # This particular cascade of filters is intended to filter English language data.
   # The filter listed at the top will be applied first, and the following filters will be applied in
   # the order they appear in this file. Each filter can be removed and re-ordered as desired.
 - name: nemo_curator.filters.heuristic_filter.NonAlphaNumericFilter

@@ -14,16 +14,16 @@ filters:
   params:
     max_number_to_text_ratio: 0.15
 - name: nemo_curator.filters.heuristic_filter.UrlsFilter
-  params:
+  params:
     max_url_to_text_ratio: 0.2
 - name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
-  params:
+  params:
     max_white_space_ratio: 0.25
 - name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
-  params:
+  params:
     max_parentheses_ratio: 0.1
 - name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
-  params:
+  params:
     remove_if_at_top_or_bottom: True
     max_boilerplate_string_ratio: 0.4
 - name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter

@@ -46,18 +46,18 @@ filters:
   params:
     max_num_sentences_without_endmark_ratio: 0.85
 - name: nemo_curator.filters.heuristic_filter.WordsWithoutAlphabetsFilter
-  params:
+  params:
     min_words_with_alphabets: 0.8
 - name: nemo_curator.filters.heuristic_filter.CommonEnglishWordsFilter
   params:
     min_num_common_words: 2
     stop_at_false: True
 - name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
   params:
-    max_mean_word_length: 10
+    max_mean_word_length: 10
     min_mean_word_length: 3
 - name: nemo_curator.filters.heuristic_filter.LongWordFilter
-  params:
+  params:
     max_word_length: 1000
 - name: nemo_curator.filters.heuristic_filter.EllipsisFilter
   params:

@@ -102,4 +102,4 @@ filters:
     max_repeating_duplicate_ngram_ratio: 0.10
 - name: nemo_curator.filters.heuristic_filter.BulletsFilter
   params:
-    max_bullet_lines_ratio: 0.9
+    max_bullet_lines_ratio: 0.9

config/heuristic_filter_non-en.yaml (+9 −9, whitespace only)

@@ -1,7 +1,7 @@
 input_field: text
 filters:
   # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
-  # This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
+  # This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
   # The filter listed at the top will be applied first, and the following filters will be applied in
   # the order they appear in this file. Each filter can be removed and re-ordered as desired.
 - name: nemo_curator.filters.heuristic_filter.SymbolsToWordsFilter

@@ -11,16 +11,16 @@ filters:
   params:
     max_number_to_text_ratio: 0.15
 - name: nemo_curator.filters.heuristic_filter.UrlsFilter
-  params:
+  params:
     max_url_to_text_ratio: 0.2
 - name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
-  params:
+  params:
     max_white_space_ratio: 0.25
 - name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
-  params:
+  params:
     max_parentheses_ratio: 0.1
 - name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
-  params:
+  params:
     remove_if_at_top_or_bottom: True
     max_boilerplate_string_ratio: 0.4
 - name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter

@@ -39,17 +39,17 @@ filters:
   params:
     min_words: 50
     max_words: 100000
-  # NOTE: This filter tends to remove many documents and will need to
+  # NOTE: This filter tends to remove many documents and will need to
   # be tuned per language
 - name: nemo_curator.filters.heuristic_filter.PunctuationFilter
   params:
     max_num_sentences_without_endmark_ratio: 0.85
 - name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
   params:
-    max_mean_word_length: 10
+    max_mean_word_length: 10
     min_mean_word_length: 3
 - name: nemo_curator.filters.heuristic_filter.LongWordFilter
-  params:
+  params:
     max_word_length: 1000
 - name: nemo_curator.filters.heuristic_filter.EllipsisFilter
   params:

@@ -94,4 +94,4 @@ filters:
     max_repeating_duplicate_ngram_ratio: 0.10
 - name: nemo_curator.filters.heuristic_filter.BulletsFilter
   params:
-    max_bullet_lines_ratio: 0.9
+    max_bullet_lines_ratio: 0.9

config/lm_tasks.yaml (+1 −1, whitespace only)

@@ -1,6 +1,6 @@
 tasks:
   # The Python modules below define language model downstream evaluation
-  # task data. If one of the below tasks is specified, N-grams will
+  # task data. If one of the below tasks is specified, N-grams will
   # be constructed from the documents that make up the task data
   # using the script prepare_task_data.
   # find_matching_ngrams will then search for these N-grams

config/pii_config.yaml (+1 −1, whitespace only)

@@ -13,4 +13,4 @@ pii_config:
     #type: 'hash'
     #hash_type: 'sha256'

-    #type: 'redact'
+    #type: 'redact'

config/wikipedia_builder.yaml (+1 −1, whitespace only)

@@ -12,4 +12,4 @@ format:
   id: str
   url: str
   language: str
-  source_id: str
+  source_id: str

docs/user-guide/CPUvsGPU.rst (+1 −1, whitespace only)

@@ -95,4 +95,4 @@ Every SLURM cluster is different, so make sure you understand how your SLURM clu
 ``start-slurm.sh`` calls ``containter-entrypoint.sh`` which sets up a Dask scheduler and workers across the cluster.

 Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` to run on multiple nodes.
-You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.
+You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.
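The ``get_client`` / ``add_distributed_args`` pattern mentioned above can be sketched as follows; the module paths and exact signatures are assumptions based on the names in this doc, not confirmed by the diff:

    import argparse

    from nemo_curator.utils.distributed_utils import get_client
    from nemo_curator.utils.script_utils import add_distributed_args

    def main():
        parser = argparse.ArgumentParser(description="Example curation script")
        # Assumed to attach scheduler/device/worker flags to the parser
        parser = add_distributed_args(parser)
        args = parser.parse_args()

        # Connects to a local cluster or to the scheduler started by start-slurm.sh
        client = get_client(args, args.device)
        # ... run curation steps on the Dask cluster ...
        client.close()

    if __name__ == "__main__":
        main()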

docs/user-guide/DistributedDataClassification.rst (+2 −2, whitespace only)

@@ -8,7 +8,7 @@ Background

 When preparing text data to be used in training a large language model (LLM), it is useful to classify
 text documents in various ways, to enhance the LLM's performance by making it able to produce more
-contextually appropriate and accurate language across various subjects. NeMo Curator provides this module to
+contextually appropriate and accurate language across various subjects. NeMo Curator provides this module to
 help a user run inference with pre-trained models on large amounts of text documents. We achieve
 this by chunking the datasets across multiple computing nodes, each equipped with multiple GPUs, to
 accelerate the classification task in a distributed way. In other words, because the classification of

@@ -68,4 +68,4 @@ The key differences is that it operates on the GPU instead of the CPU.
 Therefore, the Dask cluster must be started as a GPU one.
 And, ``DomainClassifier`` requires ``DocumentDataset`` to be on the GPU (i.e., have ``backend=cudf``).
 It is easy to extend ``DistributedDataClassifier`` to your own model.
-Check out ``nemo_curator.modules.distributed_data_classifier.py`` for reference.
+Check out ``nemo_curator.modules.distributed_data_classifier.py`` for reference.
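The context lines above state the requirements precisely: a GPU Dask cluster and a cudf-backed ``DocumentDataset``. A sketch of the dataset side (the ``backend`` argument is an assumption; the classifier construction is left commented because its real signature lives in ``distributed_data_classifier.py``):

    from nemo_curator.datasets import DocumentDataset

    # A cudf-backed (GPU) dataset, as ``DomainClassifier`` requires
    dataset = DocumentDataset.read_json("input_docs/", backend="cudf")

    # Hypothetical usage; check distributed_data_classifier.py for the
    # actual constructor arguments (model, labels, batch size, ...):
    # classifier = DomainClassifier(...)
    # labeled = classifier(dataset)
    # labeled.to_json("labeled_docs/")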

docs/user-guide/DocumentDataset.rst (+4 −4, whitespace only)

@@ -48,7 +48,7 @@ You could read, filter the dataset, and write it using the following methods
         text_field="text",
         score_field="word_count",
     )
-
+
     long_books = filter_step(books)

     long_books.to_json("long_books/", write_to_filename=True)

@@ -106,7 +106,7 @@ Consider a modified version of the code above:
         text_field="text",
         score_field="word_count",
     )
-
+
     long_books = filter_step(books)

     long_books.to_json("long_books/", write_to_filename=True)

@@ -130,10 +130,10 @@ In these cases, we recommend processing the input dataset in batches using a sim
         text_field="text",
         score_field="word_count",
     )
-
+
     long_books = filter_step(books)

     long_books.to_json("long_books/", write_to_filename=True)

 This will read in 64 shards at a time, process them, and write them back to disk.
-Like ``get_remaining_files``, it only includes files that are in the input directory and not in the output directory.
+Like ``get_remaining_files``, it only includes files that are in the input directory and not in the output directory.
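The fragments in these hunks come from a read → filter → write example; a fuller sketch consistent with them (``WordCountFilter``, ``ScoreFilter``, and the ``read_json`` arguments are inferred from the visible lines and module paths elsewhere in this commit, so treat them as assumptions):

    from nemo_curator import ScoreFilter
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.filters import WordCountFilter

    # Read a sharded JSONL dataset, keeping filenames so writes can mirror the layout
    books = DocumentDataset.read_json("books_dataset/", add_filename=True)

    # Score each document by word count and keep only the long ones
    filter_step = ScoreFilter(
        WordCountFilter(min_words=80),  # threshold chosen for illustration
        text_field="text",
        score_field="word_count",
    )

    long_books = filter_step(books)
    long_books.to_json("long_books/", write_to_filename=True)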
