Commit 1c73ade

Add pre-commit style checks (NVIDIA#14)
* Updates for pre-commit CI tests; add black, isort, and other pre-commit configs
* Fix circular imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Add copyright header and update py_version to 310

Co-authored-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Parent: 2cd02f3

File tree: 147 files changed, +7255 −5910 lines

Most of the hunks below are mechanical fixes applied by the new hooks: trailing whitespace stripped and final newlines added. In those hunks the removed and added lines differ only in invisible whitespace, so they render identically; such files are marked "whitespace only".


.github/workflows/test.yml (−2, trailing blank lines removed)

@@ -40,5 +40,3 @@ jobs:
       # TODO: Remove env variable when gpu dependencies are optional
       run: |
         RAPIDS_NO_INITIALIZE=1 python -m pytest -v --cpu
-
-

.pre-commit-config.yaml (+47, new file)

@@ -0,0 +1,47 @@
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+default_language_version:
+  python: python3
+
+ci:
+  autofix_prs: true
+  autoupdate_commit_msg: '[pre-commit.ci] pre-commit suggestions'
+  autoupdate_schedule: quarterly
+
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.5.0
+    hooks:
+      - id: check-added-large-files
+        args: ['--maxkb=1000']
+      - id: check-case-conflict
+      - id: check-yaml
+      - id: detect-private-key
+      - id: end-of-file-fixer
+      - id: requirements-txt-fixer
+      - id: trailing-whitespace
+
+  - repo: https://github.com/psf/black
+    rev: 24.3.0
+    hooks:
+      - id: black
+        name: Format code
+
+  - repo: https://github.com/PyCQA/isort
+    rev: 5.13.2
+    hooks:
+      - id: isort
+        name: Format imports
+        exclude: docs/
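With this config in place, contributors can run the same checks locally using the standard pre-commit CLI (these are stock pre-commit commands, not specific to this repo):

    pip install pre-commit
    pre-commit install          # run the hooks automatically on each git commit
    pre-commit run --all-files  # check the entire tree once

This mirrors what pre-commit.ci does on pull requests, per the ci: block above.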

.style.yapf (−3, file deleted)

CONTRIBUTING.md (+1 −1)

@@ -52,7 +52,7 @@ We use ``black`` as our style guide. To fix your format run `pip install pre-com
 1. Minimize the use of ``**kwargs``.
 1. ``RaiseError`` is preferred to ``assert``. Write: ```if X: raise Error``` instead of ```assert X```.
 1. Classes are preferred to standalone methods.
-1. Methods should be atomic. A method shouldn't be longer than 75 lines, e.g. can be fit into the computer screen without scrolling.
+1. Methods should be atomic. A method shouldn't be longer than 88 lines, e.g. can be fit into the computer screen without scrolling.
 1. If a method has arguments that don't fit into one line, each argument should be in its own line for readability.
 1. Add ``__init__.py`` for every folder.
 1. F-strings are prefered to formatted strings.
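The raise-over-assert rule in the context lines above is easiest to see side by side; a minimal Python illustration (the variable name is arbitrary):

    # Preferred: an explicit check that always runs and carries a message
    if num_workers <= 0:
        raise ValueError(f"num_workers must be positive, got {num_workers}")

    # Avoided: assert statements are stripped entirely when Python runs with -O
    # assert num_workers > 0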

README.md (+4 −4, whitespace only)

@@ -14,7 +14,7 @@ We currently support the following data-curation modules. For more details on ea
 - [Text reformatting and cleaning](docs/user-guide/LanguageIdentificationUnicodeFormatting.rst)
   - Fix unicode decoding errors via [ftfy](https://ftfy.readthedocs.io/en/latest/)
 - [Quality filtering](docs/user-guide/QualityFiltering.rst)
-  - Multilingual heuristic-based filtering
+  - Multilingual heuristic-based filtering
   - Classifier-based filtering via [fastText](https://fasttext.cc/)
 - [Document-level deduplication](docs/user-guide/GpuDeduplication.rst)
   - Both exact and fuzzy deduplication are accelerated using cuDF and Dask.

@@ -79,7 +79,7 @@ Note: This is not the only way to run NeMo Curator on SLURM. There are example s

 ## Module Ablation and Compute Performance

-The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
+The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so
 in a scalable manner. In order to assess the quality of the Common Crawl documents curated by the modules in NeMo Curator, we performed a series
 of ablation experiments in which we trained a 357M-parameter GPT-style model on the datasets resulting from the different stages of our data curation
 pipeline implemented in NeMo Curator. The figure below demonstrates that the different data curation modules implemented within NeMo Curator

@@ -89,7 +89,7 @@ lead to improved model zero-shot downstream task performance.
 <img src="./docs/user-guide/images/zeroshot_ablations.png" alt="drawing" width="700"/>
 </p>

-In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.
+In terms of scalability and compute performance, using the RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours using 64 A100s.

 Additionally, using the CPU-based modules the table below shows the time required and resulting data size reduction of each step of processing the [Common Crawl snapshot from November/December of 2020](https://commoncrawl.org/2020/12/nov-dec-2020-crawl-archive-now-available/) using 30 CPU nodes (with hardware similar to the `c5.24xlarge` [Amazon AWS C5 instance](https://aws.amazon.com/ec2/instance-types/c5/)):

@@ -128,4 +128,4 @@ Additionally, using the CPU-based modules the table below shows the time require

 As mentioned above, the modules within NeMo Curator enable users to scale data-mining and NLP processing tasks to many nodes within a compute cluster.
 The modules accomplish this using [Dask](https://www.dask.org/) with [cuDF](https://docs.rapids.ai/api/cudf/nightly/user_guide/10min/) (for the GPU-accelerated modules).
-At the core of the NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-acclerated exact and fuzzy deduplication.
+At the core of the NeMo Curator, `DocumentDataset` (the main dataset class) is just a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-acclerated exact and fuzzy deduplication.
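Since `DocumentDataset` is described in this README hunk as a simple wrapper around a Dask dataframe, ordinary Dask operations remain available on it. A minimal sketch (the `.df` attribute name is an assumption, not something this diff confirms):

    from nemo_curator.datasets import DocumentDataset

    books = DocumentDataset.read_json("books_dataset/")  # lazy, Dask-backed read
    num_docs = len(books.df)  # `.df` assumed to expose the underlying Dask dataframe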

SECURITY.md (+1 −1, whitespace only)

@@ -21,4 +21,4 @@ While NVIDIA currently does not have a bug bounty program, we do offer acknowled

 ## NVIDIA Product Security

-For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security
+For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security

config/arxiv_builder.yaml (+2 −2, whitespace only)

@@ -1,11 +1,11 @@
 download_module: nemo_curator.download.arxiv.ArxivDownloader
 download_params: {}
 iterator_module: nemo_curator.download.arxiv.ArxivIterator
-iterator_params:
+iterator_params:
   log_frequency: 1000
 extract_module: nemo_curator.download.arxiv.ArxivExtractor
 extract_params: {}
 format:
   text: str
   id: str
-  source_id: str
+  source_id: str

config/cc_warc_builder.yaml (+1 −1, whitespace only)

@@ -9,4 +9,4 @@ format:
   language: str
   url: str
   warc_id: str
-  source_id: str
+  source_id: str

config/heuristic_filter_code.yaml (+1 −1, whitespace only)

@@ -1,7 +1,7 @@
 input_field: text
 filters:
   # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
-  # This particular cascade of filters is intended to filter Python code data.
+  # This particular cascade of filters is intended to filter Python code data.
   # The filter listed at the top will be applied first, and the following filters will be applied in
   # the order they appear in this file. Each filter can be removed and re-ordered as desired.
   # Change this based on the language of the data

config/heuristic_filter_en.yaml (+9 −9, whitespace only)

@@ -1,7 +1,7 @@
 input_field: text
 filters:
   # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
-  # This particular cascade of filters is intended to filter English language data.
+  # This particular cascade of filters is intended to filter English language data.
   # The filter listed at the top will be applied first, and the following filters will be applied in
   # the order they appear in this file. Each filter can be removed and re-ordered as desired.
 - name: nemo_curator.filters.heuristic_filter.NonAlphaNumericFilter

@@ -14,16 +14,16 @@ filters:
   params:
     max_number_to_text_ratio: 0.15
 - name: nemo_curator.filters.heuristic_filter.UrlsFilter
-  params:
+  params:
     max_url_to_text_ratio: 0.2
 - name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
-  params:
+  params:
     max_white_space_ratio: 0.25
 - name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
-  params:
+  params:
     max_parentheses_ratio: 0.1
 - name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
-  params:
+  params:
     remove_if_at_top_or_bottom: True
     max_boilerplate_string_ratio: 0.4
 - name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter

@@ -46,18 +46,18 @@ filters:
   params:
     max_num_sentences_without_endmark_ratio: 0.85
 - name: nemo_curator.filters.heuristic_filter.WordsWithoutAlphabetsFilter
-  params:
+  params:
     min_words_with_alphabets: 0.8
 - name: nemo_curator.filters.heuristic_filter.CommonEnglishWordsFilter
   params:
     min_num_common_words: 2
     stop_at_false: True
 - name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
   params:
-    max_mean_word_length: 10
+    max_mean_word_length: 10
     min_mean_word_length: 3
 - name: nemo_curator.filters.heuristic_filter.LongWordFilter
-  params:
+  params:
     max_word_length: 1000
 - name: nemo_curator.filters.heuristic_filter.EllipsisFilter
   params:

@@ -102,4 +102,4 @@ filters:
     max_repeating_duplicate_ngram_ratio: 0.10
 - name: nemo_curator.filters.heuristic_filter.BulletsFilter
   params:
-    max_bullet_lines_ratio: 0.9
+    max_bullet_lines_ratio: 0.9

config/heuristic_filter_non-en.yaml (+9 −9, whitespace only)

@@ -1,7 +1,7 @@
 input_field: text
 filters:
   # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
-  # This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
+  # This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
   # The filter listed at the top will be applied first, and the following filters will be applied in
   # the order they appear in this file. Each filter can be removed and re-ordered as desired.
 - name: nemo_curator.filters.heuristic_filter.SymbolsToWordsFilter

@@ -11,16 +11,16 @@ filters:
   params:
     max_number_to_text_ratio: 0.15
 - name: nemo_curator.filters.heuristic_filter.UrlsFilter
-  params:
+  params:
     max_url_to_text_ratio: 0.2
 - name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
-  params:
+  params:
     max_white_space_ratio: 0.25
 - name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
-  params:
+  params:
     max_parentheses_ratio: 0.1
 - name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
-  params:
+  params:
     remove_if_at_top_or_bottom: True
     max_boilerplate_string_ratio: 0.4
 - name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter

@@ -39,17 +39,17 @@ filters:
   params:
     min_words: 50
     max_words: 100000
-  # NOTE: This filter tends to remove many documents and will need to
+  # NOTE: This filter tends to remove many documents and will need to
   # be tuned per language
 - name: nemo_curator.filters.heuristic_filter.PunctuationFilter
   params:
     max_num_sentences_without_endmark_ratio: 0.85
 - name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
   params:
-    max_mean_word_length: 10
+    max_mean_word_length: 10
     min_mean_word_length: 3
 - name: nemo_curator.filters.heuristic_filter.LongWordFilter
-  params:
+  params:
     max_word_length: 1000
 - name: nemo_curator.filters.heuristic_filter.EllipsisFilter
   params:

@@ -94,4 +94,4 @@ filters:
     max_repeating_duplicate_ngram_ratio: 0.10
 - name: nemo_curator.filters.heuristic_filter.BulletsFilter
   params:
-    max_bullet_lines_ratio: 0.9
+    max_bullet_lines_ratio: 0.9

config/lm_tasks.yaml (+1 −1, whitespace only)

@@ -1,6 +1,6 @@
 tasks:
   # The Python modules below define language model downstream evaluation
-  # task data. If one of the below tasks is specified, N-grams will
+  # task data. If one of the below tasks is specified, N-grams will
   # be constructed from the documents that make up the task data
   # using the script prepare_task_data.
   # find_matching_ngrams will then search for these N-grams

config/pii_config.yaml (+1 −1, whitespace only)

@@ -13,4 +13,4 @@ pii_config:
     #type: 'hash'
     #hash_type: 'sha256'

-    #type: 'redact'
+    #type: 'redact'

config/wikipedia_builder.yaml (+1 −1, whitespace only)

@@ -12,4 +12,4 @@ format:
   id: str
   url: str
   language: str
-  source_id: str
+  source_id: str

docs/user-guide/CPUvsGPU.rst (+1 −1, whitespace only)

@@ -95,4 +95,4 @@ Every SLURM cluster is different, so make sure you understand how your SLURM clu
 ``start-slurm.sh`` calls ``containter-entrypoint.sh`` which sets up a Dask scheduler and workers across the cluster.

 Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` to run on multiple nodes.
-You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.
+You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.
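The ``get_client`` / ``add_distributed_args`` pattern mentioned above can be sketched as follows; the module paths and exact signatures are assumptions based on the names in this doc, not confirmed by the diff:

    import argparse

    from nemo_curator.utils.distributed_utils import get_client
    from nemo_curator.utils.script_utils import add_distributed_args

    def main():
        parser = argparse.ArgumentParser(description="Example curation script")
        # Assumed to attach scheduler/device/worker flags to the parser
        parser = add_distributed_args(parser)
        args = parser.parse_args()

        # Connects to a local cluster or to the scheduler started by start-slurm.sh
        client = get_client(args, args.device)
        # ... run curation steps on the Dask cluster ...
        client.close()

    if __name__ == "__main__":
        main()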

docs/user-guide/DistributedDataClassification.rst (+2 −2, whitespace only)

@@ -8,7 +8,7 @@ Background

 When preparing text data to be used in training a large language model (LLM), it is useful to classify
 text documents in various ways, to enhance the LLM's performance by making it able to produce more
-contextually appropriate and accurate language across various subjects. NeMo Curator provides this module to
+contextually appropriate and accurate language across various subjects. NeMo Curator provides this module to
 help a user run inference with pre-trained models on large amounts of text documents. We achieve
 this by chunking the datasets across multiple computing nodes, each equipped with multiple GPUs, to
 accelerate the classification task in a distributed way. In other words, because the classification of

@@ -68,4 +68,4 @@ The key differences is that it operates on the GPU instead of the CPU.
 Therefore, the Dask cluster must be started as a GPU one.
 And, ``DomainClassifier`` requires ``DocumentDataset`` to be on the GPU (i.e., have ``backend=cudf``).
 It is easy to extend ``DistributedDataClassifier`` to your own model.
-Check out ``nemo_curator.modules.distributed_data_classifier.py`` for reference.
+Check out ``nemo_curator.modules.distributed_data_classifier.py`` for reference.
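The context lines above state the requirements precisely: a GPU Dask cluster and a cudf-backed ``DocumentDataset``. A sketch of the dataset side (the ``backend`` argument is an assumption; the classifier construction is left commented because its real signature lives in ``distributed_data_classifier.py``):

    from nemo_curator.datasets import DocumentDataset

    # A cudf-backed (GPU) dataset, as ``DomainClassifier`` requires
    dataset = DocumentDataset.read_json("input_docs/", backend="cudf")

    # Hypothetical usage; check distributed_data_classifier.py for the
    # actual constructor arguments (model, labels, batch size, ...):
    # classifier = DomainClassifier(...)
    # labeled = classifier(dataset)
    # labeled.to_json("labeled_docs/")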

docs/user-guide/DocumentDataset.rst (+4 −4, whitespace only)

@@ -48,7 +48,7 @@ You could read, filter the dataset, and write it using the following methods
         text_field="text",
         score_field="word_count",
     )
-
+
     long_books = filter_step(books)

     long_books.to_json("long_books/", write_to_filename=True)

@@ -106,7 +106,7 @@ Consider a modified version of the code above:
         text_field="text",
         score_field="word_count",
     )
-
+
     long_books = filter_step(books)

     long_books.to_json("long_books/", write_to_filename=True)

@@ -130,10 +130,10 @@ In these cases, we recommend processing the input dataset in batches using a sim
         text_field="text",
         score_field="word_count",
     )
-
+
     long_books = filter_step(books)

     long_books.to_json("long_books/", write_to_filename=True)

 This will read in 64 shards at a time, process them, and write them back to disk.
-Like ``get_remaining_files``, it only includes files that are in the input directory and not in the output directory.
+Like ``get_remaining_files``, it only includes files that are in the input directory and not in the output directory.
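The fragments in these hunks come from a read → filter → write example; a fuller sketch consistent with them (``WordCountFilter``, ``ScoreFilter``, and the ``read_json`` arguments are inferred from the visible lines and module paths elsewhere in this commit, so treat them as assumptions):

    from nemo_curator import ScoreFilter
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.filters import WordCountFilter

    # Read a sharded JSONL dataset, keeping filenames so writes can mirror the layout
    books = DocumentDataset.read_json("books_dataset/", add_filename=True)

    # Score each document by word count and keep only the long ones
    filter_step = ScoreFilter(
        WordCountFilter(min_words=80),  # threshold chosen for illustration
        text_field="text",
        score_field="word_count",
    )

    long_books = filter_step(books)
    long_books.to_json("long_books/", write_to_filename=True)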
