* Updates for pre-commit CI tests; add black, isort, and other pre-commit configs
* Fix circular imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Add copyright & update py_version to 310

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Ryan Wolf <rywolf@nvidia.com>
In terms of scalability and compute performance, using RAPIDS + Dask fuzzy deduplication, we are able to deduplicate the 1.1 trillion token Red Pajama dataset in 1.8 hours using 64 A100s.
Additionally, using the CPU-based modules, the table below shows the time required and the resulting data size reduction for each step of processing the [Common Crawl snapshot from November/December of 2020](https://commoncrawl.org/2020/12/nov-dec-2020-crawl-archive-now-available/) using 30 CPU nodes (with hardware similar to the `c5.24xlarge` [Amazon AWS C5 instance](https://aws.amazon.com/ec2/instance-types/c5/)):
As mentioned above, the modules within NeMo Curator enable users to scale data-mining and NLP processing tasks to many nodes within a compute cluster.
The modules accomplish this using [Dask](https://www.dask.org/) with [cuDF](https://docs.rapids.ai/api/cudf/nightly/user_guide/10min/) (for the GPU-accelerated modules).

At the core of NeMo Curator, `DocumentDataset` (the main dataset class) is a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-accelerated exact and fuzzy deduplication.
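For example, a minimal sketch of loading a dataset might look like the following (the input path is hypothetical, and the exact reader arguments may vary between NeMo Curator versions):

```python
from nemo_curator.datasets import DocumentDataset

# Hypothetical input: a JSONL file where each line is a record with a "text" field.
dataset = DocumentDataset.read_json("books.jsonl")

# DocumentDataset wraps a Dask dataframe, so lazy, distributed
# dataframe operations are available directly on dataset.df.
print(dataset.df.head())
```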
docs/user-guide/CPUvsGPU.rst:
``start-slurm.sh`` calls ``container-entrypoint.sh``, which sets up a Dask scheduler and workers across the cluster.
Our Python examples are designed so that they can be run locally on their own, or easily substituted into ``start-slurm.sh`` to run on multiple nodes.
You can easily adapt your own scripts too by following the pattern of adding ``get_client`` with ``add_distributed_args``.
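As a rough sketch of that pattern (the module paths and the exact ``get_client`` signature are assumptions and may differ between NeMo Curator versions):

.. code-block:: python

    import argparse

    # Assumed module paths for the helpers named above; they may move
    # between NeMo Curator versions.
    from nemo_curator.utils.distributed_utils import get_client
    from nemo_curator.utils.script_utils import add_distributed_args


    def main(args):
        # Start (or connect to) a Dask cluster using the parsed
        # distributed arguments, then run your curation steps.
        client = get_client(args, args.device)
        # ... your curation pipeline goes here ...
        client.close()


    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Example curation script")
        # Attach the standard distributed/Dask flags (device, scheduler
        # address, number of workers, etc.) to the parser.
        parser = add_distributed_args(parser)
        main(parser.parse_args())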