
Commit fe9fd6f

Move common dedup utils and remove unused code (NVIDIA#42)
* Refactor common utils and remove unused code
* More cleanup
* More updates/shuffling
* Move gpu_dedup scripts into subfolder
* Remove gpu_deduplication subfolder
* Add readme to fuzzy dedup scripts section
* Fix typo and relative links
* Remove legacy script entrypoints
* Remove legacy scripts and add init file
* Update GpuDeduplication.rst

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
1 parent f4355af commit fe9fd6f

33 files changed, +302 -2443 lines changed

docs/user-guide/GpuDeduplication.rst

+102 -18
@@ -58,24 +58,108 @@ steps (all scripts are included in the :code:`nemo_curator/scripts/` subdirector
         2. Output: _exact_duplicates.parquet. List of exact duplicates and the document hash.
 
 * Fuzzy Dedup
-    1. Minhashes (Compute minhashes)
-        1. Input: Data Directories
-        2. Output: minhashes.parquet for each data dir.
-    2. Buckets (Minhash Buckets/LSH)
-        1. Input: Minhash directories
-        2. Output: _buckets.parquet
-    3. Map Buckets
-        1. Input: Buckets.parquet + Data Dirs
-        2. Output: anchor_docs_with_bk.parquet
-    4. Jaccard Shuffle
-        1. Input: anchor_docs_with_bk.parquet + Data Dirs
-        2. Output: shuffled_docs.parquet
-    5. Jaccard compute
-        1. Input: Shuffled docs.parquet
-        2. Output: jaccard_similarity_results.parquet
-    6. Connected Components
-        1. Input: jaccard_similarity_results.parquet
-        2. Output: connected_components.parquet
+
+    1. Compute Minhashes
+        - Input: Data Directories
+        - Output: minhashes.parquet for each data dir.
+        - Example call:
+
+        .. code-block:: bash
+
+            # same as `python compute_minhashes.py`
+            gpu_compute_minhashes \
+              --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
+              --output-minhash-dir /path/to/output_minhashes \
+              --input-json-text-field text_column_name \
+              --input-json-id-field id_column_name \
+              --minhash-length number_of_hashes \
+              --char-ngram char_ngram_size \
+              --hash-bytes 4 `# or 8 for 8-byte hashes` \
+              --seed 42 \
+              --log-dir ./
+            # --scheduler-file /path/to/file.json
+
+
+    2. Buckets (Minhash Buckets)
+        - Input: Minhash directories
+        - Output: _buckets.parquet
+        - Example call:
+
+        .. code-block:: bash
+
+            # same as `python minhash_lsh.py`
+            minhash_buckets \
+              --input-data-dirs /path/to/output_minhashes/dir1 /path/to/output_minhashes/dir2 \
+              --output-bucket-dir /path/to/dedup_output \
+              --input-minhash-field _minhash_signature \
+              --input-json-id-field id_column_name \
+              --minhash-length number_of_hashes \
+              --num-bands num_bands \
+              --buckets-per-shuffle 1 `#Value b/w [1-num_bands]. Higher is better but might lead to oom` \
+              --log-dir ./
+            # --scheduler-file /path/to/file.json
+
+    3. Jaccard Map Buckets
+        - Input: _buckets.parquet + Data Dir
+        - Output: anchor_docs_with_bk.parquet
+        - Example call:
+
+        .. code-block:: bash
+
+            # same as `python map_buckets.py`
+            jaccard_map_buckets \
+              --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
+              --input-bucket-dir /path/to/dedup_output/_buckets.parquet \
+              --output-dir /path/to/dedup_output \
+              --input-json-text-field text_column_name \
+              --input-json-id-field id_column_name \
+            # --scheduler-file /path/to/file.json
+
+    4. Jaccard Shuffle
+        - Input: anchor_docs_with_bk.parquet + Data Dir
+        - Output: shuffled_docs.parquet
+        - Example call:
+
+        .. code-block:: bash
+
+            # same as `python jaccard_shuffle.py`
+            jaccard_shuffle \
+              --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
+              --input-bucket-mapping-dir /path/to/dedup_output/anchor_docs_with_bk.parquet \
+              --output-dir /path/to/dedup_output \
+              --input-json-text-field text_column_name \
+              --input-json-id-field id_column_name \
+            # --scheduler-file /path/to/file.json
+
+    5. Jaccard Compute
+        - Input: shuffled_docs.parquet
+        - Output: jaccard_similarity_results.parquet
+        - Example call:
+
+        .. code-block:: bash
+
+            # same as `python jaccard_compute.py`
+            jaccard_compute \
+              --shuffled-docs-path /path/to/dedup_output/shuffled_docs.parquet \
+              --output-dir /path/to/dedup_output \
+              --ngram-size char_ngram_size_for_similarity \
+            # --scheduler-file /path/to/file.json
+
+    6. Connected Components
+        - Input: jaccard_similarity_results.parquet
+        - Output: connected_components.parquet
+        - Example call:
+
+        .. code-block:: bash
+
+            # same as `python connected_components.py`
+            gpu_connected_component \
+              --jaccard-pairs_path /path/to/dedup_output/jaccard_similarity_results.parquet \
+              --output-dir /path/to/dedup_output \
+              --cache-dir /path/to/cc_cache \
+              --jaccard-threshold 0.8
+            # --scheduler-file /path/to/file.json
+
 
 In addition to the scripts, there are examples in the `examples` directory that showcase using the python module
 directly in your own code. It also has examples on how to remove documents from the corpus using the list of duplicate IDs generated from exact or fuzzy
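The fuzzy dedup stages listed in this diff (minhashes, buckets, Jaccard similarity, connected components) can be illustrated with a small CPU-only Python sketch. This is not the NeMo Curator implementation. All function names are hypothetical, the seeded hash is built from `hashlib.blake2b` purely for illustration, and every pair of documents sharing a bucket is compared directly, which stands in for the separate map-buckets and shuffle steps of the scripts.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def char_ngrams(text, n=5):
    """Return the set of character n-grams of `text`."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(token, seed):
    """Illustrative seeded 32-bit hash (assumption: any seeded hash would do)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=4).digest(), "big")


def minhash_signature(text, num_hashes=64, ngram=5):
    """Step 1: one minimum hash value per seed over the document's n-grams."""
    grams = char_ngrams(text, ngram)
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def lsh_buckets(signatures, num_bands):
    """Step 2: documents whose signatures agree on a whole band share a bucket."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        rows = len(sig) // num_bands
        for band in range(num_bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


def jaccard(a, b, n=5):
    """Step 5: exact Jaccard similarity of two texts over character n-grams."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def connected_components(pairs):
    """Step 6: group ids linked by similar pairs, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())


def fuzzy_duplicates(docs, num_hashes=64, num_bands=16, ngram=5, threshold=0.8):
    """Run the sketch end to end on a dict of {doc_id: text}."""
    sigs = {i: minhash_signature(t, num_hashes, ngram) for i, t in docs.items()}
    candidates = set()
    for ids in lsh_buckets(sigs, num_bands):
        candidates.update(combinations(sorted(ids), 2))
    similar = [(a, b) for a, b in candidates
               if jaccard(docs[a], docs[b], ngram) >= threshold]
    return connected_components(similar)
```

Calling `fuzzy_duplicates({"a": text, "b": text, "c": other_text})` with two identical texts returns a single group containing `"a"` and `"b"`. The `threshold` default mirrors the `--jaccard-threshold 0.8` value in the example call, and `num_bands` plays the role of `--num-bands`.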

examples/gpu_deduplication_example/README.md

+3
@@ -1,5 +1,8 @@
 ### Deduplication Steps
 
+> [!CAUTION]
+> The examples referenced here are outdated and will be replaced with an example using the Python API directly. For more details on the scripts, refer to [nemo_curator/scripts/fuzzy_deduplication](/nemo_curator/scripts/fuzzy_deduplication).
+
 1. Exact dedup
     1. Input: Data directories
     2. Output: exact_duplicates.parquet. List of exact duplicates and the document hash.
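The exact dedup step above produces a list of duplicates together with the document hash. A minimal sketch of that idea follows, assuming SHA-256 over the raw text. The hash function the scripts actually use is not stated here, and `exact_duplicates` is a hypothetical helper name.

```python
import hashlib
from collections import defaultdict


def exact_duplicates(docs):
    """Map each content hash to the ids of documents that share it.

    `docs` is a dict of {doc_id: text}. Only hashes seen more than once
    are returned, i.e. the groups of exact duplicates.
    """
    by_hash = defaultdict(list)
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        by_hash[digest].append(doc_id)
    return {h: ids for h, ids in by_hash.items() if len(ids) > 1}
```

Keeping one id per group and dropping the rest removes the exact duplicates from the corpus.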

nemo_curator/gpu_deduplication/__init__.py

-13
This file was deleted.
