@@ -58,24 +58,108 @@ steps (all scripts are included in the :code:`nemo_curator/scripts/` subdirector
2. Output: _exact_duplicates.parquet. List of exact duplicates and the document hash.
* Fuzzy Dedup
- 1. Minhashes (Compute minhashes)
-    1. Input: Data Directories
-    2. Output: minhashes.parquet for each data dir.
- 2. Buckets (Minhash Buckets/LSH)
-    1. Input: Minhash directories
-    2. Output: _buckets.parquet
- 3. Map Buckets
-    1. Input: Buckets.parquet + Data Dirs
-    2. Output: anchor_docs_with_bk.parquet
- 4. Jaccard Shuffle
-    1. Input: anchor_docs_with_bk.parquet + Data Dirs
-    2. Output: shuffled_docs.parquet
- 5. Jaccard compute
-    1. Input: Shuffled docs.parquet
-    2. Output: jaccard_similarity_results.parquet
- 6. Connected Components
-    1. Input: jaccard_similarity_results.parquet
-    2. Output: connected_components.parquet
+
+   1. Compute Minhashes
+
+      - Input: Data Directories
+      - Output: minhashes.parquet for each data directory.
+      - Example call:
+
+        .. code-block:: bash
+
+           # same as `python compute_minhashes.py`
+           # --hash-bytes selects 4- or 8-byte hashes
+           gpu_compute_minhashes \
+             --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
+             --output-minhash-dir /path/to/output_minhashes \
+             --input-json-text-field text_column_name \
+             --input-json-id-field id_column_name \
+             --minhash-length number_of_hashes \
+             --char-ngram char_ngram_size \
+             --hash-bytes 4 \
+             --seed 42 \
+             --log-dir ./
+             # --scheduler-file /path/to/file.json
+
+
+   2. Buckets (Minhash LSH)
+
+      - Input: Minhash directories
+      - Output: _buckets.parquet
+      - Example call:
+
+        .. code-block:: bash
+
+           # same as `python minhash_lsh.py`
+           # --buckets-per-shuffle takes a value between 1 and num_bands;
+           # higher is better but may run out of memory
+           minhash_buckets \
+             --input-data-dirs /path/to/output_minhashes/dir1 /path/to/output_minhashes/dir2 \
+             --output-bucket-dir /path/to/dedup_output \
+             --input-minhash-field _minhash_signature \
+             --input-json-id-field id_column_name \
+             --minhash-length number_of_hashes \
+             --num-bands num_bands \
+             --buckets-per-shuffle 1 \
+             --log-dir ./
+             # --scheduler-file /path/to/file.json
+
+   3. Jaccard Map Buckets
+
+      - Input: _buckets.parquet + Data Dirs
+      - Output: anchor_docs_with_bk.parquet
+      - Example call:
+
+        .. code-block:: bash
+
+           # same as `python map_buckets.py`
+           jaccard_map_buckets \
+             --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
+             --input-bucket-dir /path/to/dedup_output/_buckets.parquet \
+             --output-dir /path/to/dedup_output \
+             --input-json-text-field text_column_name \
+             --input-json-id-field id_column_name
+             # --scheduler-file /path/to/file.json
+
+   4. Jaccard Shuffle
+
+      - Input: anchor_docs_with_bk.parquet + Data Dirs
+      - Output: shuffled_docs.parquet
+      - Example call:
+
+        .. code-block:: bash
+
+           # same as `python jaccard_shuffle.py`
+           jaccard_shuffle \
+             --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
+             --input-bucket-mapping-dir /path/to/dedup_output/anchor_docs_with_bk.parquet \
+             --output-dir /path/to/dedup_output \
+             --input-json-text-field text_column_name \
+             --input-json-id-field id_column_name
+             # --scheduler-file /path/to/file.json
+
+   5. Jaccard Compute
+
+      - Input: shuffled_docs.parquet
+      - Output: jaccard_similarity_results.parquet
+      - Example call:
+
+        .. code-block:: bash
+
+           # same as `python jaccard_compute.py`
+           jaccard_compute \
+             --shuffled-docs-path /path/to/dedup_output/shuffled_docs.parquet \
+             --output-dir /path/to/dedup_output \
+             --ngram-size char_ngram_size_for_similarity
+             # --scheduler-file /path/to/file.json
+
+   6. Connected Components
+
+      - Input: jaccard_similarity_results.parquet
+      - Output: connected_components.parquet
+      - Example call:
+
+        .. code-block:: bash
+
+           # same as `python connected_components.py`
+           gpu_connected_component \
+             --jaccard-pairs-path /path/to/dedup_output/jaccard_similarity_results.parquet \
+             --output-dir /path/to/dedup_output \
+             --cache-dir /path/to/cc_cache \
+             --jaccard-threshold 0.8
+             # --scheduler-file /path/to/file.json
+
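To build intuition for steps 1 and 2, here is a minimal, self-contained Python sketch of the MinHash + LSH banding idea the pipeline is built on. This is illustrative only, not the GPU implementation: the hash count, n-gram size, and band count are arbitrary stand-ins for the ``--minhash-length``, ``--char-ngram``, and ``--num-bands`` parameters above.

```python
# Illustrative sketch of MinHash + LSH banding -- NOT the NeMo Curator
# GPU implementation; parameter values stand in for the CLI flags above.
import hashlib


def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Set of overlapping character n-grams (cf. --char-ngram)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 8, n: int = 5,
                      seed: int = 42) -> list[int]:
    """One minimum per seeded hash function (cf. --minhash-length)."""
    grams = char_ngrams(text, n)
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{h}:{g}".encode()).digest()[:4], "big")
            for g in grams
        )
        for h in range(num_hashes)
    ]


def lsh_buckets(sig: list[int], num_bands: int = 4) -> list[tuple]:
    """Split the signature into bands (cf. --num-bands); each band is a bucket key."""
    rows = len(sig) // num_bands
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(num_bands)]


docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumps over the lazy dog",  # exact copy of d1
    "d3": "an entirely unrelated sentence about cooking pasta",
}
buckets: dict = {}
for doc_id, text in docs.items():
    for key in lsh_buckets(minhash_signature(text)):
        buckets.setdefault(key, []).append(doc_id)

# Documents sharing any bucket become candidate duplicate pairs for the
# later Jaccard-similarity stages.
candidates = sorted({tuple(ids) for ids in buckets.values() if len(ids) > 1})
print(candidates)  # [('d1', 'd2')]
```

More bands make a single-band match easier, so similar documents are less likely to be missed, at the cost of more candidate pairs to verify in the Jaccard stages.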
In addition to the scripts, there are examples in the `examples` directory that showcase using the Python modules
directly in your own code. It also includes examples of how to remove documents from the corpus using the lists of duplicate IDs generated from exact or fuzzy
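As a rough sketch of that removal step, the following uses plain Python as a stand-in for the actual module API; the ``"id"`` field name and the pre-built set of duplicate IDs (one survivor kept per duplicate group) are assumptions, not the real output schema.

```python
# Hypothetical sketch (not the nemo_curator API) of dropping documents
# whose IDs appear in the deduplication output.
import io
import json

# Stand-in for a JSONL corpus file; each record carries an "id" field.
corpus_jsonl = io.StringIO(
    '{"id": "doc-1", "text": "first document"}\n'
    '{"id": "doc-2", "text": "first document"}\n'
    '{"id": "doc-3", "text": "a unique document"}\n'
)
duplicate_ids = {"doc-2"}  # IDs flagged for removal by the pipeline

kept = []
for line in corpus_jsonl:
    record = json.loads(line)
    if record["id"] not in duplicate_ids:
        kept.append(record)

print([d["id"] for d in kept])  # ['doc-1', 'doc-3']
```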