Commit e654281

Fix a few bugs in fuzzy dedup and docs (NVIDIA#156)
* Fix arg type

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix arg in docs

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
1 parent 9c50fb0 commit e654281

2 files changed: +2 -2 lines changed

docs/user-guide/gpudeduplication.rst

+1 -1
@@ -154,7 +154,7 @@ steps (all scripts are included in the :code:`nemo_curator/scripts/` subdirector

 # same as `python connected_components.py`
 gpu_connected_component \
- --jaccard-pairs_path /path/to/dedup_output/jaccard_similarity_results.parquet \
+ --jaccard-pairs-path /path/to/dedup_output/jaccard_similarity_results.parquet \
 --output-dir /path/to/dedup_output \
 --cache-dir /path/to/cc_cache \
 --jaccard-threshold 0.8
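
Note on the docs fix: the sketch below illustrates why the old spelling would not have worked, assuming the script registers the option as `--jaccard-pairs-path` (the corrected docs imply this, but the parser definition is not shown in this diff). The parser here is a hypothetical one-option stand-in, not the script's real `attach_args`, and the parquet path is a placeholder.

```python
import argparse

# Hypothetical stand-in parser: only the one option under discussion is modeled.
parser = argparse.ArgumentParser()
parser.add_argument("--jaccard-pairs-path", type=str)

# Correct spelling: argparse stores the value under dest `jaccard_pairs_path`
# (dashes in the flag become underscores in the attribute name only).
ok, ok_extra = parser.parse_known_args(
    ["--jaccard-pairs-path", "/path/to/pairs.parquet"]
)

# Old docs spelling: the underscore makes this a different, unknown flag.
# parse_known_args collects it as a leftover instead of exiting.
bad, bad_extra = parser.parse_known_args(
    ["--jaccard-pairs_path", "/path/to/pairs.parquet"]
)

print(ok.jaccard_pairs_path, ok_extra)
print(bad.jaccard_pairs_path, bad_extra)
```

With plain `parse_args`, the leftover flag would instead end the run with an "unrecognized arguments" error, which is presumably what users copying the old docs command would have seen.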

nemo_curator/scripts/fuzzy_deduplication/connected_components.py

+1 -1
@@ -67,7 +67,7 @@ def attach_args(parser=None):
     )
     parser.add_argument(
         "--jaccard-threshold",
-        type=int,
+        type=float,
         default=0.8,
         help="Jaccard threshold below which we don't consider documents"
         " to be duplicate",
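
Note on the type fix: a minimal sketch of the behavior difference, using a hypothetical helper `parse_threshold` that models only this one option. With `type=int`, argparse calls `int("0.8")` on the command-line string, which fails, so an explicit fractional threshold is rejected. The non-string default of `0.8` is not passed through `type`, which is why the bug only shows up when the flag is given explicitly.

```python
import argparse


def parse_threshold(value_type, argv):
    """Parse --jaccard-threshold with the given type; return None if argparse rejects it.

    Hypothetical helper for illustration; not part of the repository.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--jaccard-threshold", type=value_type, default=0.8)
    try:
        return parser.parse_args(argv).jaccard_threshold
    except SystemExit:
        # argparse reports the conversion error on stderr and exits.
        return None


# Before the fix: an explicit fractional value cannot be converted by int().
old_explicit = parse_threshold(int, ["--jaccard-threshold", "0.8"])

# Before the fix: omitting the flag still yields the float default unchanged.
old_default = parse_threshold(int, [])

# After the fix: the same command-line value parses as a float.
new_explicit = parse_threshold(float, ["--jaccard-threshold", "0.8"])

print(old_explicit, old_default, new_explicit)
```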
