Skip to content

SemDedup outputs docs_to_remove + All dedup modules extend BaseDeduplicationModule#617

Merged
praateekmahajan merged 8 commits intoNVIDIA-NeMo:mainfrom
praateekmahajan:praateek/unify-dedups
Apr 2, 2025
Merged

SemDedup outputs docs_to_remove + All dedup modules extend BaseDeduplicationModule#617
praateekmahajan merged 8 commits intoNVIDIA-NeMo:mainfrom
praateekmahajan:praateek/unify-dedups

Conversation

@praateekmahajan
Copy link
Contributor

Description

Closes #516

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@praateekmahajan praateekmahajan changed the title SemDedup outputs docs_to_remove + All dedup modules extend BaseDeduplicationModule #615 SemDedup outputs docs_to_remove + All dedup modules extend BaseDeduplicationModule Apr 1, 2025
@praateekmahajan praateekmahajan requested review from VibhuJawa, ayushdg and sarahyurick and removed request for VibhuJawa April 1, 2025 23:16
@sarahyurick sarahyurick added the gpuci Run GPU CI/CD on PR label Apr 1, 2025
Copy link
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing was changed from #615 right? Assuming yes and that GPU CI passes, then LGTM.

Copy link
Contributor

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for cleaning up and aligning all these classes. This is helping us reduce our tech debt a lot.

Left a minor review to add tests for both code paths of removing and adding these ids.

Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
@praateekmahajan praateekmahajan force-pushed the praateek/unify-dedups branch from 7a794a7 to e905655 Compare April 2, 2025 18:09
Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
@ayushdg ayushdg added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Apr 2, 2025
Signed-off-by: Praateek <praateekm@gmail.com>
Copy link
Contributor

@Maghoumi Maghoumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the tutorial changes. Look good to me.

config=semdedup_config,
input_column="text",
id_column="id",
perform_removal=True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's very neat! Was this recently added to the SemDedup API?

Also, do other dedups offer a similar feature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being added to SemDedup in this PR itself, and after this PR all dedups will offer perform_removal (see #516)

Signed-off-by: Praateek <praateekm@gmail.com>
Copy link
Contributor

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Praateek <praateekm@gmail.com>
@praateekmahajan praateekmahajan merged commit 2a186d5 into NVIDIA-NeMo:main Apr 2, 2025
3 of 5 checks passed
jnke2016 pushed a commit to jnke2016/Curator that referenced this pull request Nov 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gpuci Run GPU CI/CD on PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unifying Deduplication API Modules

6 participants