Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 494: Add deduplication logic to extract_xri script #518

Merged
merged 1 commit into from
Sep 19, 2024

Conversation

rminsil
Copy link
Collaborator

@rminsil rminsil commented Sep 13, 2024

This PR adds duplication detection and deduplication logic to the script motivated by this requirement from comment #494 (comment):

  • Duplicate LWC entries
    • Log instances at error level

Duplicate entries are identified and logged at error level and the script removes duplicates, choosing the match that has target sentence closest in length to the source sentence.


Example tsv:

id	source	target	split
0	duplicate-source1	duplicate-target-1  	brain
2	duplicate-source1	!	dev
3	duplicate-source2	short-1  	brain
1	duplicate-source1	duplicate-target-2	dev
4	duplicate-source2	duplicate-target-2	dev
5	duplicate-source3	duplicate-target-2	dev
6	Duplicate-source3	duplicate-target-2	dev

Note that there's two pairs of duplicates:

  • id's 0 and 1
  • id's 3 and 4

This leads to logs:

INFO - Starting deduplication stage
WARNING - 2 duplicate sentence pairs found with source: 'duplicate-source1'. Id's are [0, 1]. Id 0 chosen with target: 'duplicate-target-1'
WARNING - 2 duplicate sentence pairs found with source: 'duplicate-source2'. Id's are [3, 4]. Id 4 chosen with target: 'duplicate-target-2'
INFO - Finished deduplication stage. 6 pairs passed in. 4 pairs returned. 2 duplicate pairs removed.

4 is closer in length, so it's chosen over 3.
0 and 1 tie, so 0 is chosen as it's earlier.

I deliberately made the duplicates non-contiguous to show that contiguity (?) is not required.

Note that id's 5 and 6 aren't duplicates because of case sensitivity. I wasn't sure if case sensitivity is a concept that generalizes well across all languages in a way that's well defined. So I kept the logic case sensitive.


This change is Reviewable

@ddaspit
Copy link
Collaborator

ddaspit commented Sep 13, 2024

@mmartin9684-sil Are you good with this deduplication strategy?

@rminsil rminsil linked an issue Sep 19, 2024 that may be closed by this pull request
@rminsil rminsil force-pushed the issue-494-add-basic-filtering-and-cleaning branch from f5386d6 to fc2a851 Compare September 19, 2024 05:19
Base automatically changed from issue-494-add-basic-filtering-and-cleaning to master September 19, 2024 05:24
@rminsil
Copy link
Collaborator Author

rminsil commented Sep 19, 2024

Hi @ddaspit , @mmartin9684-sil - there are PR's going to back up behind this one, so I'll merge it now and then process any feedback you guys have in subsequent PR's.

@rminsil rminsil force-pushed the issue-494-remove-duplicate-translations branch from 99dc80c to 93b2bdc Compare September 19, 2024 05:27
@rminsil rminsil merged commit 51f97aa into master Sep 19, 2024
1 check was pending
@rminsil rminsil deleted the issue-494-remove-duplicate-translations branch September 19, 2024 05:27
@rminsil
Copy link
Collaborator Author

rminsil commented Oct 3, 2024

@mmartin9684-sil Are you good with this deduplication strategy?

Michael mentioned in slack that he likes the automatic deduplication so I guess it's working for him so far. If there's issues with how it's picking the right match, we can address those under #546

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add data transformation and cleaning to XRI etl script
2 participants