Issue 494: Add deduplication logic to extract_xri script #518

rminsil · 2024-09-13T07:02:54Z

This PR adds duplication detection and deduplication logic to the script motivated by this requirement from comment #494 (comment):

Duplicate LWC entries
- Log instances at error level

Duplicate entries are identified and logged at error level and the script removes duplicates, choosing the match that has target sentence closest in length to the source sentence.

Example tsv:

id	source	target	split
0	duplicate-source1	duplicate-target-1  	brain
2	duplicate-source1	!	dev
3	duplicate-source2	short-1  	brain
1	duplicate-source1	duplicate-target-2	dev
4	duplicate-source2	duplicate-target-2	dev
5	duplicate-source3	duplicate-target-2	dev
6	Duplicate-source3	duplicate-target-2	dev

Note that there's two pairs of duplicates:

id's 0 and 1
id's 3 and 4

This leads to logs:

INFO - Starting deduplication stage
WARNING - 2 duplicate sentence pairs found with source: 'duplicate-source1'. Id's are [0, 1]. Id 0 chosen with target: 'duplicate-target-1'
WARNING - 2 duplicate sentence pairs found with source: 'duplicate-source2'. Id's are [3, 4]. Id 4 chosen with target: 'duplicate-target-2'
INFO - Finished deduplication stage. 6 pairs passed in. 4 pairs returned. 2 duplicate pairs removed.

4 is closer in length, so it's chosen over 3.
0 and 1 tie, so 0 is chosen as it's earlier.

I deliberately made the duplicates non-contiguous to show that contiguity (?) is not required.

Note that id's 5 and 6 aren't duplicates because of case sensitivity. I wasn't sure if case sensitivity is a concept that generalizes well across all languages in a way that's well defined. So I kept the logic case sensitive.

This change is

ddaspit · 2024-09-13T21:33:15Z

@mmartin9684-sil Are you good with this deduplication strategy?

rminsil · 2024-09-19T05:25:43Z

Hi @ddaspit , @mmartin9684-sil - there are PR's going to back up behind this one, so I'll merge it now and then process any feedback you guys have in subsequent PR's.

rminsil · 2024-10-03T12:53:55Z

@mmartin9684-sil Are you good with this deduplication strategy?

Michael mentioned in slack that he likes the automatic deduplication so I guess it's working for him so far. If there's issues with how it's picking the right match, we can address those under #546

rminsil requested a review from ddaspit September 19, 2024 05:13

rminsil linked an issue Sep 19, 2024 that may be closed by this pull request

Add data transformation and cleaning to XRI etl script #494

Closed

rminsil force-pushed the issue-494-add-basic-filtering-and-cleaning branch from f5386d6 to fc2a851 Compare September 19, 2024 05:19

Base automatically changed from issue-494-add-basic-filtering-and-cleaning to master September 19, 2024 05:24

Issue 494: Add deduplication logic to extract_xri script

Loading
Loading status checks…

93b2bdc

rminsil force-pushed the issue-494-remove-duplicate-translations branch from 99dc80c to 93b2bdc Compare September 19, 2024 05:27

rminsil merged commit 51f97aa into master Sep 19, 2024
1 check was pending

rminsil deleted the issue-494-remove-duplicate-translations branch September 19, 2024 05:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Issue 494: Add deduplication logic to extract_xri script #518

Issue 494: Add deduplication logic to extract_xri script #518

rminsil commented Sep 13, 2024 •

edited by ddaspit

Loading

Uh oh!

ddaspit commented Sep 13, 2024

Uh oh!

rminsil commented Sep 19, 2024

Uh oh!

Uh oh!

rminsil commented Oct 3, 2024

Uh oh!

Issue 494: Add deduplication logic to extract_xri script #518

Issue 494: Add deduplication logic to extract_xri script #518

Conversation

rminsil commented Sep 13, 2024 • edited by ddaspit Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ddaspit commented Sep 13, 2024

Uh oh!

rminsil commented Sep 19, 2024

Uh oh!

Uh oh!

Uh oh!

rminsil commented Oct 3, 2024

Uh oh!

rminsil commented Sep 13, 2024 •

edited by ddaspit

Loading