Issue 494: Add repair logic for split lines to extract_xri script #515

rminsil · 2024-09-12T11:05:24Z

This PR relates to Michael's request on issue #494 for split lines:

Filtering or correcting split sentences

Discard sentence pairs that don't follow the expected format

Stretch goal would be to re-assemble and retain sentence pairs that are split over 2 lines because of an embedded newline

The PR addresses the stretch goal: it repairs split lines. For example, if the input TSV were this mess:

id	source	target	split

0	source0	tar
get0	train
1	sour
ce1	tar
ge

t1	train
2	source2	target2	train

then it will process it as if it was this tsv:

id	source	target	split
0	source0	target0	train
1	source1	target1	train
2	source2	target2	train

The first request, about discarding records that can't be processed, hasn't been implemented. Currently the script just crashes when it can't parse something, rather than gracefully ignoring it. An example would be a row with fewer than 4 cells. But I have added a simple warning that detects some of these cases.
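For illustration only, here is a minimal sketch of what a cell-count check and a discard step could look like. The function names (`all_rows_have_expected_cells`, `partition_rows`) are hypothetical and are not taken from the PR; only the warning text matches the logs further down. The discard step in particular is an assumption about how the unimplemented first request might be approached.

```python
import logging
from typing import List, Tuple

LOGGER = logging.getLogger(__name__)

# The tsv has four columns: id, source, target, split
EXPECTED_CELLS = 4


def all_rows_have_expected_cells(rows: List[List[str]], expected: int = EXPECTED_CELLS) -> bool:
    """Warn (without failing) if any row has the wrong number of cells."""
    ok = all(len(row) == expected for row in rows)
    if not ok:
        LOGGER.warning("Not all rows contain %d cells", expected)
    return ok


def partition_rows(
    rows: List[List[str]], expected: int = EXPECTED_CELLS
) -> Tuple[List[List[str]], List[List[str]]]:
    """Hypothetical discard step: split rows into (valid, discarded)."""
    valid = [row for row in rows if len(row) == expected]
    discarded = [row for row in rows if len(row) != expected]
    return valid, discarded
```

Running a filter like `partition_rows` after the repair step would let the script skip rows that are still malformed instead of crashing on them.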

The logic is a bit complex, you'll want to have had your morning coffee before reviewing. Basically it's like a fold that works its way through the rows accumulating the first line with any broken lines that follow. When it gets to a new line, it commits what it's accumulated so far and starts again.
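The fold described above can be sketched roughly as follows. This is a simplified illustration rather than the code in the PR: the names `parse_id` and `repair_rows` are hypothetical, and it assumes the header row has already been removed, that a row is a list of tab-separated cells, and that a row starts a new record exactly when its first cell parses as an integer id.

```python
from typing import List, Optional


def parse_id(cell: str) -> Optional[int]:
    """Return the integer id in `cell`, or None if it doesn't parse as one."""
    try:
        return int(cell)
    except ValueError:
        return None


def repair_rows(rows: List[List[str]]) -> List[List[str]]:
    """Fold over the rows, merging broken rows into the row that precedes them."""
    repaired: List[List[str]] = []
    current: Optional[List[str]] = None  # the row being accumulated

    for row in rows:
        if not row or row == [""]:
            # Empty line: nothing to merge
            continue
        if parse_id(row[0]) is not None:
            # A new row is starting: commit what has been accumulated so far
            if current is not None:
                repaired.append(current)
            current = list(row)
        elif current is not None:
            # Broken row: its first cell continues the last accumulated cell,
            # and any remaining cells are appended after it
            current[-1] += row[0]
            current.extend(row[1:])
        # A fragment seen before any row has started is dropped

    # Commit the final accumulated row
    if current is not None:
        repaired.append(current)
    return repaired
```

Feeding it the messy example, with each line split on tabs, gives back the three clean rows:

```python
lines = ["", "0\tsource0\ttar", "get0\ttrain", "1\tsour", "ce1\ttar", "ge", "", "t1\ttrain", "2\tsource2\ttarget2\ttrain"]
repaired = repair_rows([line.split("\t") for line in lines])
```

One limitation of this sketch: a continuation fragment whose first cell happens to parse as an integer would be mistaken for the start of a new row.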

Like usual, this PR is built off the back of another one that is waiting to merge.


rminsil · 2024-09-12T11:06:48Z

To help you understand the logic, here are sample logs from running it against this dummy file:

python -m silnlp.common.extract_xri simple.tsv swa ngq test -output /tmp/xri -log output.log

id	source	target	split

0	source0	tar
get0	train
1	sour
ce1	tar
ge

t1	train
2	source2	target2	train

2024-09-12 20:46:07,338,338 __main__ INFO Starting script
2024-09-12 20:46:07,338,338 __main__ INFO Loading sentence pairs
2024-09-12 20:46:07,338,338 __main__ DEBUG Opening file
2024-09-12 20:46:07,338,338 __main__ DEBUG Determining column indexes
2024-09-12 20:46:07,338,338 __main__ DEBUG Column id found at index 0
2024-09-12 20:46:07,338,338 __main__ DEBUG Column source found at index 1
2024-09-12 20:46:07,338,338 __main__ DEBUG Column target found at index 2
2024-09-12 20:46:07,338,338 __main__ DEBUG Column split found at index 3
2024-09-12 20:46:07,338,338 __main__ DEBUG Column indexes: id=0 source=1 target=2 split=3
2024-09-12 20:46:07,338,338 __main__ DEBUG Checking all rows contain 4 cells
2024-09-12 20:46:07,338,338 __main__ WARNING Not all rows contain 4 cells
2024-09-12 20:46:07,339,339 repair INFO Repair starting
2024-09-12 20:46:07,339,339 repair DEBUG Empty line detected at row index=0
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=1 id=0: ['0', 'source0', 'tar']
2024-09-12 20:46:07,339,339 repair DEBUG Can't parse id cell: 'get0' - assuming row is broken
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=2 id=None: ['get0', 'train']
2024-09-12 20:46:07,339,339 repair DEBUG Merging row 2 with data from previous row(s)
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=3 id=1: ['1', 'sour']
2024-09-12 20:46:07,339,339 repair DEBUG New row starting at index=3, committing previous row ['0', 'source0', 'target0', 'train']
2024-09-12 20:46:07,339,339 repair DEBUG Can't parse id cell: 'ce1' - assuming row is broken
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=4 id=None: ['ce1', 'tar']
2024-09-12 20:46:07,339,339 repair DEBUG Merging row 4 with data from previous row(s)
2024-09-12 20:46:07,339,339 repair DEBUG Can't parse id cell: 'ge' - assuming row is broken
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=5 id=None: ['ge']
2024-09-12 20:46:07,339,339 repair DEBUG Merging row 5 with data from previous row(s)
2024-09-12 20:46:07,339,339 repair DEBUG Empty line detected at row index=6
2024-09-12 20:46:07,339,339 repair DEBUG Can't parse id cell: 't1' - assuming row is broken
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=7 id=None: ['t1', 'train']
2024-09-12 20:46:07,339,339 repair DEBUG Merging row 7 with data from previous row(s)
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=8 id=2: ['2', 'source2', 'target2', 'train']
2024-09-12 20:46:07,339,339 repair DEBUG New row starting at index=8, committing previous row ['1', 'source1', 'target1', 'train']
2024-09-12 20:46:07,339,339 repair INFO Repair complete
2024-09-12 20:46:07,339,339 __main__ DEBUG Loading row 0 into sentence pair structure: ['0', 'source0', 'target0', 'train']
2024-09-12 20:46:07,339,339 __main__ DEBUG Loading row 1 into sentence pair structure: ['1', 'source1', 'target1', 'train']
2024-09-12 20:46:07,339,339 __main__ DEBUG Loading row 2 into sentence pair structure: ['2', 'source2', 'target2', 'train']
2024-09-12 20:46:07,339,339 __main__ INFO Creating extract files
2024-09-12 20:46:07,339,339 __main__ INFO Using specified output directory
2024-09-12 20:46:07,339,339 __main__ INFO Outputting to directory: /tmp/xri
2024-09-12 20:46:07,339,339 __main__ DEBUG Creating source and target files for suffix: 'all'
2024-09-12 20:46:07,339,339 __main__ DEBUG Writing 3 sentences to file: /tmp/xri/swa-test.all.txt
2024-09-12 20:46:07,339,339 __main__ DEBUG Writing 3 sentences to file: /tmp/xri/ngq-test.all.txt
2024-09-12 20:46:07,339,339 __main__ DEBUG Creating source and target files for suffix: 'train'
2024-09-12 20:46:07,339,339 __main__ DEBUG Writing 3 sentences to file: /tmp/xri/swa-test.train.txt
2024-09-12 20:46:07,339,339 __main__ DEBUG Writing 3 sentences to file: /tmp/xri/ngq-test.train.txt
2024-09-12 20:46:07,339,339 __main__ DEBUG Creating source and target files for suffix: 'val'
2024-09-12 20:46:07,339,339 __main__ DEBUG Writing 0 sentences to file: /tmp/xri/swa-test.val.txt
2024-09-12 20:46:07,340,340 __main__ DEBUG Writing 0 sentences to file: /tmp/xri/ngq-test.val.txt
2024-09-12 20:46:07,340,340 __main__ DEBUG Creating source and target files for suffix: 'test'
2024-09-12 20:46:07,340,340 __main__ DEBUG Writing 0 sentences to file: /tmp/xri/swa-test.test.txt
2024-09-12 20:46:07,340,340 __main__ DEBUG Writing 0 sentences to file: /tmp/xri/ngq-test.test.txt
2024-09-12 20:46:07,340,340 __main__ INFO Completed script

ddaspit

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mmartin9684-sil)

rminsil · 2024-09-13T03:58:12Z

Note: I've rebased the PR and updated the repair logger to follow what Damien outlined in this comment: #514 (review)

UPDATE: I messed up the rebasing and this commit got moved to #517. I don't think it will have any effect on the git history because I use the "Rebase and Merge" option.

ddaspit

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mmartin9684-sil)

rminsil requested a review from ddaspit September 12, 2024 11:08

rminsil linked an issue Sep 12, 2024 that may be closed by this pull request

Add data transformation and cleaning to XRI etl script #494

Open

rminsil requested a review from mmartin9684-sil September 12, 2024 11:16

ddaspit approved these changes Sep 12, 2024

View reviewed changes

This was referenced Sep 13, 2024

Issue 473: change how output directory is generated for extract_xri script #507

Merged

Issue 473 Add optional debug/info logging to extract_xri script #514

Merged

rminsil force-pushed the issue-494-detect-and-repair-split-sentences branch from b2f2789 to 3f3a287 Compare September 13, 2024 03:57

ddaspit approved these changes Sep 13, 2024

View reviewed changes

rminsil force-pushed the issue-473-add-detailed-logging branch from f4601bb to 24b02d4 Compare September 19, 2024 05:04

Issue 494: Add repair logic for split lines to extract_xri script

75e9e2f

rminsil force-pushed the issue-494-detect-and-repair-split-sentences branch from 3f3a287 to 75e9e2f Compare September 19, 2024 05:10

rminsil changed the base branch from issue-473-add-detailed-logging to master September 19, 2024 05:10

rminsil merged commit 1b98cca into master Sep 19, 2024
1 check was pending

rminsil deleted the issue-494-detect-and-repair-split-sentences branch September 19, 2024 05:11
