Issue 494: Add repair logic for split lines to extract_xri script #515

Merged
merged 1 commit into from
Sep 19, 2024

Conversation

rminsil
Collaborator

@rminsil rminsil commented Sep 12, 2024

This PR relates to Michael's request on issue #494 for split lines:

Filtering or correcting split sentences

  • Discard sentence pairs that don't follow the expected format
  • Stretch goal would be to re-assemble and retain sentence pairs that are split over 2 lines because of an embedded newline

This PR addresses the stretch goal: it repairs split lines. For example, if the input TSV were this mess:

id	source	target	split

0	source0	tar
get0	train
1	sour
ce1	tar
ge

t1	train
2	source2	target2	train

then the script will process it as if it were this TSV:

id	source	target	split
0	source0	target0	train
1	source1	target1	train
2	source2	target2	train

The first request, discarding records that can't be processed, hasn't been implemented. Currently the script will simply crash when it can't parse something rather than gracefully ignoring it, for example when a row has fewer than 4 cells. I have at least added a simple warning that detects some of these cases.
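For illustration only, here is a minimal sketch of how the unimplemented "discard" request could look. It is not what this PR does (the PR only logs a warning), and the names `split_valid_rows`, `EXPECTED_CELLS` and `LOGGER` are hypothetical, not taken from `extract_xri`.

```python
import logging
from typing import List, Tuple

LOGGER = logging.getLogger(__name__)

# Hypothetical constant: id, source, target, split
EXPECTED_CELLS = 4


def split_valid_rows(
    rows: List[List[str]], expected: int = EXPECTED_CELLS
) -> Tuple[List[List[str]], List[List[str]]]:
    """Partition rows into (valid, invalid) by cell count instead of crashing."""
    valid = [row for row in rows if len(row) == expected]
    invalid = [row for row in rows if len(row) != expected]
    if invalid:
        # Warn once with a count, rather than failing on the first bad row.
        LOGGER.warning(
            "Not all rows contain %d cells; discarding %d row(s)",
            expected,
            len(invalid),
        )
    return valid, invalid
```

A filter like this would most likely run after the repair step, so that rows which can be re-assembled are not thrown away first.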

The logic is a bit complex, so you'll want to have had your morning coffee before reviewing. Basically, it works like a fold over the rows, accumulating the first line of a record together with any broken lines that follow. When it reaches a line that starts a new record, it commits what it has accumulated so far and starts again.
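The fold described above can be sketched roughly as follows. This is a simplified illustration inferred from the description and the debug log, not the code in the PR: the names `parse_id` and `repair_rows` are hypothetical, and it assumes that a row starts a new record exactly when its first cell parses as an integer id, and that the header row has already been removed.

```python
from typing import List, Optional


def parse_id(cell: str) -> Optional[int]:
    """Return the integer id, or None if the cell is not a valid id."""
    try:
        return int(cell)
    except ValueError:
        return None


def repair_rows(rows: List[List[str]]) -> List[List[str]]:
    """Re-assemble records that were split over several lines."""
    repaired: List[List[str]] = []
    current: Optional[List[str]] = None  # record being accumulated
    for row in rows:
        if not row:
            # Empty line: nothing to merge.
            continue
        if parse_id(row[0]) is not None:
            # A new record starts here, so commit the previous one.
            if current is not None:
                repaired.append(current)
            current = list(row)
        elif current is not None:
            # Broken row: glue its first cell onto the last accumulated
            # cell, then append any remaining cells.
            current = current[:-1] + [current[-1] + row[0]] + row[1:]
        # A broken row before any valid record is silently dropped.
    if current is not None:
        repaired.append(current)
    return repaired
```

One limitation of this heuristic is that a continuation fragment which happens to begin with a number-only cell would be mistaken for the start of a new record.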

As usual, this PR is built on top of another one that is waiting to merge.



@rminsil
Collaborator Author

rminsil commented Sep 12, 2024

To help you understand the logic, here are sample logs from running it against this dummy file:

python -m silnlp.common.extract_xri simple.tsv swa ngq test -output /tmp/xri -log output.log
id	source	target	split

0	source0	tar
get0	train
1	sour
ce1	tar
ge

t1	train
2	source2	target2	train
2024-09-12 20:46:07,338,338 __main__ INFO Starting script
2024-09-12 20:46:07,338,338 __main__ INFO Loading sentence pairs
2024-09-12 20:46:07,338,338 __main__ DEBUG Opening file
2024-09-12 20:46:07,338,338 __main__ DEBUG Determining column indexes
2024-09-12 20:46:07,338,338 __main__ DEBUG Column id found at index 0
2024-09-12 20:46:07,338,338 __main__ DEBUG Column source found at index 1
2024-09-12 20:46:07,338,338 __main__ DEBUG Column target found at index 2
2024-09-12 20:46:07,338,338 __main__ DEBUG Column split found at index 3
2024-09-12 20:46:07,338,338 __main__ DEBUG Column indexes: id=0 source=1 target=2 split=3
2024-09-12 20:46:07,338,338 __main__ DEBUG Checking all rows contain 4 cells
2024-09-12 20:46:07,338,338 __main__ WARNING Not all rows contain 4 cells
2024-09-12 20:46:07,339,339 repair INFO Repair starting
2024-09-12 20:46:07,339,339 repair DEBUG Empty line detected at row index=0
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=1 id=0: ['0', 'source0', 'tar']
2024-09-12 20:46:07,339,339 repair DEBUG Can't parse id cell: 'get0' - assuming row is broken
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=2 id=None: ['get0', 'train']
2024-09-12 20:46:07,339,339 repair DEBUG Merging row 2 with data from previous row(s)
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=3 id=1: ['1', 'sour']
2024-09-12 20:46:07,339,339 repair DEBUG New row starting at index=3, committing previous row ['0', 'source0', 'target0', 'train']
2024-09-12 20:46:07,339,339 repair DEBUG Can't parse id cell: 'ce1' - assuming row is broken
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=4 id=None: ['ce1', 'tar']
2024-09-12 20:46:07,339,339 repair DEBUG Merging row 4 with data from previous row(s)
2024-09-12 20:46:07,339,339 repair DEBUG Can't parse id cell: 'ge' - assuming row is broken
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=5 id=None: ['ge']
2024-09-12 20:46:07,339,339 repair DEBUG Merging row 5 with data from previous row(s)
2024-09-12 20:46:07,339,339 repair DEBUG Empty line detected at row index=6
2024-09-12 20:46:07,339,339 repair DEBUG Can't parse id cell: 't1' - assuming row is broken
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=7 id=None: ['t1', 'train']
2024-09-12 20:46:07,339,339 repair DEBUG Merging row 7 with data from previous row(s)
2024-09-12 20:46:07,339,339 repair DEBUG Examining row index=8 id=2: ['2', 'source2', 'target2', 'train']
2024-09-12 20:46:07,339,339 repair DEBUG New row starting at index=8, committing previous row ['1', 'source1', 'target1', 'train']
2024-09-12 20:46:07,339,339 repair INFO Repair complete
2024-09-12 20:46:07,339,339 __main__ DEBUG Loading row 0 into sentence pair structure: ['0', 'source0', 'target0', 'train']
2024-09-12 20:46:07,339,339 __main__ DEBUG Loading row 1 into sentence pair structure: ['1', 'source1', 'target1', 'train']
2024-09-12 20:46:07,339,339 __main__ DEBUG Loading row 2 into sentence pair structure: ['2', 'source2', 'target2', 'train']
2024-09-12 20:46:07,339,339 __main__ INFO Creating extract files
2024-09-12 20:46:07,339,339 __main__ INFO Using specified output directory
2024-09-12 20:46:07,339,339 __main__ INFO Outputting to directory: /tmp/xri
2024-09-12 20:46:07,339,339 __main__ DEBUG Creating source and target files for suffix: 'all'
2024-09-12 20:46:07,339,339 __main__ DEBUG Writing 3 sentences to file: /tmp/xri/swa-test.all.txt
2024-09-12 20:46:07,339,339 __main__ DEBUG Writing 3 sentences to file: /tmp/xri/ngq-test.all.txt
2024-09-12 20:46:07,339,339 __main__ DEBUG Creating source and target files for suffix: 'train'
2024-09-12 20:46:07,339,339 __main__ DEBUG Writing 3 sentences to file: /tmp/xri/swa-test.train.txt
2024-09-12 20:46:07,339,339 __main__ DEBUG Writing 3 sentences to file: /tmp/xri/ngq-test.train.txt
2024-09-12 20:46:07,339,339 __main__ DEBUG Creating source and target files for suffix: 'val'
2024-09-12 20:46:07,339,339 __main__ DEBUG Writing 0 sentences to file: /tmp/xri/swa-test.val.txt
2024-09-12 20:46:07,340,340 __main__ DEBUG Writing 0 sentences to file: /tmp/xri/ngq-test.val.txt
2024-09-12 20:46:07,340,340 __main__ DEBUG Creating source and target files for suffix: 'test'
2024-09-12 20:46:07,340,340 __main__ DEBUG Writing 0 sentences to file: /tmp/xri/swa-test.test.txt
2024-09-12 20:46:07,340,340 __main__ DEBUG Writing 0 sentences to file: /tmp/xri/ngq-test.test.txt
2024-09-12 20:46:07,340,340 __main__ INFO Completed script

@rminsil rminsil linked an issue Sep 12, 2024 that may be closed by this pull request
Collaborator

@ddaspit ddaspit left a comment

:lgtm:

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mmartin9684-sil)

@rminsil
Collaborator Author

rminsil commented Sep 13, 2024

Note that I've rebased the PR and updated the repair logger to follow what Damien outlined in this comment: #514 (review)

UPDATE: I messed up the rebasing and this commit got moved to #517. I don't think it will have any effect on the git history because I use the "Rebase and Merge" option.

Collaborator

@ddaspit ddaspit left a comment

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mmartin9684-sil)

@rminsil rminsil force-pushed the issue-473-add-detailed-logging branch from f4601bb to 24b02d4 Compare September 19, 2024 05:04
@rminsil rminsil force-pushed the issue-494-detect-and-repair-split-sentences branch from 3f3a287 to 75e9e2f Compare September 19, 2024 05:10
@rminsil rminsil changed the base branch from issue-473-add-detailed-logging to master September 19, 2024 05:10
@rminsil rminsil merged commit 1b98cca into master Sep 19, 2024
1 check was pending
@rminsil rminsil deleted the issue-494-detect-and-repair-split-sentences branch September 19, 2024 05:11
Development

Successfully merging this pull request may close these issues.

Add data transformation and cleaning to XRI etl script
2 participants