-
-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 494: Add repair logic for split lines to extract_xri script #515
Conversation
To help you understand the logic, here's sample logs from running it against this dummy file: python -m silnlp.common.extract_xri simple.tsv swa ngq test -output /tmp/xri -log output.log id source target split
0 source0 tar
get0 train
1 sour
ce1 tar
ge
t1 train
2 source2 target2 train
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mmartin9684-sil)
b2f2789
to
3f3a287
Compare
Note I've rebased the PR and updated the repair logger based to follow what Damien outlined in this comment: #514 (review) UPDATE: I messed up the rebasing and this commit got moved to #517. I don't think it will have any effect on the git history because I use "Rebase and Merge" option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mmartin9684-sil)
f4601bb
to
24b02d4
Compare
3f3a287
to
75e9e2f
Compare
This PR relates to Michael's request on issue #494 for split lines:
The PR addresses the stretch goal part in that it repairs split lines, for example if the input tsv was this mess:
then it will process it as if it was this tsv:
The first request about discarding records that can't be processed hasn't been implemented. Currently the script would just crash in situations where it can't parse something rather than gracefully ignoring it. An example situation would be if there were less than 4 cells in a row. But I have at added a simple warning that would detect some of these cases.
The logic is a bit complex, you'll want to have had your morning coffee before reviewing. Basically it's like a fold that works its way through the rows accumulating the first line with any broken lines that follow. When it gets to a new line, it commits what it's accumulated so far and starts again.
Like usual, this PR is built off the back of another one that is waiting to merge.
This change is