-
-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 494: Add normalization of punctuation #557
Conversation
1e9cebd
to
25c04a2
Compare
Rohan, this looks great. There's one remaining issue and I'm trying to decide how to formulate it. In this example, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just a couple of small comments.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @rminsil)
1
line 0 at r1 (raw file):
Was this file included by accident?
silnlp/common/normalize_extracts.py
line 0 at r1 (raw file):
I would put "xri" in the script name somewhere, maybe normalize_xri.py
.
silnlp/common/normalize_extracts.py
line 55 at r1 (raw file):
class PunctuationCategory(Enum): left_clinging = "left_clinging"
We use all caps for enum values.
25c04a2
to
c8137ad
Compare
@ddaspit Thanks for the review!
Yeah it was an accident. I was trying to pipe the standard error to standard output and got the symbols around the wrong way. I've removed that file now.
My impression was that this script wasn't specific to xri data though. The script is quite generic so I was hoping it would be helpful in other contexts. @mmartin9684-sil FYI
This is fixed now. |
Thanks for the feedback! It's nice to get more input from different team members now.
There's definitely more than 1 remaining issue :) The script right now is deliberately not attacking consecutive punctuation for this first phase. In my big comment #494 (comment) I talk about it under "Undefined situations / consecutive punctuation".
Addressing your comment:
So I put it into the "too hard" basket for my first iteration of the script. It will warn about it though so the user is aware of it. I put that ".," example into my PR to demonstrate that. I'm keen to fix these edge cases, but I'm first building out the main logic and getting it working for regular cases with good reporting functionality first.
I just want to clarify that the normalization process right now isn't hard-coded to particular rules for particular characters. That way it can be adjusted for different languages. The example in the PR description comes from this simple configuration:
In my big comment there's a section describing how a config file lets you specify what category characters fall into. My example only demonstrates "right clinging" characters which are the ones like comma and period that stick to the right of characters. That corresponds to what I think you meant by "end punctuation". I am writing some tests for my next PR which will show off the other kinds of rules. Note as well though double quotes can be left or right clinging depending on context, e.g.
Single quote
If you don't know which way it clings, you don't know how to adjust the spacing. Effectively for these 2 you want to know if they're acting as opening or closing quotes. In my big comment I wrote:
|
@ddaspit I'll merge the PR as I have others backing up behind it. If we decide to rename it I'll address that in a follow up PR. |
This PR is introduces the types and main happy path logic associated with normalizing punctuation.
See this comment for context: #494 (comment)
The code is full of TODO's that will be addressed in subsequent PR's. It was getting too big for one PR...
Example:
It finds 4 candidates for normalization:
.,
at index 5 - this one has two punctuation characters so it leaves it alone"! "
effectively removing the leading whitespaceThis change is