Issue 494: Add normalization of punctuation #557

Merged
merged 2 commits into from
Oct 11, 2024

Conversation

rminsil
Collaborator

@rminsil rminsil commented Oct 10, 2024

This PR introduces the types and main happy-path logic for normalizing punctuation.

See this comment for context: #494 (comment)

The code is full of TODOs that will be addressed in subsequent PRs. It was getting too big for one PR...

Example:

# Input sentence
Hello.,there my  friend ! How are you   ?
012345678901234567890123456789012345678901
0         1         2         3         4

# Normalized
Hello.,there my friend! How are you? 

It finds 4 candidates for normalization:

  • ., at index 5 - this one has two punctuation characters so it leaves it alone
  • double space at index 15 - this gets reduced to a single space
  • exclamation point surrounded by whitespace at index 23 - this gets replaced with "! " effectively removing the leading whitespace
  • question mark with multiple leading whitespace at index 37 - all the leading whitespace is removed

python -m silnlp.common.normalize_extracts hackery --filter sentences.txt --log-level DEBUG --overwrite True

2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - INFO - Starting script
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - INFO - Output dir set to hackery
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - Searching files in input dir: 'hackery' that satisfy glob 'sentences.txt'
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - INFO - Found 1 files to normalize
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - Processing file hackery/sentences.txt
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - Outputting to hackery/sentences.norm.txt
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - Found 1 lines in file
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - Normalizing 'Hello.,there my  friend ! How are you   ?'
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - Hello.,there my  friend ! How are you   ?
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - 012345678901234567890123456789
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - 0         1         2         
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -      **
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -      (5,7)
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - Hello.,there my  friend ! How are you   ?
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - 012345678901234567890123456789
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - 0         1         2         
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -                **
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -                (15,17)
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - Hello.,there my  friend ! How are you   ?
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - 012345678901234567890123456789
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - 0         1         2         
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -                        ***
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -                        (23,26)
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - Hello.,there my  friend ! How are you   ?
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - 012345678901234567890123456789
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - 0         1         2         
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -                                      ****
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -                                      (37,41)
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - #consecutive space blocks   =1
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - #single punctuation chunks  =2
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - #multiple punctuation chunks=1
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - Writing 1 sentences to file: hackery/sentences.norm.txt
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - Finished processing hackery/sentences.txt
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - INFO - Completed script
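
The four spans in the debug log above, and the normalized sentence, can be reproduced with a small standalone sketch. This is not the code in silnlp/common/normalize_extracts.py: the regex, the function names find_candidates and normalize, and the hard-coded punctuation set are all assumptions made for illustration, and every punctuation character is treated as right clinging.

```python
import re

# Assumption: only these four characters count as punctuation, matching the
# four right clinging characters used in the example.
PUNCTUATION = ".,?!"

# A candidate is either a run that contains at least one punctuation character
# (plus any spaces touching it) or a block of two or more spaces.
CANDIDATE = re.compile(r"[ .,?!]*[.,?!][ .,?!]*| {2,}")


def find_candidates(sentence):
    """Return (start, end) spans, end exclusive, like the spans in the log."""
    return [match.span() for match in CANDIDATE.finditer(sentence)]


def normalize(sentence):
    """Happy path only: every punctuation character is right clinging."""
    pieces = []
    cursor = 0
    for start, end in find_candidates(sentence):
        chunk = sentence[start:end]
        marks = [ch for ch in chunk if ch in PUNCTUATION]
        if len(marks) == 0:
            replacement = " "             # block of spaces becomes one space
        elif len(marks) == 1:
            replacement = marks[0] + " "  # drop leading spaces, keep one trailing
        else:
            replacement = chunk           # consecutive punctuation: leave alone
        pieces.append(sentence[cursor:start] + replacement)
        cursor = end
    pieces.append(sentence[cursor:])
    return "".join(pieces)


sentence = "Hello.,there my  friend ! How are you   ?"
print(find_candidates(sentence))
print(repr(normalize(sentence)))
```

Under these assumptions the spans come out as (5,7), (15,17), (23,26) and (37,41), and the multi-punctuation chunk at index 5 passes through unchanged.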


@rminsil rminsil requested a review from ddaspit October 10, 2024 11:20
@rminsil rminsil force-pushed the issue-494-add-punctuation-normalization branch from 1e9cebd to 25c04a2 Compare October 10, 2024 11:26
@bhartmoore
Collaborator

Rohan, this looks great. There's one remaining issue and I'm trying to decide how to formulate it. In this example, Hello.,there needs to become Hello., there or Hello. , there. End punctuation, comma, period, double quotes and parentheses should not be sandwiched in among alpha characters with no whitespace. Thus a comma could appear between a word and a quotation mark, or a parenthesis could appear between a word and a period, but then there should be whitespace before the next word begins. I am thinking (this could need adjustment) that only a single ' or - should be allowed between alpha characters without whitespace adjacent to the (string of) punctuation marks.

Collaborator

@ddaspit ddaspit left a comment


Looks good. Just a couple of small comments.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @rminsil)


1 line 0 at r1 (raw file):
Was this file included by accident?


silnlp/common/normalize_extracts.py line 0 at r1 (raw file):
I would put "xri" in the script name somewhere, maybe normalize_xri.py.


silnlp/common/normalize_extracts.py line 55 at r1 (raw file):

class PunctuationCategory(Enum):
    left_clinging = "left_clinging"

We use all caps for enum values.

@rminsil rminsil linked an issue Oct 10, 2024 that may be closed by this pull request
@rminsil rminsil force-pushed the issue-494-add-punctuation-normalization branch from 25c04a2 to c8137ad Compare October 11, 2024 00:44
@rminsil
Collaborator Author

rminsil commented Oct 11, 2024

@ddaspit Thanks for the review!

Was this file included by accident?

Yeah it was an accident. I was trying to pipe the standard error to standard output and got the symbols around the wrong way. I've removed that file now.

I would put "xri" in the script name somewhere, maybe normalize_xri.py.

My impression was that this script wasn't specific to xri data though. The script is quite generic so I was hoping it would be helpful in other contexts. @mmartin9684-sil FYI

We use all caps for enum values.

This is fixed now.

@rminsil
Collaborator Author

rminsil commented Oct 11, 2024

@bhartmoore

Thanks for the feedback! It's nice to get more input from different team members now.

There's one remaining issue and I'm trying to decide how to formulate it. In this example, Hello.,there needs to become Hello., there or Hello. , there.

There's definitely more than 1 remaining issue :)

The script right now is deliberately not attacking consecutive punctuation for this first phase. In my big comment #494 (comment) I talk about it under "Undefined situations / consecutive punctuation".

Consecutive punctuation can cause some edge case situations that you can't resolve with the rules above.

For example a left clinging character followed by an unclinging character: "(-"

  • the left clinging character requires no space between it and the character to its right
  • the unclinging character requires a space between it and the character to its left
  • (you can't satisfy both)
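
The conflict in the "(-" case can be made concrete with a small sketch. The spacing table below is an assumption that paraphrases the rules as described in this thread, the member name UNCLINGING and the helper gap_conflict are hypothetical, and none of this is taken from the script itself.

```python
from enum import Enum


class PunctuationCategory(Enum):
    LEFT_CLINGING = "LEFT_CLINGING"    # e.g. "(": clings to what follows it
    RIGHT_CLINGING = "RIGHT_CLINGING"  # e.g. ",": clings to what precedes it
    UNCLINGING = "UNCLINGING"          # e.g. "-": assumed name, space both sides


# Assumed semantics: (space wanted before, space wanted after).
SPACING = {
    PunctuationCategory.LEFT_CLINGING: (True, False),
    PunctuationCategory.RIGHT_CLINGING: (False, True),
    PunctuationCategory.UNCLINGING: (True, True),
}


def gap_conflict(left, right):
    """True if two adjacent characters disagree about the gap between them."""
    wants_space_after_left = SPACING[left][1]
    wants_space_before_right = SPACING[right][0]
    return wants_space_after_left != wants_space_before_right


# "(-": left clinging followed by unclinging
print(gap_conflict(PunctuationCategory.LEFT_CLINGING, PunctuationCategory.UNCLINGING))
```

The left character wants no space after it and the right character wants a space before it, so no single gap satisfies both.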

Addressing your comment:

If there are two consecutive punctuation characters, the situation is probably unusual and requires a human. For example, Bethany mentioned that two single quotes are sometimes used to represent a double quote. In those weird situations the script will bail out with a warning and not normalize that part of the sentence, as there's a good chance the punctuation characters are doing something unique and novel which the script isn't designed to handle.

So I put it into the "too hard" basket for my first iteration of the script. It will warn about it though so the user is aware of it. I put that ".," example into my PR to demonstrate that.

I'm keen to fix these edge cases, but I'm first building out the main logic and getting it working for regular cases with good reporting functionality.

End punctuation, comma, period, double quotes and parentheses should not be sandwiched in among alpha characters with no whitespace. ...
I am thinking (this could need adjustment) that only a single ' or - should be allowed between alpha characters without whitespace adjacent to the (string of) punctuation marks.

I just want to clarify that the normalization process right now isn't hard-coded to particular rules for particular characters. That way it can be adjusted for different languages. The example in the PR description comes from this simple configuration:

    normalizer = Normalizer(
        [
            PunctuationNormalizationRule(".", PunctuationCategory.RIGHT_CLINGING),
            PunctuationNormalizationRule(",", PunctuationCategory.RIGHT_CLINGING),
            PunctuationNormalizationRule("?", PunctuationCategory.RIGHT_CLINGING),
            PunctuationNormalizationRule("!", PunctuationCategory.RIGHT_CLINGING),
        ]
    )

In my big comment there's a section describing how a config file lets you specify what category characters fall into.

My example only demonstrates "right clinging" characters which are the ones like comma and period that stick to the right of characters. That corresponds to what I think you meant by "end punctuation". I am writing some tests for my next PR which will show off the other kinds of rules.

Note as well, though, that double quotes can be left or right clinging depending on context, e.g.

She said: "Hello there"
          ^           ^
        left         right
        clinging     clinging

Single quote ' is another example from English. There is a punctuation category LEFT_RIGHT_CLINGING for these. These guys are ambidextrous which makes them harder to reason about in edge cases like:

She said"hello there".
        ?

She said " hello there".
         ?

If you don't know which way it clings, you don't know how to adjust the spacing. Effectively for these 2 you want to know if they're acting as opening or closing quotes. In my big comment I wrote:

(The script) could figure out if it's opening or closing by reading from the start of the sentence, but I'll wait to see if this is a real problem before putting in complex logic like that.
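
A minimal sketch of that "read from the start of the sentence" idea, assuming straight double quotes that strictly alternate between opening and closing, with no nesting and no unbalanced quotes. The function name classify_double_quotes is hypothetical and this logic is not part of the PR.

```python
def classify_double_quotes(sentence):
    """Return (index, role) for each '"' found reading left to right.

    Assumption: the first quote opens, the next closes, and so on.
    An opening quote is left clinging, a closing quote is right clinging.
    """
    roles = []
    opening = True
    for index, char in enumerate(sentence):
        if char == '"':
            role = "left clinging" if opening else "right clinging"
            roles.append((index, role))
            opening = not opening
    return roles


print(classify_double_quotes('She said"hello there".'))
```

With the role known, the quote at index 8 in the first edge case would be treated as left clinging, so the fix would be to add a space before it. An odd number of quotes in a sentence would break the alternation, which is one reason to warn rather than guess.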

@rminsil
Collaborator Author

rminsil commented Oct 11, 2024

I would put "xri" in the script name somewhere, maybe normalize_xri.py.

My impression was that this script wasn't specific to xri data though. The script is quite generic so I was hoping it would be helpful in other contexts. @mmartin9684-sil FYI

@ddaspit I'll merge the PR as I have others backing up behind it. If we decide to rename it I'll address that in a follow-up PR.

@rminsil rminsil merged commit dd935b8 into master Oct 11, 2024
1 check was pending
@rminsil rminsil deleted the issue-494-add-punctuation-normalization branch October 11, 2024 01:17
Successfully merging this pull request may close these issues.

Add data transformation and cleaning to XRI etl script
3 participants