Issue 494: Add normalization of punctuation #557

rminsil · 2024-10-10T11:20:00Z

This PR introduces the types and main happy-path logic associated with normalizing punctuation.

See this comment for context: #494 (comment)

The code is full of TODOs that will be addressed in subsequent PRs. It was getting too big for one PR...

Example:

# Input sentence
Hello.,there my  friend ! How are you   ?
012345678901234567890123456789012345678901
0         1         2         3         4

# Normalized
Hello.,there my friend! How are you?

It finds 4 candidates for normalization:

., at index 5 - this one has two punctuation characters so it leaves it alone
double space at index 15 - this gets reduced to a single space
exclamation point surrounded by whitespace at index 23 - this gets replaced with "! " effectively removing the leading whitespace
question mark with multiple leading whitespace characters at index 37 - all the leading whitespace is removed

python -m silnlp.common.normalize_extracts hackery --filter sentences.txt --log-level DEBUG --overwrite True

2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - INFO - Starting script
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - INFO - Output dir set to hackery
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - Searching files in input dir: 'hackery' that satisfy glob 'sentences.txt'
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - INFO - Found 1 files to normalize
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - Processing file hackery/sentences.txt
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - Outputting to hackery/sentences.norm.txt
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - Found 1 lines in file
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - Normalizing 'Hello.,there my  friend ! How are you   ?'
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - Hello.,there my  friend ! How are you   ?
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - 012345678901234567890123456789
2024-10-10 22:26:26,955 - silnlp.common.normalize_extracts - DEBUG - 0         1         2         
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -      **
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -      (5,7)
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - Hello.,there my  friend ! How are you   ?
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - 012345678901234567890123456789
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - 0         1         2         
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -                **
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -                (15,17)
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - Hello.,there my  friend ! How are you   ?
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - 012345678901234567890123456789
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - 0         1         2         
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -                        ***
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -                        (23,26)
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - Hello.,there my  friend ! How are you   ?
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - 012345678901234567890123456789
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - 0         1         2         
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -                                      ****
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG -                                      (37,41)
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - #consecutive space blocks   =1
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - #single punctuation chunks  =2
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - #multiple punctuation chunks=1
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - Writing 1 sentences to file: hackery/sentences.norm.txt
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - DEBUG - Finished processing hackery/sentences.txt
2024-10-10 22:26:26,956 - silnlp.common.normalize_extracts - INFO - Completed script
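
The four spans in the debug log above, (5,7), (15,17), (23,26) and (37,41), can be reproduced with a short regex-based sketch. This is only an illustration and not the code in the PR: the pattern and the name `find_candidates` are assumptions, and the punctuation set is limited to the four characters used in the example.

```python
import re

# Hypothetical pattern (not taken from the PR): a run of punctuation together
# with any whitespace around it, or a run of two or more whitespace characters.
CANDIDATE = re.compile(r"\s*[.,?!]+\s*|\s{2,}")


def find_candidates(sentence: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of normalization candidates, end exclusive."""
    return [match.span() for match in CANDIDATE.finditer(sentence)]


sentence = "Hello.,there my  friend ! How are you   ?"
for start, end in find_candidates(sentence):
    # Mark each candidate under the sentence, as the debug log does.
    print(sentence)
    print(" " * start + "*" * (end - start))
    print(" " * start + f"({start},{end})")
```

A single space between two words is never a candidate, which is why "there my" and "How are you" produce no spans.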

bhartmoore · 2024-10-10T13:04:45Z

Rohan, this looks great. There's one remaining issue and I'm trying to decide how to formulate it. In this example, Hello.,there needs to become Hello., there or Hello. , there. End punctuation, comma, period, double quotes and parentheses should not be sandwiched in among alpha characters with no whitespace. Thus a comma could appear between a word and a quotation mark, or a parenthesis could appear between a word and a period, but then there should be whitespace before the next word begins. I am thinking (this could need adjustment) that only a single ' or - should be allowed between alpha characters without whitespace adjacent to the (string of) punctuation marks.
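
One possible reading of this rule, as a minimal sketch: flag any run of punctuation that sits directly between two letters, unless the run is exactly one ' or one -. The names `SANDWICHED`, `ALLOWED_BETWEEN_LETTERS` and `find_sandwiched` are hypothetical, and treating "alpha characters" as Unicode letters is an assumption.

```python
import re

# Assumption: "alpha characters" means Unicode letters, matched by [^\W\d_].
# [^\w\s]+ matches a run of characters that are neither word characters nor
# whitespace, i.e. punctuation and symbols.
SANDWICHED = re.compile(r"(?<=[^\W\d_])([^\w\s]+)(?=[^\W\d_])")

# Only a single apostrophe or hyphen may sit between letters without whitespace.
ALLOWED_BETWEEN_LETTERS = {"'", "-"}


def find_sandwiched(sentence: str) -> list[tuple[int, str]]:
    """Return (index, punctuation) for each disallowed run between letters."""
    return [
        (match.start(), match.group(1))
        for match in SANDWICHED.finditer(sentence)
        if match.group(1) not in ALLOWED_BETWEEN_LETTERS
    ]


print(find_sandwiched("Hello.,there my friend"))
print(find_sandwiched("don't use well-known words"))
```

With this reading, Hello.,there is flagged while Hello., there passes, because whitespace follows the punctuation before the next word begins.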

ddaspit

Looks good. Just a couple of small comments.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @rminsil)

1 line 0 at r1 (raw file):
Was this file included by accident?

silnlp/common/normalize_extracts.py line 0 at r1 (raw file):
I would put "xri" in the script name somewhere, maybe normalize_xri.py.

silnlp/common/normalize_extracts.py line 55 at r1 (raw file):

class PunctuationCategory(Enum):
    left_clinging = "left_clinging"

We use all caps for enum values.

rminsil · 2024-10-11T00:48:44Z

@ddaspit Thanks for the review!

Was this file included by accident?

Yeah it was an accident. I was trying to pipe the standard error to standard output and got the symbols around the wrong way. I've removed that file now.

I would put "xri" in the script name somewhere, maybe normalize_xri.py.

My impression was that this script wasn't specific to xri data though. The script is quite generic so I was hoping it would be helpful in other contexts. @mmartin9684-sil FYI

We use all caps for enum values.

This is fixed now.

rminsil · 2024-10-11T01:11:50Z

@bhartmoore

Thanks for the feedback! It's nice to get more input from different team members now.

There's one remaining issue and I'm trying to decide how to formulate it. In this example, Hello.,there needs to become Hello., there or Hello. , there.

There's definitely more than 1 remaining issue :)

The script right now is deliberately not attacking consecutive punctuation for this first phase. In my big comment #494 (comment) I talk about it under "Undefined situations / consecutive punctuation".

Consecutive punctuation can cause some edge case situations that you can't resolve with the rules above.

For example a left clinging character followed by an unclinging character: "(-"

the left clinging character requires no space between it and the character to its right

the unclinging character requires a space between it and the character to its left

(you can't satisfy both)
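
The clash can be made concrete with a small sketch. It encodes only the spacing requirements stated in this thread, and everything else is an assumption: the names `Category`, `SPACE_RULES` and `space_between` are hypothetical, as is the `UNCLINGING` member name and the idea that an unclinging character wants a space on both sides.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    LEFT_CLINGING = "left_clinging"
    RIGHT_CLINGING = "right_clinging"
    LEFT_RIGHT_CLINGING = "left_right_clinging"
    UNCLINGING = "unclinging"  # assumed member name


# (space wanted on the left, space wanted on the right); None = not specified.
SPACE_RULES = {
    Category.LEFT_CLINGING: (None, False),   # no space to its right
    Category.RIGHT_CLINGING: (False, None),  # no space to its left
    Category.UNCLINGING: (True, True),       # assumed: space on both sides
    Category.LEFT_RIGHT_CLINGING: (None, None),  # depends on context
}


def space_between(first: Category, second: Category) -> Optional[bool]:
    """Decide whether a space belongs between two adjacent punctuation marks.

    Returns True or False when the rules agree, None when neither side has a
    requirement, and raises ValueError when the two requirements contradict.
    """
    wanted_by_first = SPACE_RULES[first][1]
    wanted_by_second = SPACE_RULES[second][0]
    if wanted_by_first is None:
        return wanted_by_second
    if wanted_by_second is None or wanted_by_second == wanted_by_first:
        return wanted_by_first
    raise ValueError(f"{first.name} followed by {second.name} cannot be satisfied")


try:
    space_between(Category.LEFT_CLINGING, Category.UNCLINGING)  # the "(-" case
except ValueError as error:
    print(error)
```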

Addressing your comment:

If there are two consecutive punctuation characters, the situation is probably unusual and requires a human. For example, Bethany mentioned that two single quotes are sometimes used to represent a double quote. In those weird situations the script will bail out with a warning and not normalize that part of the sentence, as there's a good chance the punctuation characters are doing something unique and novel which the script isn't designed to handle.

So I put it into the "too hard" basket for my first iteration of the script. It will warn about it though so the user is aware of it. I put that ".," example into my PR to demonstrate that.

I'm keen to fix these edge cases, but I'm first building out the main logic and getting it working for regular cases with good reporting functionality.

End punctuation, comma, period, double quotes and parentheses should not be sandwiched in among alpha characters with no whitespace. ...
I am thinking (this could need adjustment) that only a single ' or - should be allowed between alpha characters without whitespace adjacent to the (string of) punctuation marks.

I just want to clarify that the normalization process right now isn't hard-coded to particular rules for particular characters. That way it can be adjusted for different languages. The example in the PR description comes from this simple configuration:

    normalizer = Normalizer(
        [
            PunctuationNormalizationRule(".", PunctuationCategory.RIGHT_CLINGING),
            PunctuationNormalizationRule(",", PunctuationCategory.RIGHT_CLINGING),
            PunctuationNormalizationRule("?", PunctuationCategory.RIGHT_CLINGING),
            PunctuationNormalizationRule("!", PunctuationCategory.RIGHT_CLINGING),
        ]
    )

In my big comment there's a section describing how a config file lets you specify what category characters fall into.

My example only demonstrates "right clinging" characters which are the ones like comma and period that stick to the right of characters. That corresponds to what I think you meant by "end punctuation". I am writing some tests for my next PR which will show off the other kinds of rules.
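
To illustrate the behaviour described so far, here is a self-contained sketch that reproduces the example from the PR description. It does not use the `Normalizer` class from the PR: the function `normalize` and the regex are hypothetical. It also assumes that a trailing space is kept after a right-clinging character only when the input already had whitespace there.

```python
import logging
import re

LOGGER = logging.getLogger(__name__)

# The four right-clinging characters from the configuration above, hard-coded
# here for brevity. The pattern itself is an assumption, not the PR's logic.
CANDIDATE = re.compile(r"\s*[.,?!]+\s*|\s{2,}")


def normalize(sentence: str) -> str:
    """Normalize whitespace around single right-clinging punctuation marks."""

    def fix(match: re.Match) -> str:
        chunk = match.group()
        punctuation = chunk.strip()
        if not punctuation:
            # A block of consecutive whitespace collapses to one space.
            return " "
        if len(punctuation) > 1:
            # Consecutive punctuation: warn and leave this part untouched.
            LOGGER.warning("Skipping %r at index %d", punctuation, match.start())
            return chunk
        # Right clinging: drop leading whitespace, keep at most one trailing
        # space, and only if the input had whitespace after the character.
        trailing = " " if chunk[-1].isspace() else ""
        return punctuation + trailing

    return CANDIDATE.sub(fix, sentence)


print(normalize("Hello.,there my  friend ! How are you   ?"))
```

Keeping the multi-character case as a logged no-op mirrors the "too hard" basket described earlier: the user is told about `.,` but the text is not changed.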

Note as well, though, that double quotes can be left or right clinging depending on context, e.g.

She said: "Hello there"
          ^           ^
        left         right
        clinging     clinging

Single quote ' is another example from English. There is a punctuation category LEFT_RIGHT_CLINGING for these. These guys are ambidextrous which makes them harder to reason about in edge cases like:

She said"hello there".
        ?

She said " hello there".
         ?

If you don't know which way it clings, you don't know how to adjust the spacing. Effectively for these 2 you want to know if they're acting as opening or closing quotes. In my big comment I wrote:

(The script) could figure out if it's opening or closing by reading from the start of the sentence, but I'll wait to see if this is a real problem before putting in complex logic like that.
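
A minimal sketch of that idea, assuming quotes are never nested and are always paired: read the sentence from the start and alternate between opening and closing. The function name `classify_quotes` is hypothetical. Following the diagram above, an opening quote would be treated as left clinging and a closing quote as right clinging.

```python
def classify_quotes(sentence: str, quote: str = '"') -> list[tuple[int, str]]:
    """Label each occurrence of `quote` as opening or closing by position.

    Assumes quotes are balanced and not nested: the first one opens, the next
    one closes, and so on. An unpaired quote would throw the labels off.
    """
    roles = []
    inside_quotation = False
    for index, char in enumerate(sentence):
        if char == quote:
            roles.append((index, "closing" if inside_quotation else "opening"))
            inside_quotation = not inside_quotation
    return roles


print(classify_quotes('She said: "Hello there"'))
print(classify_quotes('She said"hello there".'))
```

This simple alternation would not be reliable for the single quote ', since an apostrophe inside a word such as "don't" would be counted as a quotation mark.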

rminsil · 2024-10-11T01:16:59Z

I would put "xri" in the script name somewhere, maybe normalize_xri.py.

My impression was that this script wasn't specific to xri data though. The script is quite generic so I was hoping it would be helpful in other contexts. @mmartin9684-sil FYI

@ddaspit I'll merge the PR as I have others backing up behind it. If we decide to rename it, I'll address that in a follow-up PR.

rminsil requested a review from ddaspit October 10, 2024 11:20

rminsil force-pushed the issue-494-add-punctuation-normalization branch from 1e9cebd to 25c04a2 Compare October 10, 2024 11:26

ddaspit requested changes Oct 10, 2024

rminsil linked an issue Oct 10, 2024 that may be closed by this pull request

Add data transformation and cleaning to XRI etl script #494

Open

Rohan M added 2 commits October 11, 2024 11:41

Issue 494: Add normalization of punctuation

83921b6

Issue 494 PR 557: Change enum to upper case

c8137ad

rminsil force-pushed the issue-494-add-punctuation-normalization branch from 25c04a2 to c8137ad Compare October 11, 2024 00:44

rminsil merged commit dd935b8 into master Oct 11, 2024
1 check was pending

rminsil deleted the issue-494-add-punctuation-normalization branch October 11, 2024 01:17
