Add data transformation and cleaning to XRI etl script #494

Open
rminsil opened this issue Aug 23, 2024 · 4 comments · Fixed by #515, #517, #518, #526 or #527
Labels
enhancement New feature or request pipeline 2: extract Issue related to extracting parallel corpora

Comments

@rminsil
Collaborator

rminsil commented Aug 23, 2024

Overview

Add data transformation/cleaning to the XRI script outlined in the parent issue: #472

Background

The creation of an initial script was captured in #473 but there's currently no data transformation logic.

Known issues

Examples of potential data quality issues that have been observed so far:

Split sentences

Record 548 in Ruwila.2024_08_12_normalized.tsv splits a sentence over 2 lines:

547	Katika mabadiliko ya kushangaza, mtumishi huyo wa chini alipiga upinde mbele ya huo umati, akibadilisha kawaida ya mwisho.	Kwu mandiko ga kutontomiya, mutumami weejo wa hansi ndiwaguma bhuta habhutongelo bhwa idaale lyelyo,akugeluzya vyevyovili vya sipelo.	train
548	Vitendo vya wanajamii wa mlima huo kipindi cha tetemeko la ardhi vilihusisha kustawisha na kuliamsha tumaini pamoja na uthabiti.	Vintu vya bhikazi bha musozi googo kazye ka limusikimiya lya nsyi 	
  ndivyatwalila kutisisya ni kulitunuzya lilungemwe hamwi ni lutonto	train		
549	Uwanja wa zamani wa Korah unadumu kama ushuhuda wa nguvu wa umuhimu wake wa kihistoria.	Isesa lya kale lya kolahi gukwikala mfuku kuti bhubhoni bhwa naga nsokose zyakwe zya gakale.	train

I've only seen one instance of this above. The corresponding unnormalized file didn't have this so I think it was introduced by normalization. Making the script intelligent enough to handle this might be going too far.

Double spaces

Double spaces in the middle of sentences seem very common.

e.g. about 750 of these in Ruwila.2024_08_12_normalized.tsv

552	Alikuwa tayari kuendelea kusukuma kotekote  kwa ajili ya marekebisho, ingawa kungehatarisha nafasi yake na kusababisha ashitakiwe.	Ndi waigelanizya kusenseleeka kutenka konsekonse kulungama ni manogelezyo,ni nga vyevyo mbe kukatwalile fwasi jaakwe ni kutwalila atulilwe ibhanza.	train
                                                  ^^

Missing translations with !

The exclamation point ! is used where XRI translators couldn't produce a vernacular translation.

e.g. in ngq_parallel_dataset_unified_2024-07-12_15-28-15_1.tsv:

1	Nyaraka za kale za ufafanuzi zilitumiwa mapema alfajiri.	!	train
17	Tulifurahia siku refu ya mandhari kali, nzuri.	!	train
99	Bali umivu, hakuna anayeweza kusema ukweli wa mwisho isipokuwa Mungu.	!	train

Inconsistent spacing around punctuation

e.g. record 11 from Konongo.2024_08_12.tsv:

11	Kʉnyanzo ya goose, kʉngʉno bhaakʉmʉyomba ndɨalɨniʉkenagʉzɨ wa ɨgwilila,hado akaleka mbeela ʉmlɨmo gwakwe agisi ,yɨyo akwikola ɨlɨbheezya.	train
                                                                              ^                                        ^

e.g. record 98 from ngq_parallel_dataset_unified_2024-07-12_15-28-15_1.tsv:

98	... 	Enkagha yenyangi ,omwitabhi we Egetema nainanghore bhukong'u nokumaruho naghwere ,okwereki echenguru :chamasambho .	train
                                 ^                                                               ^                                ^

Typos

In the below example, the letter 'O' was replaced with the number '0' in record 3674 of swa_ngq_4k_sf.tsv for the vernacular translation:

0bhosineenu bho  abhachiira bhano, bhakumerri kwanga
^

Leading colon characters

There are many examples in the Ngoreme file, swa_ngq_4k_sf.tsv, of a : at the start of a word:

174	 ... 	Omweghi  uno arekubherekeru Choni akabhugha!, "nobhukaghi okwangha ghuteghera amang'ana ghaane :righeri noghi kibhara  wangharre  bhukong'u gho kere ubherekeriru".	train
                                                                                                               ^

I'm not sure if this is part of the Ngoreme language and how we should treat it.

Potential issues I can't replicate

These are issues we might encounter but I haven't been able to find examples of them so far in the .tsv data I've been given. Note that at time of writing I only have tsv files for Tanzanian languages Ngoreme (ngq), Konongo (kcz), Ruwila (rwl). We have older json based XRI files for Indonesian languages.

Leading and trailing whitespace

Leading/trailing whitespace before a LWC or vernacular translation would potentially cause issues.

I searched across the datasets I've been given for tab characters followed by space and found nothing except in the review files:

rg '\t '
(no output)

rg ' \t'
(no output)

We can still handle it anyway; it's not complex.

Duplicate LWC sentences

It's possible that a sentence will get entered multiple times and then produce multiple conflicting translations. To test for this, I hacked this into my ETL script:

+def check_for_duplicates(sentence_pairs) -> None:
+    source_set = set([sentence_pair.source for sentence_pair in sentence_pairs])
+    if len(source_set) != len(sentence_pairs):
+        print("Not the same!!!!")
+
+
 def run(cli_input: CliInput) -> None:
     sentence_pairs = load_sentence_pairs(cli_input.input_file_path)
-    create_extract_files(cli_input, sentence_pairs)
+    check_for_duplicates(sentence_pairs)
+    # create_extract_files(cli_input, sentence_pairs)

then ran it across the files I'd been sent and it printed nothing.
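A slightly more informative variant of the check above, as a minimal sketch: rather than printing a single message, it returns the LWC sentences that occur more than once. It assumes sentence pairs are plain `(source, target)` tuples, which may differ from the real `sentence_pair` objects in the ETL script; `find_duplicate_sources` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_duplicate_sources(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return source sentences that occur more than once, in first-seen order."""
    counts = Counter(source for source, _target in pairs)
    return [source for source, count in counts.items() if count > 1]


# Illustrative data only, not taken from the XRI files.
pairs = [("a", "x"), ("b", "y"), ("a", "z")]
print(find_duplicate_sources(pairs))  # ['a']
```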

Other stuff

Some things I've noticed that are noteworthy but not necessarily data quality issues which the script could fix:

Missing id's

I noticed 137 examples of missing id's in swa_ngq_4k_sf.tsv:

54	Ukiukwaji mkubwa wa haki katika kanuni zinazodhibiti uchafuzi wa hewa huishia kuathiri jamii zilizo mbali na vyanzo vya uchafuzi huo.	Oghusari amaragheri kobhonene bhomoso, ghochichera chino chitooru okureki okoghundi omwika nihaserebhera oghusari ehamate ino imenyire kure chinsoro cha amanche.	train
(no 55)
56	Mwanamuziki wa kinubi angeweza kufanya timua vikwazo vyake huku kinywa chake kikibeba melodi vizuri.	Omutemi we eritinghwe nghutora are arusi amahaghe ghaache na kehayo komonyu ghoonche narentiri eribhina richomu.	train

...

82	Alikosa kufunika misingi yetu na tukose nafasi hii wakati twende na mapepo yanapopewa nafasi ya kufunika mabawa.	Nasarri okobhisa esemoka yiito, no okubhori omweya ghuno, ko enkagha ino amasambho hano ghakuhabhu echenghuru.	train
(no 83)
84	Shujaa alikabidhi mkono wake kwa mtu kama ishara ya dhabihu.	Omukare werihe, nahanire okubhoko ghooche komonto kebhore yekierekeri ke eghisenghero.	train

...

103	Ilikuwa dhahiri kwamba mwanambuzi alikuwa amepanda juu ya paa, wakati watu wote walipokuwa wakimkimbiza mhalifu.	Yabhaire okurabhu agho esubheni wabhwine erinire konyumbha ighoro koghisara  enkagh aabhanto bhaansi bhare komuhebhani omwibhi.	train
(no 104)
105	Mpango wa mkutano huo ulikuwa kujadili jinsi mfanyakazi mmoja, aliyekuwa na jukumu muhimu katika  mradi ikiwa ataadhibiwa kwa  kuwadhihaki wenzake, jambo ambalo halikubaliki.	Ubhunaghi bhwe erikomo riyo bhwarekughabhera kebhore omukoriwemeremo omwe ,uno abhaire omunaghi omokoro komoremo ghuyo ,urabhone bharamwita echagho nokomusera bhaghendi bhaache ,eng'ana ino itareghwitabheribhu.	train

...

119	Hivi karibuni, kila mtu atafuata mpango ambao tuliokubaliana, kama tulivyoahidi.	Ghonkagha ino eghucha, omonto wunsi aratuna obhusemi bhuno twitwbheraini, kebhore twaghambhaine.	train
(no 120)
121	Aliamua kuomba ruhusa kutoka kwa Mungu na kuwa sehemu pekee alikokuwa akisubiri jibu la uponyaji.	Nasemiri oghusabha omweya, okoru hare Ghetema nokubha ahase ahene  ahaghare arekughanyeri rihonchokiri ryo obhuhori.	train

For Ruwila and Konongo there were no missing records.

UPDATE: Michael asked XRI about this. The missing entries correspond to sentences that haven't been translated yet.

Plan

  • gather any other example of known or anticipated data quality issues from Michael Martin
  • determine scope for the ones we want to tackle
  • break the work up into a few PR's linked to this issue
@rminsil rminsil added enhancement New feature or request pipeline 2: extract Issue related to extracting parallel corpora labels Aug 23, 2024
@rminsil
Collaborator Author

rminsil commented Sep 12, 2024

Michael suggested putting some basic cleaning in the current script, and moving more advanced normalization into a separate script.

My thinking is that the extraction script should primarily focus on getting all of the valid sentence pairs out of the XRI dataset, while the normalization script should handle all of the auto-correction tasks.

The sections below define the responsibilities of the two scripts from the perspective of data quality.

Extraction Script Responsibilities

  • Filtering or correcting split sentences
    • Discard sentence pairs that don't follow the expected format
    • Stretch goal would be to re-assemble and retain sentence pairs that are split over 2 lines because of an embedded newline
  • Removal of Missing translations (indicated by !)
  • Removal of leading/trailing whitespace on the sentence pairs
  • Duplicate LWC entries
    • Log instances at error level
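The extraction responsibilities listed above could be sketched roughly as follows. This is only an illustration under some assumptions: rows are assumed to have four tab-separated columns (id, LWC sentence, vernacular sentence, split), as in the examples in the issue description; `parse_row` and `extract` are hypothetical names; and duplicates are logged but kept, since the list does not say whether they should be dropped.

```python
import logging
from typing import List, Optional, Tuple

logger = logging.getLogger(__name__)

MISSING_TRANSLATION = "!"
Row = Tuple[str, str, str, str]  # (id, LWC sentence, vernacular sentence, split)


def parse_row(line: str) -> Optional[Row]:
    """Parse one TSV line, or return None if the row should be discarded."""
    # Stripping each field removes leading/trailing whitespace on the sentences.
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    # Tolerate trailing empty columns (stray tabs at the end of a line).
    while fields and fields[-1] == "":
        fields.pop()
    if len(fields) != 4:
        return None  # unexpected format, e.g. one half of a split sentence
    row_id, source, target, split = fields
    if not row_id.isdigit() or not source or not target:
        return None
    if target == MISSING_TRANSLATION:
        return None  # translator could not produce a vernacular translation
    return row_id, source, target, split


def extract(lines: List[str]) -> List[Row]:
    """Keep valid rows; log discarded rows and duplicate LWC sentences."""
    rows: List[Row] = []
    seen_sources = set()
    for line_number, line in enumerate(lines, start=1):
        row = parse_row(line)
        if row is None:
            logger.warning("Skipping line %d: bad format or missing translation", line_number)
            continue
        if row[1] in seen_sources:
            logger.error("Duplicate LWC sentence on line %d: %r", line_number, row[1])
        seen_sources.add(row[1])
        rows.append(row)
    return rows
```

With this shape, both halves of a sentence split by an embedded newline fail the column count check and are discarded, which matches the non-stretch goal.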

Normalization Script Responsibilities

  • Convert multiple consecutive spaces to a single space
  • Correct inconsistent spacing around punctuation
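As a small illustration of the first normalization, here is a sketch that collapses runs of spaces. It assumes only the ASCII space character matters (tabs are column separators in the TSV files and are left alone); `collapse_spaces` is a hypothetical name.

```python
import re

_MULTIPLE_SPACES = re.compile(r" {2,}")


def collapse_spaces(sentence: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    return _MULTIPLE_SPACES.sub(" ", sentence)


print(collapse_spaces("kusukuma kotekote  kwa ajili"))  # kusukuma kotekote kwa ajili
```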

Future normalizations to consider:

  • Quotation mark normalization (to be defined).
  • Capitalization normalization (to be defined).
  • Leading colon - not sure about this one, let's defer it for the moment.

@ddaspit
Collaborator

ddaspit commented Sep 12, 2024

@mmartin9684-sil Do we know why the split sentences occur?

@rminsil
Collaborator Author

rminsil commented Sep 13, 2024

@mmartin9684-sil Do we know why the split sentences occur?

@ddaspit I have only seen one instance of this and it was from a normalized file. The original unnormalized file didn't have the newline, so I suspect it was added by the normalization process.

See the heading "Split sentences" in the issue description for the example.

@rminsil
Collaborator Author

rminsil commented Oct 10, 2024

Normalization around punctuation

The next task on the list from #494 (comment) is:

Correct inconsistent spacing around punctuation

This will be done in the normalize_extracts.py script.

I asked the team for clarification on what is considered punctuation because that is something that could be very language dependent.

The current default list is:

  • . (period)
  • , (comma)
  • ' (single quote ascii)
  • " (double quote ascii)
  • : (colon)
  • ; (semicolon)
  • ? (question mark)
  • ! (exclamation point)
  • - (hyphen)
  • / (forward slash)
  • <> (angle brackets)
  • () (parentheses)
  • ¿ (upside down question mark)
  • ¡ (upside down exclamation point)

But the script will allow specialized lists to be passed in.

Note some of these characters have exotic unicode variations, for example:

  • «» (double angle brackets)
  • ‘’ (U+2018/2019, cursive single quotes)
  • “” (U+201C/201D, cursive double quotes)
  • － (U+FF0D, full width hyphen)

and some software will sneak these in when you input certain characters, e.g. Microsoft Word.

Other comments from the team

In relation to whitespace around punctuation:

(Bethany) I noticed in the XRI normalized data that all punctuation is set off from all words with whitespace. I'm not sure we want that here, because we want the model to learn how punctuation is used normally.
One common issue that definitely creates problems is the word,word scenario, where our tools treat this as a single word.

In relation to double apostrophes:

(Bethany) I've also noticed that different translators in the same language use quotes very differently: some use two apostrophes for a double quote, and some use both single and double quotes to set off every quotation. I'm not sure what you and Michael may have already discussed, but it may be appropriate to reduce all adjacent quotation marks to either a single ' or ".

I would do ^ as a separate PR and make it possible to disable.

In relation to false positives:

(Damien) Punctuation normalization can be tricky, because every language has different punctuation rules. Is the plan just to do a best effort to clean up some of the more egregious punctuation issues? We need to be careful that we aren't incorrectly "fixing" punctuation issues that actually aren't issues.

(Bethany) Yes, this is to clean up the human-generated data we're getting from XRI. So far we don't have any angle brackets or inverted ? or !, though that could come up. The most common issues I've observed are words "joined" by punctuation and extreme variability in the use of quotation marks.


Challenges with punctuation lists

I anticipate some issues with trying to define a set of punctuation and normalization logic:

  • false negatives (under normalization) - we will end up missing some punctuation markers that are outside our set because a language uses characters we didn't anticipate
  • false positives (over normalization) - in an attempt to reduce false negatives, we will end up introducing more punctuation characters, but some of those characters also have non-punctuation meanings in the same language and those characters get normalized when they shouldn't be

Expanding on the last point, there are many examples from English:

  • single quote ' is used as
    • apostrophe of possession or contraction, e.g. "it's" "can't" "Mc'Barson" "e.g."
    • quoting: e.g. "he said 'hi'"
  • hyphen - can be used to:
    • join two words, e.g. "non-empty"
    • grammatically act like a semi-colon/colon to elaborate, e.g. "We climbed the tall mountain - (that is) Mt Fuji"
  • punctuation characters are frequently used in domain specific contexts as delimiters
    • Bible verse: "John 3:16" - you don't want spacing after the ':' here
    • email: "boban@sil.org" - period is not punctuation here
    • ip address: "192.168.3.15" - period is not punctuation here
  • miscellaneous stuff
    • e.g. ellipsis "..." (3 periods, but they aren't grammatical)

The spacing rules around those characters are different depending on how they're used. For example "it's" has no spacing around the single quote, but "he said 'hi'" should have spacing left of the opening single quote and right of the closing quote.

The script will make this easier to tackle if it provides some tools:

  • a clear summary of the normalizations that have taken place so that a human with domain knowledge for that language can smell check the changes visually
  • it reports instances where it suspects one of the problems above
  • producing a patch file which captures the changes and allows someone to manually edit the patch and reapply it themselves on the original
  • it provides configurability from the command line around what is considered punctuation and what category
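The patch file idea can be sketched with the standard library's `difflib`. This is an assumption about how it might be done, not the script's actual design; `make_patch` and the `a/`, `b/` path prefixes are illustrative choices.

```python
import difflib
from typing import List


def make_patch(original: List[str], normalized: List[str], filename: str) -> str:
    """Return a unified diff between the original and normalized lines."""
    diff = difflib.unified_diff(
        [line + "\n" for line in original],
        [line + "\n" for line in normalized],
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return "".join(diff)


# Illustrative data only.
print(make_patch(["A  b", "c"], ["A b", "c"], "example.tsv"))
```

A reviewer could then edit the resulting text and apply it to the original file with a standard patch tool.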

Categories of punctuation

There's a few different categories of punctuation I can think of from English:

  • right clinging, i.e. it hangs off the right side of a word and is followed by a space when it's not ending the sentence, e.g. ., ? and closing brackets of all kinds - "Hello there." and "How are you? Good?"
  • left clinging, like above but on the left side of a word, e.g. opening brackets of all kinds and cursive opening single quotes like ‘ (U+2018), e.g. "She said ‘Hello there’"
  • left-right clinging - can cling to either side of a word, e.g. ascii single quote (') and double quote ("), e.g. 'I said "Hello there"'
  • unclinging - characters that should always have a space on either side and wouldn't usually end a sentence, e.g. hyphen, e.g. "I climbed the tallest mountain - the one they call Mt Fuji"

Configuration file

The definition of punctuation changes with each language so it will be supplied as an input file, e.g.

# Comma ','
U+002C RIGHT_CLINGING

# Period '.'
U+002E RIGHT_CLINGING

# Left parentheses '('
U+0028 LEFT_CLINGING

# Right parentheses ')'
U+0029 RIGHT_CLINGING

# Full width hyphen '－'
U+FF0D UNCLINGING

The file allows comments (lines starting with '#') and blank lines. These are ignored.

Each entry is a hexadecimal unicode representation of a character, then a description of the category of punctuation it falls into.

I will provide a sample text block in the script itself containing many common punctuation characters. Users can copy-paste for their purposes and comment out or delete the ones they don't need.

I chose to use hexadecimal codes rather than the character itself because:

  • not all text editors know how to render all characters which would make it hard for someone to manually edit the config file
  • some characters cause text editors to change behaviour, e.g. there are some right-to-left languages like Arabic and Hebrew and when editors detect those characters they change how they render the line to be right-to-left
  • some characters don't render well in isolation and are intended to be married to another "big" character (for example some accents)
  • it's easier to write a parser when you can assume all the characters are Latin - some exotic unicode characters might do funny things to the parser
  • it is easier for a human to translate from unicode to character, than from character to unicode

So I will rely on a convention of people adding a comment above each entry interpreting what the character is and providing the rendered character if that makes sense.

A character can have at most one entry in the file - it doesn't make sense for a character to be in two categories simultaneously as you would have ambiguity over how to normalize the spacing around it.
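A minimal sketch of a parser for this file format, assuming the rules described above (comments, blank lines, `U+XXXX CATEGORY` entries, at most one entry per character). The `Category` enum and `parse_config` are hypothetical names, and `LEFT_RIGHT_CLINGING` is an assumed spelling for the left-right category, which the sample file does not show.

```python
from enum import Enum
from typing import Dict


class Category(Enum):
    LEFT_CLINGING = "LEFT_CLINGING"
    RIGHT_CLINGING = "RIGHT_CLINGING"
    LEFT_RIGHT_CLINGING = "LEFT_RIGHT_CLINGING"  # assumed name
    UNCLINGING = "UNCLINGING"


def parse_config(text: str) -> Dict[str, Category]:
    """Map each configured punctuation character to its category."""
    entries: Dict[str, Category] = {}
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        parts = line.split()
        if len(parts) != 2 or not parts[0].upper().startswith("U+"):
            raise ValueError(f"line {line_number}: expected 'U+XXXX CATEGORY', got {line!r}")
        code_point, category_name = parts
        char = chr(int(code_point[2:], 16))  # hexadecimal code point to character
        if category_name not in Category.__members__:
            raise ValueError(f"line {line_number}: unknown category {category_name!r}")
        if char in entries:
            raise ValueError(f"line {line_number}: duplicate entry for {code_point}")
        entries[char] = Category[category_name]
    return entries


sample = """
# Comma ','
U+002C RIGHT_CLINGING

# Left parentheses '('
U+0028 LEFT_CLINGING
"""
print(parse_config(sample))
```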

Related to false positives above, if a punctuation character sometimes has a non-punctuation function (e.g. "it's"), the script will need a way to understand when it's used as punctuation. It's also possible there may be some edge cases we come across where a punctuation character has two different grammatical punctuation functions which require different normalization. The configuration file would need to introduce a way to differentiate these based on context. Then it can allow multiple entries for the character differentiated by context. I'll deal with those issues in a later iteration when I have some real examples to work with.


Normalization rules

Idempotency

I will try to make the normalization logic as idempotent as possible, meaning that normalizing a file twice is the same as normalizing it once.

              normalize         normalize
       file1 ----------> file2 ----------> file3
                                   ==

The benefit of this is that it's okay if an already normalized file accidentally gets included in a bundle of files to process.
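Idempotency is easy to check mechanically. The sketch below is a hypothetical test helper, not part of the script: `is_idempotent` applies a normalizer twice and compares the results, and the two toy normalizers show one that passes and one that fails.

```python
import re
from typing import Callable, Iterable


def is_idempotent(normalize: Callable[[str], str], samples: Iterable[str]) -> bool:
    """True if normalizing twice gives the same result as normalizing once."""
    for sample in samples:
        once = normalize(sample)
        if normalize(once) != once:
            return False
    return True


def collapse_all(sentence: str) -> str:
    # Replaces every run of spaces in a single pass.
    return re.sub(r" {2,}", " ", sentence)


def collapse_pairs(sentence: str) -> str:
    # Replaces each pair of spaces once, so long runs need several passes.
    return sentence.replace("  ", " ")


samples = ["a    b", "no change"]
print(is_idempotent(collapse_all, samples))    # True
print(is_idempotent(collapse_pairs, samples))  # False
```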

Assumptions

The requirements below are assuming that the sentences have whitespace trimmed off the boundaries and that the sentences are non-empty.

Left clinging

In simple terms, when a left clinging character is encountered, the whitespace to the left is contracted to a single space, and the whitespace to the right is removed, e.g.

She said  ( and she's my friend ...
          ^

She said (and she's my friend ...

There are edge cases at the boundary to consider. This table captures all the cases:

     Position of
     punctuation          Changes made   Changes made
     within               left of        right of
     sentence             punctuation    punctuation
    --------------------------------------------------------

    Start of              <none>         remove
    sentence                             right
                                         whitespace

    e.g. "( Hi"                     "(Hi"
          ^
    --------------------------------------------------------

    Middle of             shrink to      remove all
    sentence              one space      space


    e.g. "A  ( B"                  "A (B"
             ^                        ^
    --------------------------------------------------------

    End of                shrink to     <none>
    sentence              one space


    e.g. "A   ("                   "A ("

If a left clinging character is found at the end of the sentence, the script will raise some kind of warning.
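The table above could be implemented along these lines. This is a sketch under two assumptions: the sentence is already trimmed, and a missing space on the left (e.g. `A(B`) is treated the same as too many spaces, so one is inserted. The warning for a character at the end of the sentence is left out for brevity, and `normalize_left_clinging` is a hypothetical name.

```python
import re


def normalize_left_clinging(sentence: str, char: str) -> str:
    """One space before `char` (none at sentence start), no space after it."""
    pattern = re.compile(r"\s*" + re.escape(char) + r"\s*")

    def fix(match: "re.Match[str]") -> str:
        left = "" if match.start() == 0 else " "
        return left + char

    return pattern.sub(fix, sentence)


print(normalize_left_clinging("( Hi", "("))    # (Hi
print(normalize_left_clinging("A  ( B", "("))  # A (B
print(normalize_left_clinging("A   (", "("))   # A (
```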

Right clinging

It's the conceptual mirror of left clinging:

     Position of
     punctuation          Changes made   Changes made
     within               left of        right of
     sentence             punctuation    punctuation
    --------------------------------------------------------

    Start of              <none>         shrink to
    sentence                             one space


    e.g. ")  Hi"                    ") Hi"
          ^                          ^
    --------------------------------------------------------

    Middle of             remove all     shrink to
    sentence              space          one space


    e.g. "A )  B"                  "A) B"
            ^                        ^
    --------------------------------------------------------

    End of                remove all    <none>
    sentence              space


    e.g. "A   )"                   "A)"
              ^                      ^

If a right clinging character is found at the start of the sentence, the script will raise some kind of warning.
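A mirrored sketch for right clinging characters, checked against the three rows of the table. The same assumptions apply: the sentence is trimmed, a missing space on the right is inserted (which also covers the word,word scenario mentioned earlier), and the start-of-sentence warning is omitted. `normalize_right_clinging` is a hypothetical name.

```python
import re


def normalize_right_clinging(sentence: str, char: str) -> str:
    """No space before `char`, one space after it (none at sentence end)."""
    pattern = re.compile(r"\s*" + re.escape(char) + r"\s*")

    def fix(match: "re.Match[str]") -> str:
        right = "" if match.end() == len(sentence) else " "
        return char + right

    return pattern.sub(fix, sentence)


# The three rows of the table: start, middle and end of sentence.
table_cases = [(")  Hi", ") Hi"), ("A )  B", "A) B"), ("A   )", "A)")]
for before, expected in table_cases:
    assert normalize_right_clinging(before, ")") == expected

print(normalize_right_clinging("ɨgwilila,hado akaleka agisi ,yɨyo", ","))
```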

Left-Right Clinging

This one is more tricky because you need to use context to understand if it's left or right clinging:

She said  'and she's my friend'  and I agreed
          ^                   ^

She said 'and she's my friend' and I agreed

The first single quote is left clinging, so the space to its left is shrunk to one character.

The second single quote is right clinging, so the space to its right is shrunk to one character.

There's ambiguous situations like:

She said ' and she's my friend'  and I agreed
         ^
         ?

There's space on either side of the character so you don't know which space to remove. In those situations, the script will raise a warning and not change anything. It could figure out if it's opening or closing by reading from the start of the sentence, but I'll wait to see if this is a real problem before putting in complex logic like that.
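One way to sketch the rule above: decide per occurrence by looking at which side has whitespace. Whitespace on the left only means opening, on the right only means closing, on both sides is ambiguous (warn, leave as-is), and on neither side (as in "she's") is left untouched. This interprets "not change anything" as applying to that occurrence only, which is an assumption; `normalize_left_right_clinging` is a hypothetical name.

```python
import re
from typing import List, Tuple


def normalize_left_right_clinging(sentence: str, char: str) -> Tuple[str, List[str]]:
    """Shrink the whitespace on the spaced side of `char`; warn when ambiguous."""
    warnings: List[str] = []
    pattern = re.compile(r"(\s*)" + re.escape(char) + r"(\s*)")

    def fix(match: "re.Match[str]") -> str:
        left, right = match.group(1), match.group(2)
        if left and right:
            index = match.start() + len(left)
            warnings.append(f"ambiguous {char!r} at index {index}")
            return match.group(0)  # leave this occurrence untouched
        if left:
            return " " + char  # opening: acts like left clinging
        if right:
            return char + " "  # closing: acts like right clinging
        return match.group(0)  # embedded, e.g. the apostrophe in "she's"

    return pattern.sub(fix, sentence), warnings


print(normalize_left_right_clinging("She said  'and she's my friend'  and I agreed", "'"))
```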

Unclinging

Spacing on each side is changed to exactly one space except at the boundaries:

She said -and I quote ...
         ^

She said - and I quote ...

When an unclinging character is at the boundary the script will raise a warning.
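A sketch of the unclinging rule, including the boundary warning. It assumes the sentence is trimmed and returns warnings as a list rather than logging them; `normalize_unclinging` is a hypothetical name.

```python
import re
from typing import List, Tuple


def normalize_unclinging(sentence: str, char: str) -> Tuple[str, List[str]]:
    """Exactly one space on each side of `char`, except at sentence boundaries."""
    warnings: List[str] = []
    pattern = re.compile(r"\s*" + re.escape(char) + r"\s*")

    def fix(match: "re.Match[str]") -> str:
        at_start = match.start() == 0
        at_end = match.end() == len(sentence)
        if at_start or at_end:
            warnings.append(f"{char!r} at sentence boundary")
        return ("" if at_start else " ") + char + ("" if at_end else " ")

    return pattern.sub(fix, sentence), warnings


print(normalize_unclinging("She said -and I quote", "-"))
# ('She said - and I quote', [])
```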

Undefined situations / consecutive punctuation

Consecutive punctuation can cause some edge case situations that you can't resolve with the rules above.

For example a left clinging character followed by an unclinging character: "(-"

  • the left clinging character requires no space between it and the character to its right
  • the unclinging character requires a space between it and the character to its left
  • (you can't satisfy both)

If there are two consecutive punctuation characters, the situation is probably unusual and requires a human. For example Bethany mentioned how two single quotes are sometimes used to represent a double quote. In those weird situations the script will bail out with a warning and not normalize that part of the sentence as there's a good chance the punctuation characters are doing something unique and novel which the script isn't designed to handle.

It also might make sense to bail out when punctuation is at the boundary of a sentence in an unexpected way, e.g.

  • a left clinging character ending a sentence
  • a right clinging character starting a sentence
  • a left-right clinging character not touching another character
  • an unclinging character starting or ending a sentence
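Detecting the consecutive punctuation case before normalizing could look something like this. It is a sketch that treats "consecutive" as directly adjacent characters with no whitespace between them, which is an assumption; `find_punctuation_runs` is a hypothetical name.

```python
import re
from typing import List, Set, Tuple


def find_punctuation_runs(sentence: str, punctuation: Set[str]) -> List[Tuple[int, str]]:
    """Return (index, text) for each run of two or more adjacent punctuation characters."""
    if not punctuation:
        return []
    # Build a character class such as ['()\-] from the configured characters.
    char_class = "[" + "".join(re.escape(char) for char in sorted(punctuation)) + "]"
    pattern = re.compile(char_class + "{2,}")
    return [(match.start(), match.group(0)) for match in pattern.finditer(sentence)]


print(find_punctuation_runs("He said ''hello'' (-ok)", {"'", "(", "-", ")"}))
# [(8, "''"), (15, "''"), (18, '(-')]
```

The script could skip normalization for any span returned here and emit a warning instead.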

False negative/positive warnings

Above I list out potential issues with the script

  • false negatives (under normalization)
  • false positives (over normalization)

Heuristics built into the script could try to detect these and warn the user. These would just be warnings - the script won't override the user's configuration.

The tolerance for these warnings can be tuned, i.e. only warn about certain characters once the number of false negatives goes over some threshold percentage. Otherwise the script might produce a snow storm of warnings.

False negative detection (under normalization)

This is the script looking for characters it suspects are punctuation, but the configuration doesn't include them.

Hypothetical example: the configuration is missing cursive opening and close quotes and the script notices this and warns:

Then he said: ‘Hello’
              ^     ^

The script has a "catch-all" definition of punctuation which is:

any character that is not considered a unicode letter or number or whitespace

The third-party regex module supports Unicode categories for letters (\p{L}) and numbers (\p{N}) that go beyond the typical Latin A-Z and 0-9. Here's an example searching for numbers, where it finds:

  • ٣ Arabic Indic 3 (U+0663)
  • 0
  • ㆕ Chinese digit 4 (U+3195)
>>> import regex
>>> regex.findall(r"\p{N}", "abc٣edf0def㆕")
['٣', '0', '㆕']

In the example above it would notice the quotes as not being a letter or numbers and suspect them as being punctuation. The characters are also clinging to the edge of a word which increases the chance they are punctuation.
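The catch-all definition can also be sketched with only the standard library. Note the swap: this uses `unicodedata.category` instead of the regex module's `\p{L}` and `\p{N}`, treating any general category starting with "L" or "N" as a letter or number. `suspected_punctuation` is a hypothetical name, and the "clinging to the edge of a word" signal is not modelled here.

```python
import unicodedata
from collections import Counter
from typing import Iterable, Set


def suspected_punctuation(sentences: Iterable[str], configured: Set[str]) -> Counter:
    """Count characters that are not letters, numbers, whitespace or configured punctuation."""
    counts: Counter = Counter()
    for sentence in sentences:
        for char in sentence:
            if char in configured or char.isspace():
                continue
            # General categories starting with L are letters, N are numbers.
            if unicodedata.category(char)[0] in ("L", "N"):
                continue
            counts[char] += 1
    return counts


print(suspected_punctuation(["Then he said: ‘Hello’"], {":"}))
```

Because the definition is literally "not a letter, number or whitespace", standalone combining marks would also be counted, so the counts are a prompt for human review rather than a verdict.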

False positive detection (over normalization)

False positives are where a punctuation character gets normalized, but it's not acting like a grammatical punctuation character.

# Original
It's time to visit Sam-the-man

# Normalized
It' s time to visit Sam - the - man

In all the false positive examples I've been able to think of, the punctuation character is surrounded by non-whitespace characters. So I will make that the heuristic that triggers a warning.
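That heuristic is simple to sketch: flag any configured punctuation character whose neighbours on both sides are non-whitespace. `embedded_punctuation` is a hypothetical name, and the function only reports positions; deciding whether to warn or skip is left to the caller.

```python
from typing import List, Set, Tuple


def embedded_punctuation(sentence: str, punctuation: Set[str]) -> List[Tuple[int, str]]:
    """Return (index, char) for punctuation with non-whitespace on both sides."""
    hits: List[Tuple[int, str]] = []
    for index in range(1, len(sentence) - 1):
        char = sentence[index]
        if char not in punctuation:
            continue
        if not sentence[index - 1].isspace() and not sentence[index + 1].isspace():
            hits.append((index, char))
    return hits


print(embedded_punctuation("It's time to visit Sam-the-man", {"'", "-"}))
# [(2, "'"), (22, '-'), (26, '-')]
```

The same check also fires on genuine problems such as word,word, which is why it would only produce a warning and not block the normalization.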

Scope notes

For the next block of work I do:

In scope

  • users defining configuration files for their punctuation
  • basic heuristics trying to warn of false negative/positives
  • normalization logic for left, right, left-right and unclinging punctuation
  • a description of the normalizations applied logged out
  • a patch generated summarising the changes
  • (?) colored output with ansi escape codes for the normalizations

Out of scope

  • smart transformations of punctuation characters, e.g. '' -> " (2xsingle quote -> 1xdouble quote) mentioned by Bethany
  • context dependent analysis of punctuation to understand what function it plays in the sentence (relates to false positives)
