Add data transformation and cleaning to XRI etl script #494
Michael suggested putting some basic cleaning in the current script, and moving more advanced normalization into a separate script.
The sections below define the responsibilities of the two scripts from the perspective of data quality.

Extraction Script Responsibilities
Normalization Script Responsibilities
Future normalizations to consider:
@mmartin9684-sil Do we know why the split sentences occur?
@ddaspit I have only seen one instance of this and it was from a normalized file. The original unnormalized file didn't have the newline, so I suspect it was added by the normalization process. See the heading "Split sentences" in the issue description for the example.
Normalization around punctuation

The next task on the list from #494 (comment) is:
This will be done in the normalization script. I asked the team for clarification on what is considered punctuation, because that is something that could be very language dependent. The current default list is:
But the script will allow specialized lists to be passed in. Note that some of these characters have exotic Unicode variations, for example:
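For instance, the ASCII apostrophe has several common look-alike code points (these particular examples are my illustration, not the team's list):

```python
import unicodedata

# A few look-alike variants of the ASCII apostrophe (U+0027).
# The real list of characters would come from the configuration file.
variants = ["\u0027", "\u2018", "\u2019", "\u02BC"]
for ch in variants:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```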
Some software will sneak these in when you input certain characters, e.g. Microsoft Word.

Other comments from the team

In relation to whitespace around punctuation:
In relation to double apostrophes:
I would do ^ as a separate PR and make it possible to disable. In relation to false positives:
Challenges with punctuation lists

I anticipate some issues with trying to define a set of punctuation and normalization logic:
Expanding on the last point, there are so many examples from English:
The spacing rules around those characters are different depending on how they're used. For example "it's" has no spacing around the single quote, but "he said 'hi'" should have spacing left of the opening single quote and right of the closing quote. The script will make this easier to tackle if it provides some tools:
Categories of punctuation

There are a few different categories of punctuation I can think of from English:
Configuration file

The definition of punctuation changes with each language, so it will be supplied as an input file, along the lines of the sketch below.
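A minimal hypothetical example of what the file could look like (the file name, category names, and layout here are my assumptions, not a settled format):

```
# punctuation.txt (hypothetical example)
# Format: <hex code point>  <category>

# . FULL STOP
002E  right-clinging

# , COMMA
002C  right-clinging

# ( LEFT PARENTHESIS
0028  left-clinging

# ' RIGHT SINGLE QUOTATION MARK (U+2019)
2019  left-right-clinging
```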
The file allows comments (lines starting with '#') and blank lines. These are ignored. Each entry is a hexadecimal Unicode representation of a character, followed by the category of punctuation it falls into. I will provide a sample text block in the script itself containing many common punctuation characters. Users can copy-paste it for their purposes and comment out or delete the ones they don't need. I chose to use hexadecimal codes rather than the character itself because:
So I will rely on a convention of people adding a comment above each entry interpreting what the character is, and providing the rendered character if that makes sense. A character can have at most one entry in the file - it doesn't make sense for a character to be in two categories simultaneously, as you would have ambiguity over how to normalize the spacing around it.

Related to false positives above, if a punctuation character sometimes has a non-punctuation function (e.g. "it's"), the script will need a way to understand when it's used as punctuation. It's also possible there may be some edge cases we come across where a punctuation character has two different grammatical punctuation functions which require different normalization. The configuration file would need to introduce a way to differentiate these based on context. Then it can allow multiple entries for the character, differentiated by context. I'll deal with those issues in a later iteration when I have some real examples to work with.

Normalization rules

Idempotency

I will try to make the normalization logic as idempotent as possible, meaning that normalizing a file twice is the same as normalizing it once.
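As a tiny property check (a hypothetical helper, not part of the script), this is the invariant being aimed for:

```python
def check_idempotent(normalize, sentences):
    """Sanity check: applying normalize twice should equal applying it once."""
    for s in sentences:
        assert normalize(normalize(s)) == normalize(s), s
```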
The benefit of this is that it's okay if an already normalized file accidentally gets included in a bundle of files to process.

Assumptions

The requirements below assume that the sentences have whitespace trimmed off the boundaries and that the sentences are non-empty.

Left clinging

In simple terms, when a left clinging character is encountered, the whitespace to the left is contracted to a single space, and the whitespace to the right is removed, e.g. "he said   ( hi" becomes "he said (hi".
There are edge cases at the boundary to consider. This table captures all the cases:
If a left clinging character is found at the end of the sentence, the script will raise some kind of warning.

Right clinging

It's the conceptual mirror of left clinging: the whitespace to the right is contracted to a single space, and the whitespace to the left is removed.
If a right clinging character is found at the start of the sentence, the script will raise some kind of warning.

Left-Right Clinging

This one is more tricky because you need to use context to understand if it's left or right clinging:
The first single quote is left clinging, so the space to its left is shrunk to one character. The second single quote is right clinging, so the space to its right is shrunk to one character. There are ambiguous situations like:
There is space on either side of the character, so you don't know which space to remove. In those situations, the script will raise a warning and not change anything. It could figure out whether the quote is opening or closing by reading from the start of the sentence, but I'll wait to see if this is a real problem before putting in complex logic like that.

Unclinging

Spacing on each side is changed to exactly one space, except at the boundaries:
When an unclinging character is at the boundary, the script will raise a warning.

Undefined situations / consecutive punctuation

Consecutive punctuation can cause some edge case situations that you can't resolve with the rules above. For example, a left clinging character followed by an unclinging character: "(-"
If there are two consecutive punctuation characters, the situation is probably unusual and requires a human. For example, Bethany mentioned how two single quotes are sometimes used to represent a double quote. In those weird situations the script will bail out with a warning and not normalize that part of the sentence, as there's a good chance the punctuation characters are doing something unique and novel which the script isn't designed to handle. It also might make sense to bail out when punctuation is at the boundary of a sentence in an unexpected way, e.g. a left clinging character at the very end of a sentence.
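To make the clinging rules above concrete, here is a rough sketch of how they could be applied to a single sentence (hypothetical function and category sets, not the actual script; the left-right clinging and unclinging cases are omitted since they need the extra context discussed above):

```python
import re

# Hypothetical category sets; the real script would read these from the
# configuration file described above.
LEFT_CLINGING = {"(", "\u2018"}             # e.g. opening bracket, left single quote
RIGHT_CLINGING = {")", ",", ".", "!", "?"}  # e.g. closing bracket, comma, full stop


def normalize_clinging(sentence):
    """Apply the left/right clinging rules to a trimmed, non-empty sentence.

    Returns (normalized_sentence, warnings); anything ambiguous is left
    untouched and reported as a warning instead.
    """
    issues = []
    all_punct = "".join(LEFT_CLINGING | RIGHT_CLINGING)
    # Bail out on consecutive punctuation (e.g. "(-"): likely something unusual.
    if re.search("[" + re.escape(all_punct) + "]{2}", sentence):
        return sentence, ["consecutive punctuation, not normalizing"]
    for ch in LEFT_CLINGING:
        if sentence.endswith(ch):
            issues.append(f"left clinging {ch!r} at end of sentence")
        # "foo   ( bar" -> "foo (bar": left whitespace becomes one space,
        # right whitespace is removed.
        sentence = re.sub(r"\s*" + re.escape(ch) + r"\s*", " " + ch, sentence)
    for ch in RIGHT_CLINGING:
        if sentence.startswith(ch):
            issues.append(f"right clinging {ch!r} at start of sentence")
        # "foo , bar" -> "foo, bar": the mirror image.
        sentence = re.sub(r"\s*" + re.escape(ch) + r"\s*", ch + " ", sentence)
    return sentence.strip(), issues


print(normalize_clinging("he said   ( hi , there )"))
# -> ('he said (hi, there)', [])
```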
False negative/positive warnings

Above I list out potential issues with the script.
Heuristics built into the script could try to detect these and warn the user. These would just be warnings - the script won't override the user's configuration. The tolerance for these warnings can be tuned, i.e. only warn about certain characters once the number of false negatives goes over some threshold percentage. Otherwise the script might produce a snowstorm of warnings.

False negative detection (under normalization)

This is the script looking for characters it suspects are punctuation, but the configuration doesn't include them. Hypothetical example: the configuration is missing cursive opening and closing quotes and the script notices this and warns:
The script has a "catch-all" definition of punctuation: any character that is not a letter, a number, or whitespace. Regex supports Unicode categories for letters (\p{L}) and numbers (\p{N}), which can be used to express this.
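A sketch of that catch-all check (my illustration, using Python's standard unicodedata module, whose general categories starting with 'L' and 'N' mirror \p{L} and \p{N}):

```python
import unicodedata


def suspected_punctuation(sentence):
    r"""Return characters that are neither letters, numbers, nor whitespace.

    Unicode general categories starting with 'L' are letters and 'N' are
    numbers, mirroring the \p{L} / \p{N} regex classes.
    """
    suspects = set()
    for ch in sentence:
        if ch.isspace():
            continue
        category = unicodedata.category(ch)
        if not (category.startswith("L") or category.startswith("N")):
            suspects.add(ch)
    return suspects


# Curly quotes would be flagged if they are missing from the configuration.
print(suspected_punctuation("\u2018habari\u2019 yako 123"))  # {'‘', '’'}
```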
In the example above it would notice that the quotes are not letters or numbers and suspect them of being punctuation. The characters are also clinging to the edge of a word, which increases the chance they are punctuation.

False positive detection (over normalization)

False positives are where a punctuation character gets normalized, but it's not acting like a grammatical punctuation character.
In all the false positive examples I've been able to think of, the punctuation character is surrounded by non-whitespace characters. So I will make that the heuristic that triggers a warning.

Scope notes

For the next block of work I do:

In scope
Out of scope
Overview
Add data transformation/cleaning to the XRI script outlined in the parent issue: #472
Background
The creation of an initial script was captured in #473 but there's currently no data transformation logic.
Known issues
Examples of potential data quality issues that have been observed so far:
Split sentences
Record 548 in Ruwila.2024_08_12_normalized.tsv splits a sentence over 2 lines. I've only seen one instance of this. The corresponding unnormalized file didn't have it, so I think it was introduced by normalization. Making the script intelligent enough to handle this might be going too far.
Double spaces
Double spaces in the middle of sentences seem very common, e.g. about 750 of these in Ruwila.2024_08_12_normalized.tsv.
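The fix for this is simple; a sketch of the kind of cleanup the basic cleaning step could apply (not the actual script):

```python
import re


def collapse_spaces(text):
    # Collapse runs of two or more spaces into a single space.
    return re.sub(r" {2,}", " ", text)


print(collapse_spaces("Habari  za   asubuhi"))  # "Habari za asubuhi"
```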
Missing translations with !

The exclamation point ! is used where XRI translators couldn't produce a vernacular translation, e.g. in ngq_parallel_dataset_unified_2024-07-12_15-28-15_1.tsv.
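One way the script could treat these records (a sketch; the actual column names in the .tsv files may differ, and whether to drop or flag them is still an open question):

```python
import csv


def drop_untranslated(rows):
    """Skip records whose vernacular column is just '!' (no translation produced)."""
    for row in rows:
        if row.get("vernacular", "").strip() == "!":
            continue
        yield row


with open("ngq_parallel_dataset_unified_2024-07-12_15-28-15_1.tsv", newline="", encoding="utf-8") as f:
    kept = list(drop_untranslated(csv.DictReader(f, delimiter="\t")))
```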
Inconsistent spacing around punctuation
e.g. record 11 from Konongo.2024_08_12.tsv, and record 98 from ngq_parallel_dataset_unified_2024-07-12_15-28-15_1.tsv.

Typos
In the example below, the letter 'O' was replaced with the number '0' in record 3674 of swa_ngq_4k_sf.tsv, in the vernacular translation.

Leading colon characters
There are many examples in Ngoreme (swa_ngg_4k_sf.tsv) of a ':' at the start of a word. I'm not sure if this is part of the Ngoreme language or how we should treat it.
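A quick way to surface these for review (my illustration, using a placeholder sentence rather than real data):

```python
import re


def find_leading_colons(text):
    """Return words that start with a ':' character."""
    return re.findall(r"(?<!\S):\S+", text)


print(find_leading_colons("word :example another"))  # [':example']
```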
Potential issues I can't replicate
These are issues we might encounter, but I haven't been able to find examples of them so far in the .tsv data I've been given. Note that at the time of writing I only have .tsv files for the Tanzanian languages Ngoreme (ngq), Konongo (kcz), and Ruwila (rwl). We have older JSON-based XRI files for the Indonesian languages.

Leading and trailing whitespace
Leading/trailing whitespace before a LWC or vernacular translation would potentially cause issues.
I searched across the datasets I've been given for tab characters followed by space and found nothing except in the review files:
We can still handle it anyway, it's not complex.
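A sketch of the kind of per-record cleanup this would amount to (hypothetical helper):

```python
def clean_row(row):
    """Trim leading/trailing whitespace from every column of a parsed record."""
    return {key: value.strip() for key, value in row.items()}
```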
Duplicate LWC sentences
It's possible that a sentence will get entered multiple times and then produce multiple conflicting translations. To test for this, I hacked this into my ETL script:
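Roughly along these lines (my reconstruction of the idea, not the original code; the 'source'/'target' column names are placeholders):

```python
import csv
from collections import defaultdict


def report_duplicate_lwc(path):
    """Print LWC sentences that appear with more than one distinct translation."""
    translations = defaultdict(set)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            translations[row["source"]].add(row["target"])
    for source, targets in translations.items():
        if len(targets) > 1:
            print(f"Conflicting translations for {source!r}: {sorted(targets)}")
```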
then ran it across the files I'd been sent and it printed nothing.
Other stuff
Some things I've noticed that are noteworthy but not necessarily data quality issues which the script could fix:
Missing id's
I noticed 137 examples of missing id's in swa_ngq_4k_sf.tsv. For Ruwila and Konongo there were no missing records.
UPDATE: Michael asked XRI about this. The missing entries correspond to sentences that haven't been translated yet.
Plan