Discrepancy between how NER Annotator and spaCy are handling certain Unicode characters #119

Open
elifbeyzatok00 opened this issue Aug 8, 2024 · 7 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@elifbeyzatok00

elifbeyzatok00 commented Aug 8, 2024

I wanted to view a labeled JSON file with spaCy's displaCy, but the labels do not appear where I placed them.

I carefully label in the tool:
image

When I view it with spaCy's displaCy, unrelated spans are labeled, while the spans I actually annotated are not:
image

The code that I used to view labeled text with spacy displacy:

import json
import spacy
from spacy import displacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Path to the JSON file
file_path = "/content/annotations.json"

# Open the JSON file and load the data
with open(file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

    if 'annotations' in data:
        for annotation in data['annotations']:
            if annotation is not None:
                text = annotation[0]  # Text
                entities = [(ent[0], ent[1], ent[2]) for ent in annotation[1]['entities']]  # Entities

                # Prepare the data in the format displaCy expects
                spacy_displacy_data = {
                    "text": text,
                    "ents": [{"start": start, "end": end, "label": label} for start, end, label in entities],
                    "title": None
                }

                # Visualize with displaCy
                displacy.render(spacy_displacy_data, style="ent", manual=True, jupyter=True)
@alvi-khan
Collaborator

Hello @elifbeyzatok00. Your code seems fine to me. I also tried using it with a small annotation file and it worked as expected.

Could you please provide a sample annotation file for which you're facing the issue so that I can investigate further?

@elifbeyzatok00
Author

Thank you for your feedback. @alvi-khan

I've attached the sample file where I encountered the problem so you can investigate further.

sample.zip

@alvi-khan
Collaborator

alvi-khan commented Aug 8, 2024

Thanks @elifbeyzatok00. I've managed to replicate the issue now.

It seems there's some discrepancy between how NER Annotator and spaCy are handling certain Unicode characters, specifically '🔗' in this case.

If it is acceptable for your use case, an easy workaround is to just replace all instances of '🔗' with two spaces. I've attached a copy of the annotation file you provided in which I have made this replacement.

sample_with_emoji_replaced.zip

As you can see in the attached screenshot, it works correctly after this change.
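One reason a two-space replacement lines up is that '🔗' (U+1F517) lies outside the Basic Multilingual Plane, so it occupies two UTF-16 code units in a JavaScript string but a single code point in a Python string. The sketch below shows the idea in Python; `replace_astral_with_spaces` and `utf16_length` are hypothetical helper names, not part of NER Annotator or spaCy, and the sketch assumes the only troublesome characters are those above U+FFFF.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, which is what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


def replace_astral_with_spaces(text: str) -> str:
    """Replace every character above U+FFFF with two spaces (hypothetical helper)."""
    return "".join("  " if ord(ch) > 0xFFFF else ch for ch in text)


# "\U0001F517" is the link emoji used in this thread.
original = "This part is fine - \U0001F517 - but this part is not."
cleaned = replace_astral_with_spaces(original)

# After the replacement, Python indices match the UTF-16 offsets of the original text.
print(len(original), utf16_length(original), len(cleaned))
print(cleaned[34:38])
```

Because the cleaned string has the same length in Python as the original has in UTF-16 code units, offsets exported for the original text should point at the intended spans in the cleaned text.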

Screenshot 2024-08-09 003915

@alvi-khan
Collaborator

alvi-khan commented Aug 8, 2024

For a slightly more technical analysis of why this is happening, it seems that our tokenizer interprets the emoji '🔗' as two characters whereas spaCy interprets it as a single character. This results in the starting position for each entity after the emoji having an 'off by one' error. If we use multiple emojis, the effect becomes cumulative.

I've attached a minimal reproducible example which clearly shows this issue.

Text File: text.txt
Annotations: annotations.json

From NER Annotator:

Screenshot 2024-08-09 005326

From spaCy:

image

For this piece of text, the exported annotation is:

{"classes":["TEST"],"annotations":[["This part is fine - 🔗 - but this part is not.",{"entities":[[5,9,"TEST"],[34,38,"TEST"]]}]]}

Here, the second entity starts from index 34, which means there should be 34 characters in front of it. But if we count the characters (counting '🔗' as a single character), we will see that there are actually 33 characters in front of it. The first entity does not have this issue, correctly starting from index 5.
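The off-by-one can be checked directly in Python by slicing the exported text with the exported offsets. This is only a small illustration using the annotation shown above; the variable names are mine.

```python
# Text and entities copied from the exported annotation above.
# "\U0001F517" is the link emoji.
text = "This part is fine - \U0001F517 - but this part is not."
entities = [(5, 9, "TEST"), (34, 38, "TEST")]

# Python indexes strings by code point, so the emoji counts once.
slices = [text[start:end] for start, end, _ in entities]
print(slices)

# Where the second "part" actually starts when counted the Python way.
actual_start = text.index("part", 10)
print(actual_start)
```

The first slice is "part" as intended, while the second comes out shifted one character to the right, which matches what displaCy renders.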

We can also see that '🔗' is being interpreted as two characters if we switch the annotation precision to 'Character Level'.

image

I'll need some time to properly understand why this discrepancy exists and how to resolve it. @tecoholic, since this is related to the tokenization process, I would appreciate any hints you might be able to provide.

@alvi-khan alvi-khan changed the title from "When I examined the labeling I made in the Tool with Spacy Displacy, I noticed problems with many labeling." to "Discrepancy between how NER Annotator and spaCy are handling certain Unicode characters" Aug 8, 2024
@alvi-khan alvi-khan added the bug Something isn't working label Aug 8, 2024
@tecoholic
Owner

@alvi-khan Thanks for the thorough investigation of the issue. Could you check whether the NLTK tokenizer in Python produces the same effect? That is, does it also count this Unicode character as 2 characters? Since the JS tokenizer we use is a port of the NLTK tokenizer, I suspect that would be the case.

If it turns out the NLTK tokenizer has the same issue, then we will need to update our tokenizer to follow the spaCy tokenizer, as this is, after all, NER Annotator for spaCy.

Sidenote: This might be a good update to the software; we might end up handling non-English annotations properly as well.

@elifbeyzatok00
Author

@alvi-khan @tecoholic Thank you very much for all your help. I will clean the emojis before exporting the txt files to the NER Annotation Tool. In this way, I will prevent character shifts caused by emojis.

I'm impressed that you responded so quickly and investigated the issue thoroughly. You have a great team. Thank you very much again.🤩

@alvi-khan
Collaborator

alvi-khan commented Aug 9, 2024

I had a feeling this was an encoding issue.

In the JS port:

const TreebankTokenizer = require('treebank-tokenizer');

const tokenizer = new TreebankTokenizer();

console.log(tokenizer.span_tokenize("🔗"))

Output: [ [ 0, 2 ] ]

In Python:

from nltk.tokenize import TreebankWordTokenizer

list(TreebankWordTokenizer().span_tokenize("🔗"))

Output: [(0, 1)]

@tecoholic, you probably already know this, but for reference, the span_tokenize method in both the JS port and the original Python version calls an align_tokens method, which can be found in the utils package for both. This in turn uses the length of each token to determine the start and end of the span. This is where our problems start.

In JS "🔗".length == 2 but in Python len("🔗") == 1.

A detailed discussion on why this occurs is available here.

We can of course modify the align_tokens method in the JS port to force it to always give the same results as the Python variant by counting Unicode code points instead of UTF-16 code units (code points are how Python determines string lengths). In JS, the Array.from method does this for us, so Array.from("🔗").length == 1. A better example may be the '🤦🏼‍♂️' emoji, for which Array.from("🤦🏼‍♂️").length == 5. This is the same in Python, where len("🤦🏼‍♂️") == 5.
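For comparison, the two ways of counting can be reproduced from the Python side. This is a minimal sketch; `utf16_units` is a hypothetical helper that mimics what JS `.length` reports by counting UTF-16 code units, and the emojis are written as escape sequences so the code points are explicit.

```python
def utf16_units(text: str) -> int:
    """Count UTF-16 code units, i.e. what JavaScript's .length would return."""
    return len(text.encode("utf-16-le")) // 2


link = "\U0001F517"  # the link emoji from this thread
# Facepalm + skin tone modifier + zero-width joiner + male sign + variation selector
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

for name, s in [("link", link), ("facepalm", facepalm)]:
    # len() counts code points, like Array.from(s).length in JS.
    print(name, "code points:", len(s), "UTF-16 units:", utf16_units(s))
```

Only characters above U+FFFF contribute an extra unit, so the two counts differ by the number of such characters in the string.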

We need to decide whether we want to make this change or leave it as a known issue that we won't (or rather can't) fix. Since the issue, fundamentally, occurs due to a difference in how the two languages handle strings, changing the behavior may break things in unexpected and confusing ways.

There might also be an alternative approach that allows us to handle the issue in the Python script, but I can't think of one at the moment.
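One possible shape for a Python-side fix, offered only as a sketch: treat the exported offsets as UTF-16 code unit offsets (an assumption based on the findings above) and remap them to code point offsets before handing them to displaCy. The function names `utf16_to_codepoint_map` and `fix_entities` are hypothetical and not part of NER Annotator, NLTK, or spaCy.

```python
def utf16_to_codepoint_map(text: str) -> list[int]:
    """For each UTF-16 code unit offset, give the matching code point offset."""
    mapping = []
    for index, ch in enumerate(text):
        mapping.append(index)
        if ord(ch) > 0xFFFF:
            # Characters above U+FFFF take two UTF-16 code units.
            mapping.append(index)
    mapping.append(len(text))  # offset just past the end of the text
    return mapping


def fix_entities(text, entities):
    """Convert (start, end, label) spans from UTF-16 offsets to code point offsets."""
    mapping = utf16_to_codepoint_map(text)
    return [(mapping[start], mapping[end], label) for start, end, label in entities]


text = "This part is fine - \U0001F517 - but this part is not."
exported = [(5, 9, "TEST"), (34, 38, "TEST")]
fixed = fix_entities(text, exported)
print(fixed)
print([text[start:end] for start, end, _ in fixed])
```

With the minimal example from this thread, both spans come back as "part". This leaves the tool's output untouched and moves the conversion into the consuming script, at the cost of every consumer having to know about it.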

@tecoholic tecoholic added the help wanted Extra attention is needed label Dec 3, 2024

3 participants