Discrepancy between how NER Annotator and spaCy are handling certain Unicode characters #119
Hello @elifbeyzatok00. Your code seems fine to me. I also tried using it with a small annotation file and it worked as expected. Could you please provide a sample annotation file for which you're facing the issue so that I can investigate further?
Thank you for your feedback, @alvi-khan. I've attached the sample file where I encountered the problem so you can investigate further.
Thanks @elifbeyzatok00. I've managed to replicate the issue now. It seems there's some discrepancy between how NER Annotator and spaCy are handling certain Unicode characters, specifically '🔗' in this case. If it is acceptable for your use case, an easy workaround is to just replace all instances of '🔗' with two spaces. I've attached a copy of the annotation file you provided in which I have made this replacement: sample_with_emoji_replaced.zip

As you can see in the attached screenshot, it works correctly after this change.
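If it helps, the same replacement can also be scripted before importing the text; a rough sketch (the file paths here are just placeholders):

```python
# Replace each '🔗' with two spaces so the character offsets the annotator
# exports stay aligned with what spaCy counts. Paths are placeholders.
with open("sample.txt", encoding="utf-8") as f:
    text = f.read()

with open("sample_cleaned.txt", "w", encoding="utf-8") as f:
    f.write(text.replace("🔗", "  "))
```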
For a slightly more technical analysis of why this is happening, it seems that our tokenizer interprets the emoji '🔗' as two characters whereas spaCy interprets it as a single character. This results in the starting position for each entity after the emoji having an 'off by one' error. If we use multiple emojis, the effect becomes cumulative. I've attached a minimal reproducible example which clearly shows this issue.

Text File: text.txt
From NER Annotator:
From spaCy:

For this piece of text, the exported annotation is:

```json
{"classes":["TEST"],"annotations":[["This part is fine - 🔗 - but this part is not.",{"entities":[[5,9,"TEST"],[34,38,"TEST"]]}]]}
```

Here, the second entity starts from index 34, which means there should be 34 characters in front of it. But if we count the characters (counting '🔗' as a single character), we will see that there are actually 33 characters in front of it. The first entity does not have this issue, correctly starting from index 5. We can also see that '🔗' is being interpreted as two characters if we switch the annotation precision to 'Character Level'. I'll need some time to properly understand why this discrepancy exists and how to resolve it. @tecoholic, since this is related to the tokenization process, I would appreciate any hints you might be able to provide.
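To make the off-by-one concrete, here is a quick slicing check against the same text and the exported offsets (illustrative only):

```python
text = "This part is fine - 🔗 - but this part is not."

print(len(text))     # 45: Python counts '🔗' as a single code point
print(text[5:9])     # 'part' -> the first entity is correct
print(text[34:38])   # 'art ' -> the exported second entity is shifted by one
print(text[33:37])   # 'part' -> the span that was actually annotated
```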
@alvi-khan Thanks for the thorough investigation of the issue. Can you kindly see if the NLTK tokenizer in Python also produces the same effect? As in, does it also count the Unicode character as 2 characters? Since the JS tokenizer we use is a port of the NLTK tokenizer, I suspect that would be the case. If it turns out the NLTK tokenizer also has the same issue, then we will need to update our tokenizer to follow the spaCy tokenizer, as this is, after all, NER Annotator for spaCy. Sidenote: this might be a good update to the software; we might end up handling non-English annotations properly as well.
@alvi-khan @tecoholic Thank you very much for all your help. I will clean the emojis before exporting the txt files to the NER Annotation Tool. In this way, I will prevent character shifts caused by emojis. I'm impressed that you responded so quickly and investigated the issue thoroughly. You have a great team. Thank you very much again. 🤩
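For example, a general version of that cleanup (a sketch; it assumes every character outside the Basic Multilingual Plane should be replaced the same way as '🔗'):

```python
import re

# Characters above U+FFFF are stored as surrogate pairs in JavaScript strings,
# so they are the ones that shift the exported offsets.
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")

def clean_for_annotation(text: str) -> str:
    # Two spaces per character keep the exported offsets usable with spaCy.
    return NON_BMP.sub("  ", text)

print(clean_for_annotation("This part is fine - 🔗 - but this part is not."))
```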
I had a feeling this was an encoding issue. In the JS port:

```js
const TreebankTokenizer = require('treebank-tokenizer');
const tokenizer = new TreebankTokenizer();
console.log(tokenizer.span_tokenize("🔗"));
```

In Python:

```python
from nltk.tokenize import TreebankWordTokenizer
list(TreebankWordTokenizer().span_tokenize("🔗"))
```

@tecoholic, you probably already know this, but for reference, the two produce different spans: the JS port treats '🔗' as two characters, while the Python version treats it as one. A detailed discussion on why this occurs is available here. We can of course modify the tokenizer. We need to decide whether we want to make this change or leave it as a known issue that we won't (or rather can't) fix. Since the issue, fundamentally, occurs due to a difference in how the two languages handle strings, changing the behavior may break things in unexpected and confusing ways. There might also be an alternative approach that allows us to handle the issue in the Python script, but I can't think of one at the moment.
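For illustration (not part of either tokenizer), the same difference can be reproduced from Python alone by counting UTF-16 code units, which is what JavaScript string lengths report:

```python
s = "🔗"  # U+1F517, outside the Basic Multilingual Plane

print(len(s))                           # 1 code point: what Python and spaCy count
print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units: what "🔗".length gives in JS
```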
I wanted to display a JSON file I had labeled, using spaCy's displacy, but the problem persists.
I carefully label in the tool:
When I view it with spaCy's displacy, irrelevant places are labeled, but the places that should be are not:
The code that I used to view the labeled text with spaCy's displacy:
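For reference, rendering an NER Annotator export with displacy's manual mode looks roughly like the sketch below (the file name is a placeholder, not the exact snippet used in this comment):

```python
import json
from spacy import displacy

# Load the JSON exported by NER Annotator (path is a placeholder).
with open("annotations.json", encoding="utf-8") as f:
    data = json.load(f)

text, ann = data["annotations"][0]

# displacy's "manual" mode renders raw character spans without running a model,
# so any offset mismatch shows up directly in the highlighted text.
doc = {
    "text": text,
    "ents": [
        {"start": start, "end": end, "label": label}
        for start, end, label in ann["entities"]
    ],
}
displacy.serve(doc, style="ent", manual=True)
```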