
Bad performance for other language #76

Closed
JuanFF opened this issue Aug 16, 2022 · 3 comments
Labels: usage, wontfix (This will not be worked on)

Comments


JuanFF commented Aug 16, 2022

Hello,
I'm trying to use the contextual spell checker for Spanish. I ran the script from https://github.com/R1j1t/contextualSpellCheck/blob/88bbbb46252c534679b185955fd88c239ed548a7/examples/ja_example.py with the following custom configuration:

```python
import spacy
import contextualSpellCheck

nlp = spacy.load("es_dep_news_trf")

nlp.add_pipe(
    "contextual spellchecker",
    config={
        "model_name": "bert-base-multilingual-cased",
        "max_edit_dist": 2,
    },
)

doc = nlp("La economia a crecido un dos por ciento.")
print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)
```

but I don't get the desired result.

"La economia a crecido un dos por ciento" should be corrected to "La economía ha crecido un dos por ciento". Instead, I get "La economia a crecido un dos por cento".

If I use another pre-trained model (e.g. "model_name": "PlanTL-GOB-ES/roberta-large-bne"), the result is still wrong: Laeconomiaacrecidoundosporciento. ??
I wonder whether I'm using the proper script to run the spell checker in another language.

@JuanFF JuanFF added the bug Something isn't working label Aug 16, 2022
@JuanFF JuanFF changed the title [BUG] Bad performance for other language Aug 16, 2022
Owner

R1j1t commented Aug 17, 2022

Hi @JuanFF, I have the following two observations:

  1. contextualSpellCheck would be unable to change "a" to "ha". Details here

  2. The problem with "ciento" is caused by the BERT model bert-base-multilingual-cased. If the user passes no vocabulary (vocab) file, the component falls back to the vocab of the BERT model, and "ciento" is not in it:

     ```
     >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
     >>> 'ciento' in tokenizer.get_vocab()
     False
     >>> doc._.suggestions_spellCheck
     {ciento: 'cento'}
     >>> # 'cento' is hundred in Portuguese (Brazil)
     >>>
     ```
    

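Since the out-of-vocab token is replaced by a candidate within `max_edit_dist` of the original, a vocab entry like "cento" (edit distance 1 from "ciento") is an acceptable replacement even though it is the wrong language. A minimal sketch of that distance check, using a plain Levenshtein implementation (this is an illustration, not the library's internal code):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# "cento" is within max_edit_dist=2 of "ciento", so it is a valid
# replacement candidate even though it is Portuguese, not Spanish.
print(levenshtein("ciento", "cento"))   # 1
print(levenshtein("ciento", "ciento"))  # 0
```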
If you don't want to change the BERT model, I would suggest passing the vocab file (example) separately, like:


```
>>> vocab_path = "es_vocab.txt"
>>>
>>> nlp.add_pipe(
...     "contextual spellchecker",
...     config={
...         "model_name": "bert-base-multilingual-cased",
...         "max_edit_dist": 2,
...         "vocab_path": vocab_path,
...     },
... )
testVocab.txt
inside vocab path
file opened!
Inside [unused....]
<contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x7fa607daee80>
>>> doc = nlp("La economia a crecido un dos por ciento.")
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.outcome_spellCheck)
La economia a crecido un dos por ciento.
>>>
```
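For completeness, here is a minimal sketch of producing such a vocab file, assuming the expected format is one token per line (the file name `es_vocab.txt` matches the session above; the word list is illustrative — a real vocab would come from a Spanish frequency list or dictionary):

```python
# Write a tiny Spanish vocab file in the one-token-per-line format.
# The word list here is only illustrative.
words = [
    "la", "economía", "economia", "ha", "crecido",
    "un", "dos", "por", "ciento",
]

with open("es_vocab.txt", "w", encoding="utf-8") as fh:
    fh.write("\n".join(words) + "\n")

# Sanity check: "ciento" is now in the vocab, unlike in the
# multilingual BERT tokenizer's vocab shown above.
with open("es_vocab.txt", encoding="utf-8") as fh:
    vocab = {line.strip() for line in fh if line.strip()}
print("ciento" in vocab)  # True
```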

Owner

R1j1t commented Aug 17, 2022

I have a pending issue #44 on a similar topic, but lately I have been pretty occupied. If you think you can contribute, please open a PR! The project would be glad to have your contribution!

@R1j1t R1j1t added usage and removed bug Something isn't working labels Aug 21, 2022

stale bot commented Sep 20, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix This will not be worked on label Sep 20, 2022
@stale stale bot closed this as completed Sep 27, 2022