
Bad performance for other language #76

Closed
JuanFF opened this issue Aug 16, 2022 · 3 comments
Labels: usage, wontfix (This will not be worked on)

Comments


JuanFF commented Aug 16, 2022

Hello,
I'm trying to use the contextual spell checker for Spanish. I ran the script from https://github.com/R1j1t/contextualSpellCheck/blob/88bbbb46252c534679b185955fd88c239ed548a7/examples/ja_example.py with the following custom configuration:

```python
import spacy
import contextualSpellCheck

nlp = spacy.load("es_dep_news_trf")

nlp.add_pipe(
    "contextual spellchecker",
    config={
        "model_name": "bert-base-multilingual-cased",
        "max_edit_dist": 2,
    },
)

doc = nlp("La economia a crecido un dos por ciento.")
print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)
```

but I don't get the desired result.

"La economia a crecido un dos por ciento" should be corrected to "La economía ha crecido un dos por ciento". Instead, I get "La economia a crecido un dos por cento".

If I use another pre-trained model (e.g. "model_name": "PlanTL-GOB-ES/roberta-large-bne"), the result is still wrong: Laeconomiaacrecidoundosporciento. ??
I wonder whether I'm using the proper script to run the spell checker in another language.

@JuanFF JuanFF added the bug Something isn't working label Aug 16, 2022
@JuanFF JuanFF changed the title [BUG] Bad performance for other language Aug 16, 2022
Owner

R1j1t commented Aug 17, 2022

Hi @JuanFF, I have the following two observations:

  1. contextualSpellCheck would be unable to change "a" to "ha". Details here

  2. The problem with "ciento" is caused by the BERT model bert-base-multilingual-cased. If the user passes no vocabulary (vocab) file, the component falls back to the vocab of the BERT model, and "ciento" is not in it:

     ```
     >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
     >>> 'ciento' in tokenizer.get_vocab()
     False
     >>> doc._.suggestions_spellCheck
     {ciento: 'cento'}
     >>> # 'cento' is hundred in Portuguese (Brazil)
     >>>
     ```
    

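Since the out-of-vocab token is replaced by a candidate within `max_edit_dist` of the original, a vocab entry like "cento" (edit distance 1 from "ciento") is an acceptable replacement even though it is the wrong language. A minimal sketch of that distance check, using a plain Levenshtein implementation (this is an illustration, not the library's internal code):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# "cento" is within max_edit_dist=2 of "ciento", so it is a valid
# replacement candidate even though it is Portuguese, not Spanish.
print(levenshtein("ciento", "cento"))   # 1
print(levenshtein("ciento", "ciento"))  # 0
```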
If you don't want to change the BERT model, I would suggest passing the vocab file (example) separately, like:


```
>>> vocab_path = "es_vocab.txt"
>>>
>>> nlp.add_pipe(
...     "contextual spellchecker",
...     config={
...         "model_name": "bert-base-multilingual-cased",
...         "max_edit_dist": 2,
...         "vocab_path": vocab_path,
...     },
... )
testVocab.txt
inside vocab path
file opened!
Inside [unused....]
<contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x7fa607daee80>
>>> doc = nlp("La economia a crecido un dos por ciento.")
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.outcome_spellCheck)
La economia a crecido un dos por ciento.
>>>
```
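For completeness, here is a minimal sketch of producing such a vocab file, assuming the expected format is one token per line (the file name `es_vocab.txt` matches the session above; the word list is illustrative — a real vocab would come from a Spanish frequency list or dictionary):

```python
# Write a tiny Spanish vocab file in the one-token-per-line format.
# The word list here is only illustrative.
words = [
    "la", "economía", "economia", "ha", "crecido",
    "un", "dos", "por", "ciento",
]

with open("es_vocab.txt", "w", encoding="utf-8") as fh:
    fh.write("\n".join(words) + "\n")

# Sanity check: "ciento" is now in the vocab, unlike in the
# multilingual BERT tokenizer's vocab shown above.
with open("es_vocab.txt", encoding="utf-8") as fh:
    vocab = {line.strip() for line in fh if line.strip()}
print("ciento" in vocab)  # True
```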

Owner

R1j1t commented Aug 17, 2022

I have a pending issue #44 on a similar topic, but lately I have been pretty occupied. If you think you can contribute, please open a PR! The project would be glad to have your contribution!

@R1j1t R1j1t added usage and removed bug Something isn't working labels Aug 21, 2022

stale bot commented Sep 20, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix This will not be worked on label Sep 20, 2022
@stale stale bot closed this as completed Sep 27, 2022