This repository was archived by the owner on Mar 9, 2023. It is now read-only.
Problem with user defined dictionary #143
Open
Description
I am using SudachiPy via GiNZA and am trying to annotate the following sentences:
プロ野球の中日で選手、監督を務め、1月4日に70歳で死去した星野仙一氏をしのび、3日、名古屋市東区のナゴヤドームで行われた中日―楽天のオープン戦は追悼試合として開催された。
明治大の後輩、島内宏明外野手は「改めてすごい人だったんだなと思った」と話した。
In my dictionary I have the following lines, which match 明治 and 楽天 in the above.
There are no other lines in the dictionary that match any substrings in the sentence.
楽天,1288,1288,100,楽天_4755-2018,名詞,固有名詞,組織,上場会社,*,*,RAKUTEN,楽天,*,*,*,*,*
明治,1288,1288,100,明治_2261-2009,名詞,固有名詞,組織,上場会社,*,*,MEIJI,明治,*,*,*,*,*
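To make the two entries easier to compare, here is a small sketch that splits them into named fields. The column names and positions follow my reading of the Sudachi user dictionary CSV layout (18 columns, with columns 5 to 10 holding the part of speech); they are assumptions for illustration and are not taken from SudachiPy's code.

```python
import csv
import io

# The two user dictionary lines quoted above, verbatim.
USER_DICT_CSV = """\
楽天,1288,1288,100,楽天_4755-2018,名詞,固有名詞,組織,上場会社,*,*,RAKUTEN,楽天,*,*,*,*,*
明治,1288,1288,100,明治_2261-2009,名詞,固有名詞,組織,上場会社,*,*,MEIJI,明治,*,*,*,*,*
"""

# Assumed layout (hypothetical names): columns 5..10 are the part of speech.
POS_SLICE = slice(5, 11)


def parse_entries(text):
    """Return one dict per CSV row with the fields relevant to this issue."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        entries.append({
            "surface": row[0],
            "left_id": int(row[1]),
            "right_id": int(row[2]),
            "cost": int(row[3]),
            "headword": row[4],
            "pos": tuple(row[POS_SLICE]),
            "reading": row[11],
            "column_count": len(row),
        })
    return entries


entries = parse_entries(USER_DICT_CSV)
for entry in entries:
    print(entry["surface"], entry["column_count"], entry["pos"])
```

Under that layout, both rows have the same number of columns, the same connection IDs and cost, and the same part-of-speech tuple, so the entries differ only in surface, headword, reading, and normalized form.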
When I try to run annotation with this configuration, I get the error below:
...
File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/language.py", line 441, in __call__
doc = self.make_doc(text)
File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 281, in make_doc
return self.tokenizer(text)
File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 144, in __call__
dtokens = self._get_dtokens(sudachipy_tokens)
File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 182, in _get_dtokens
) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0
File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 182, in <listcomp>
) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0
File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/sudachipy/morpheme.py", line 36, in part_of_speech
return self.list.grammar.get_part_of_speech_string(wi.pos_id)
File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/sudachipy/dictionarylib/grammar.py", line 55, in get_part_of_speech_string
return self.pos_list[pos_id]
IndexError: list index out of range
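The last frame shows that the exception comes from indexing `pos_list` with `pos_id`, so the `pos_id` stored for some morpheme is not a valid index into that list. The sketch below is a toy stand-in, not SudachiPy's actual grammar class: `ToyGrammar`, its contents, and the ID values are hypothetical and only illustrate the failure mode, not its root cause.

```python
class ToyGrammar:
    """Hypothetical stand-in that stores part-of-speech tuples in a list."""

    def __init__(self, pos_list):
        self.pos_list = list(pos_list)

    def get_part_of_speech_string(self, pos_id):
        # Same kind of lookup as the last frame of the traceback.
        return self.pos_list[pos_id]


def lookup(grammar, pos_id):
    """Return the POS tuple, or a description of the IndexError if it fails."""
    try:
        return grammar.get_part_of_speech_string(pos_id)
    except IndexError as exc:
        return "IndexError: {}".format(exc)


# Two made-up POS entries, so the only valid IDs are 0 and 1.
grammar = ToyGrammar([
    ("名詞", "普通名詞", "一般", "*", "*", "*"),
    ("名詞", "固有名詞", "一般", "*", "*", "*"),
])

print(lookup(grammar, 1))  # valid ID
print(lookup(grammar, 2))  # one past the end of the list
```

The second lookup fails in the same way as the traceback, which suggests checking which `pos_id` the failing morpheme carries and how many entries `pos_list` holds at that point.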
Could someone advise me as to what is causing this error please?
I am quite certain the sentence containing 明治 is causing the issue, because if I remove the second sentence, the annotation works fine. It therefore seems that 楽天 is being picked up by SudachiPy from the dictionary, but 明治 is not.
Why is this?