This repository was archived by the owner on Mar 9, 2023. It is now read-only.

Problem with user defined dictionary #143

Open
@JSB97

I am using SudachiPy via GiNZA and am trying to annotate the following sentences.

プロ野球の中日で選手、監督を務め、1月4日に70歳で死去した星野仙一氏をしのび、3日、名古屋市東区のナゴヤドームで行われた中日―楽天のオープン戦は追悼試合として開催された。
明治大の後輩、島内宏明外野手は「改めてすごい人だったんだなと思った」と話した。

My user dictionary contains the following lines, which match 楽天 and 明治 in the sentences above.
No other lines in the dictionary match any substring of these sentences.

楽天,1288,1288,100,楽天_4755-2018,名詞,固有名詞,組織,上場会社,*,*,RAKUTEN,楽天,*,*,*,*,*
明治,1288,1288,100,明治_2261-2009,名詞,固有名詞,組織,上場会社,*,*,MEIJI,明治,*,*,*,*,*
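
As a quick sanity check on the two entries, here is a minimal sketch that parses them and compares their shape. It assumes the Sudachi user dictionary CSV has 18 columns and that columns 5 to 10 (0-based) hold the six part-of-speech levels; `summarize` and `EXPECTED_COLUMNS` are names made up for this example.

```python
import csv
import io

# The two entries exactly as quoted above.
USER_DICT = """\
楽天,1288,1288,100,楽天_4755-2018,名詞,固有名詞,組織,上場会社,*,*,RAKUTEN,楽天,*,*,*,*,*
明治,1288,1288,100,明治_2261-2009,名詞,固有名詞,組織,上場会社,*,*,MEIJI,明治,*,*,*,*,*
"""

EXPECTED_COLUMNS = 18  # assumption: column count of the Sudachi user dictionary CSV


def summarize(text):
    """Return (surface, column_count, pos_tuple) for each dictionary row."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        # Assumption: columns 5..10 (0-based) hold the six part-of-speech levels.
        rows.append((fields[0], len(fields), tuple(fields[5:11])))
    return rows


for surface, n_cols, pos in summarize(USER_DICT):
    status = "ok" if n_cols == EXPECTED_COLUMNS else "unexpected column count"
    print(surface, n_cols, status, ",".join(pos))
```

Both rows come out with the same column count and the same part-of-speech tuple, so at least on the surface the 明治 entry is not formatted differently from the 楽天 one.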

When I try to run the annotation with this configuration, I get the following error:

... 

  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/language.py", line 441, in __call__
    doc = self.make_doc(text)
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 281, in make_doc
    return self.tokenizer(text)
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 144, in __call__
    dtokens = self._get_dtokens(sudachipy_tokens)
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 182, in _get_dtokens
    ) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 182, in <listcomp>
    ) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/sudachipy/morpheme.py", line 36, in part_of_speech
    return self.list.grammar.get_part_of_speech_string(wi.pos_id)
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/sudachipy/dictionarylib/grammar.py", line 55, in get_part_of_speech_string
    return self.pos_list[pos_id]
IndexError: list index out of range
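
Reading the last frame, the failure is a plain list lookup: `pos_id` points past the end of `pos_list`. A toy sketch of that situation follows. The table contents and the `safe_lookup` helper are hypothetical and only illustrate the failure mode; they are not SudachiPy's actual data or code.

```python
# Toy stand-in for a POS table; the contents are hypothetical, not SudachiPy's data.
pos_list = [
    ("名詞", "普通名詞", "一般", "*", "*", "*"),
    ("名詞", "固有名詞", "一般", "*", "*", "*"),
]


def get_part_of_speech_string(pos_id):
    """Same shape as the failing call in the traceback: a plain list index."""
    return pos_list[pos_id]


def safe_lookup(pos_id):
    """Same lookup, but return None for an out-of-range ID instead of raising."""
    if 0 <= pos_id < len(pos_list):
        return pos_list[pos_id]
    return None


print(safe_lookup(1))  # an ID that exists in the table
print(safe_lookup(2))  # an ID past the end of the table
```

Calling `get_part_of_speech_string(2)` on this table raises the same `IndexError: list index out of range`, which suggests that some token ends up carrying a POS ID that the loaded grammar does not know about.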

Could someone advise me on what is causing this error?

I am quite certain that the sentence containing 明治 is causing the issue: if I remove the second sentence, the annotation works fine. It therefore seems that SudachiPy picks up 楽天 from the dictionary correctly, but not 明治.

Why is this?
