toyama0919/embulk-filter-kuromoji

Kuromoji filter plugin for Embulk

Kuromoji filter plugin for Embulk, with NEologd support.

Reference

Overview

  • Plugin type: filter

Configuration

  • tokenizer: tokenizer to use, kuromoji or neologd. (string, default: kuromoji)
  • mode: tokenization mode, normal, search, or extended. (string, default: normal)
  • use_stop_tag: neologd only. (bool, default: false)
  • key_names: names of the input columns to tokenize. (list, required)
  • keep_input: keep the input columns. (bool, default: true)
  • ok_parts_of_speech: parts of speech to accept. (list, default: null)
  • dictionary_path: path to a user dictionary file. (string, default: null)
  • settings: output settings, one entry per output column. (list, required)
    • suffix: suffix appended to the input column name to form the output column name. If null, the input column is overwritten. (string, default: null)
    • method: token attribute to output: surface_form, base_form, or reading. (string, required)
    • delimiter: string placed between tokens. (string, default: ",")
    • type: output data type, array or string. array is emitted as json type. (string, default: "string")
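
As a minimal sketch, a configuration that sets only the two required options (key_names and settings) and leaves everything else at its default might look like the following. The column name `title` is a hypothetical placeholder, not something the plugin defines.

```yaml
filters:
  - type: kuromoji
    # Required: input columns to tokenize ("title" is a hypothetical column name).
    key_names:
      - title
    # Required: one entry per output column.
    settings:
      # No suffix, so the "title" column itself is overwritten with its reading.
      - { method: 'reading' }
      # With a suffix, a new column "title_tokens" is added as a json array.
      - { suffix: _tokens, method: 'surface_form', type: 'array' }
```

Because delimiter is not set in the first entry, the documented default "," would separate the tokens.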

Neologd Example

filters:
  - type: kuromoji
    tokenizer: neologd
    use_stop_tag: true
    key_names:
      - catchcopy
    settings:
      - { method: 'reading', delimiter: '' }
      - { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
      - { suffix: _base_form, method: 'base_form', delimiter: '###' }
      - { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
      - { suffix: _array, method: 'surface_form', type: 'array' }

Pure kuromoji Example

filters:
  - type: kuromoji
    keep_input: false
    mode: search
    ok_parts_of_speech:
      - 名詞
    key_names:
      - catchcopy
    settings:
      - { method: 'reading', delimiter: '' }
      - { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
      - { suffix: _base_form, method: 'base_form', delimiter: '###' }
      - { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
      - { suffix: _array, method: 'surface_form', type: 'array' }

Input

{
    "catchcopy" : "安全・安心を追及した曲面ボディにデザインを一新しました。"
}

Output

{
    "catchcopy" : "アンゼン・アンシンヲツイキュウシタキョクメンボディニデザインヲイッシン。",
    "catchcopy_surface_form_no_delim" : "安全・安心を追及した曲面ボディにデザインを一新。",
    "catchcopy_base_form" : "安全###・###安心###を###追及###する###た###曲面###ボディ###に###デザイン###を###一新###。",
    "catchcopy_surface_form" : "安全###・###安心###を###追及###し###た###曲面###ボディ###に###デザイン###を###一新###。",
    "catchcopy_array" : ["安全","","安心","","追及","","","曲面","ボディ","","デザイン","","一新",""]
}
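
To try the filter end to end, it has to sit between an input and an output plugin. The following is a rough sketch that assumes Embulk's built-in file input, csv parser, and stdout output. The path prefix `/tmp/catchcopy_` and the single-column file layout are hypothetical and only illustrate where the filters section goes.

```yaml
in:
  type: file
  # Hypothetical location of the input files, one catchcopy per line.
  path_prefix: /tmp/catchcopy_
  parser:
    type: csv
    charset: UTF-8
    columns:
      - { name: catchcopy, type: string }
filters:
  - type: kuromoji
    key_names:
      - catchcopy
    settings:
      # Adds "catchcopy_reading" next to the original column (keep_input defaults to true).
      - { suffix: _reading, method: 'reading', delimiter: '' }
out:
  type: stdout
```

Assuming the config is saved as config.yml (a hypothetical file name), `embulk preview config.yml` should show the resulting columns without writing any output.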

Example 2 (using a user dictionary)

filters:
  - type: kuromoji
    keep_input: false
    dictionary_path: /tmp/kuromoji.txt
    ok_parts_of_speech:
      - 名詞
    key_names:
      - catchcopy
    settings:
      - { method: 'reading', delimiter: '#' }
      - { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
      - { suffix: _base_form, method: 'base_form', delimiter: '###' }
      - { suffix: _surface_form, method: 'surface_form', delimiter: '###' }

User dictionary example

西国分寺,西国分寺,ニシコクブンジ,駅名
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞

Build

$ ./gradlew gem  # -t watches for file changes and rebuilds continuously

About

Morphological analysis plugin for Embulk.
