Skip to content

Latest commit

 

History

History
112 lines (82 loc) · 5.42 KB

README.md

File metadata and controls

112 lines (82 loc) · 5.42 KB

RDRPOSTagger

R package to perform Parts of Speech tagging and morphological tagging based on the Ripple Down Rules-based Part-Of-Speech Tagger (RDRPOS) available at https://github.com/datquocnguyen/RDRPOSTagger. RDRPOSTagger supports pre-trained POS tagging models for 45 languages.

The R package allows you to perform 3 types of tagging.

  • UniversalPOS annotation where a reduced Part of Speech and globally used tagset which is consistent across languages is used to assign words with a certain label. This type of tagging is available for the following languages: Ancient_Greek, Ancient_Greek-PROIEL, Arabic, Basque, Belarusian, Bulgarian, Catalan, Chinese, Coptic, Croatian, Czech, Czech-CAC, Czech-CLTT, Danish, Dutch, Dutch-LassySmall, English, English-LinES, English-ParTUT, Estonian, Finnish, Finnish-FTB, French, French-ParTUT, French-Sequoia, Galician, Galician-TreeGal, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Italian-ParTUT, Japanese, Korean, Latin, Latin-ITTB, Latin-PROIEL, Latvian, Lithuanian, Norwegian-Bokmaal, Norwegian-Nynorsk, Old_Church_Slavonic, Persian, Polish, Portuguese, Portuguese-BR, Romanian, Russian, Russian-SynTagRus, Slovak, Slovenian, Slovenian-SST, Spanish, Spanish-AnCora, Swedish, Swedish-LinES, Tamil, Turkish, Urdu, Vietnamese.
  • POS for doing Parts of Speech annotation based on an extended language/treebank-specific POS tagset. for This type of tagging is available for the following languages: English, French, German, Hindi, Italian, Thai, Vietnamese
  • MORPH with very detailed morphological annotation. This type of tagging is available for the following languages: Bulgarian, Czech, Dutch, French, German, Portuguese, Spanish, Swedish

This is based on corpora collected and made available at http://universaldependencies.org.

Examples on Parts of Speech tagging

The following shows how to use the package

library(RDRPOSTagger)
models <- rdr_available_models()
models$MORPH$language
models$POS$language
models$UniversalPOS$language

x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = x)

x <- c("Dus godvermehoeren met pus in alle puisten, zei die schele van Van Bukburg.", 
       "Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
       "  ", "", NA)
tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
rdr_pos(tagger, x = x)

tagger <- rdr_model(language = "Dutch", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)

The output of the POS tagging shows the following elements:

 doc_id token_id            token   pos
     d1        1              Dus   ADV
     d1        2   godvermehoeren  VERB
     d1        3              met   ADP
     d1        4              pus  NOUN
     d1        5               in   ADP
     d1        6             alle  PRON
     d1        7          puisten  NOUN
     d1        8                , PUNCT
     d1        9              zei  VERB
     d1       10              die  PRON
     d1       11           schele   ADJ
     d1       12              van   ADP
     d1       13              Van PROPN
     d1       14          Bukburg PROPN
     d1       15                . PUNCT
     d2        1               Er   ADV
     d2        2              was   AUX
     d2        3             toen SCONJ
     d2        4              dat SCONJ
     d2        5           liedje  NOUN
     d2        6              van   ADP
     d2        7 tietenkonttieten  VERB
     d2        8             kont PROPN
     d2        9           tieten  VERB
     d2       10     kontkontkont PROPN
     d3        0             <NA>  <NA>
     d4        0             <NA>  <NA>
     d5        0             <NA>  <NA>

More information about the model and the tagging can be found at https://github.com/datquocnguyen/RDRPOSTagger

The general architecture and experimental results of RDRPOSTagger can be found in the following papers:

Installation

Installation can easily be done as follows.

install.packages("rJava")
install.packages("data.table")
install.packages("RDRPOSTagger", repos = "http://www.datatailor.be/rcube", type = "source")

Or with devtools

devtools::install_github("bnosac/RDRPOSTagger", build_vignettes = TRUE)

More details in the package documentation and package vignette

vignette("rdrpostagger-overview", package = "RDRPOSTagger")

License

The package is licensed under the GPL-3 license as described at http://www.gnu.org/licenses/gpl-3.0.html.

Support in text mining

Need support in text mining. Contact BNOSAC: http://www.bnosac.be