This repository contains Lucene tools (analyzers, tokenizers and filters) for the Tibetan language. They are built on the standard Lucene analysis framework.
Content summary:
- a converter from the EWTS, DTS or ALALC transliteration schemes to Tibetan Unicode
- a filter to normalize Tibetan Unicode characters
- a filter to remove obvious affixed particles
- a stopword filter
- a syllable-based tokenizer
- a maxmatch-based word tokenizer that:
  - can lemmatize (removing the ambiguous affixes ར and ས)
  - uses user-defined word lists
To use the tools in a Maven project, add the following dependency:

```xml
<dependency>
    <groupId>io.bdrc.lucene</groupId>
    <artifactId>lucene-bo</artifactId>
    <version>1.5.0</version>
</dependency>
```
If the jar is needed for a non-Maven install, it can be downloaded from
https://repo1.maven.org/maven2/io/bdrc/lucene/lucene-bo/1.5.0/lucene-bo-1.5.0.jar
First, make sure the submodule is initialized: run `git submodule init`, then `git submodule update` from the root of the repo.
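The two commands can also be combined into a single step:

```sh
git submodule update --init
```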
The base command line to build a jar is:

```sh
mvn clean compile exec:java package
```

The following options alter the packaging:

- `-DincludeDeps=true` includes `io.bdrc.lucene:stemmer` and `io.bdrc.ewtsconverter:ewts-converter` in the produced jar file
- `-DperformRelease=true` signs the jar file with gpg
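For example, building a self-contained jar that bundles both dependencies should just be a matter of appending the first option to the base command:

```sh
mvn clean compile exec:java package -DincludeDeps=true
```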
TibetanAnalyzer is the main analyzer. It tokenizes the input text using TibSyllableTokenizer, then applies TibAffixedFilter and a StopFilter with a predefined list of stop words.
There are two constructors: the nullary constructor and

```java
TibetanAnalyzer(boolean segmentInWords, boolean lemmatize, boolean filterChars, String inputMode, String stopFilename)
```
- `segmentInWords` - whether segmentation is on words instead of syllables
- `lemmatize` - in syllable mode, removes affixed particles and normalizes ba/bo into pa/po; in word mode, uses lemmas
- `filterChars` - whether the text should be converted to NFD (necessary for texts containing NFC strings)
- `inputMode` - "unicode" (default), "ewts", "dts" (Diacritics Transliteration Schema) or "alalc" ([ALA-LC](https://www.loc.gov/catdir/cpso/romanization/tibetan.pdf))
- `stopFilename` - file name of the stop word list (defaults to the empty string for the shipped one; set to null for no stop words)
The nullary constructor is equivalent to `TibetanAnalyzer(true, true, true, "unicode", "")`.
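As a minimal usage sketch, assuming the five-argument constructor documented above and the standard Lucene `TokenStream` API (the package name `io.bdrc.lucene.bo` is inferred from the Maven coordinates, and the demo class and field name are arbitrary):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import io.bdrc.lucene.bo.TibetanAnalyzer;

public class TibetanAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        // word segmentation, lemmatization, NFD char filtering, Unicode input,
        // shipped stop-word list (see the parameter list above)
        Analyzer analyzer = new TibetanAnalyzer(true, true, true, "unicode", "");
        try (TokenStream ts = analyzer.tokenStream("field", "སེམས་ཅན་གྱི་དོན་གྲུབ་པར་ཤོག")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // one token per line
            }
            ts.end();
        }
        analyzer.close();
    }
}
```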
The word tokenizer produces words through a Maximal Matching algorithm. It builds on top of the Trie implementation from `io.bdrc.lucene:stemmer` (one of the dependencies listed above).
Due to its design, this tokenizer doesn't deal with contextual ambiguities.
For example, if both དོན and དོན་གྲུབ exist in the Trie, དོན་གྲུབ will be returned every time the sequence དོན + གྲུབ is found.
The sentence སེམས་ཅན་གྱི་དོན་གྲུབ་པར་ཤོག is thus tokenized into "སེམས་ཅན + གྱི + དོན་གྲུབ + པར + ཤོག", where "སེམས་ཅན + གྱི + དོན + གྲུབ་པར + ཤོག" would be expected.
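To make the limitation concrete, here is a greedy maximal-matching sketch over pre-split syllables. It is only an illustration of the algorithm, not the library's actual code (which works on a Trie inside a Lucene Tokenizer), and the lexicon below is hypothetical:

```java
import java.util.*;

public class MaxMatchSketch {
    // At each position, take the longest lexicon entry that matches,
    // falling back to a single syllable when nothing matches.
    static List<String> segment(List<String> syllables, Set<String> lexicon) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < syllables.size()) {
            int end = i + 1; // fallback: a single syllable
            for (int j = syllables.size(); j > i + 1; j--) {
                if (lexicon.contains(String.join("་", syllables.subList(i, j)))) {
                    end = j;
                    break;
                }
            }
            tokens.add(String.join("་", syllables.subList(i, end)));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> lexicon = new HashSet<>(Arrays.asList(
                "སེམས་ཅན", "གྱི", "དོན", "དོན་གྲུབ", "གྲུབ་པར", "ཤོག"));
        List<String> sentence = Arrays.asList("སེམས", "ཅན", "གྱི", "དོན", "གྲུབ", "པར", "ཤོག");
        // Prints [སེམས་ཅན, གྱི, དོན་གྲུབ, པར, ཤོག]: the longest match always
        // wins, so དོན་གྲུབ absorbs གྲུབ and གྲུབ་པར can never be chosen.
        System.out.println(segment(sentence, lexicon));
    }
}
```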
TibSyllableTokenizer produces syllable tokens (with no tshek) from the input Tibetan text; for example, སེམས་ཅན་གྱི yields the tokens སེམས, ཅན and གྱི.
TibAffixedFilter removes non-ambiguous affixed particles (འི, འོ, འིའོ, འམ, འང and འིས), leaving the འ where necessary (e.g. དགའི -> དགའ, but གའི -> ག).
A separate filter normalizes བ and བོ into པ and པོ. Since it does not look into affixed particles, it should be applied after TibAffixedFilter.
To sign the jars before deploying, pass `-DperformRelease=true`; to include `ewts-converter` and `stemmer` in the built jar, pass `-DincludeDeps=true`.
The code is Copyright 2017 Buddhist Digital Resource Center, and is provided under Apache License 2.0.
You will need Java and Maven. Then run the following command from the root directory of the repo:

```sh
mvn clean compile exec:java package -Dmaven.test.skip=true
```

This was last done with Java 1.8.0_251 and Maven 3.6.3.