Commit

Merge pull request #66 from bact/add-toc
Add TOC to README
bact authored Nov 9, 2024
2 parents 58d5a9b + ee7e308 commit 3a96afa
Showing 1 changed file with 103 additions and 20 deletions.
123 changes: 103 additions & 20 deletions README.md
@@ -8,35 +8,69 @@ SPDX-License-Identifier: Apache-2.0
Thai natural language processing library in Rust,
with Python and Node bindings. Formerly oxidized-thainlp.

## Table of contents

- [Features](#features)
- [Dictionary file](#dictionary-file)
- [Usage](#usage)
- [Python binding](#python-binding)
- [Node.js binding](#nodejs-binding)
- [Rust library](#rust-library)
- [Command-line interface](#command-line-interface)
- [Build](#build)
- [Development](#development)
- [License](#license)

## Features

- Thai word tokenizer
- Use a maximal-matching, dictionary-based tokenization algorithm and honor Thai Character Cluster boundaries
- [2.5x faster](https://github.com/PyThaiNLP/nlpo3/blob/main/nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb) than a similar pure Python implementation (PyThaiNLP's newmm)
- Load a dictionary from a plain text file (one word per line) or from `Vec<String>`
- Use a maximal-matching, dictionary-based tokenization algorithm
and honor [Thai Character Cluster][tcc] boundaries
- [2.5x faster][benchmark]
than a similar pure Python implementation (PyThaiNLP's newmm)
- Load a dictionary from a plain text file (one word per line)
or from `Vec<String>`

[tcc]: https://dl.acm.org/doi/10.1145/355214.355225
[benchmark]: https://github.com/PyThaiNLP/nlpo3/blob/main/nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb
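
To illustrate the maximal-matching idea named in the feature list, here is a minimal Python sketch that picks the segmentation with the fewest tokens. It is not nlpO3's implementation (which is written in Rust): the function name `maximal_matching` is hypothetical, the sketch ignores Thai Character Cluster boundaries, and it simply treats any character not covered by a dictionary word as a one-character token.

```python
def maximal_matching(text, dictionary):
    """Split text into the fewest tokens, preferring dictionary words.

    Simplified illustration only: no Thai Character Cluster handling.
    """
    n = len(text)
    max_len = max((len(word) for word in dictionary), default=1)
    count = [0] * (n + 1)  # count[i]: fewest tokens covering text[:i]
    back = [0] * (n + 1)   # back[i]: start index of the last token in text[:i]
    for end in range(1, n + 1):
        # Fallback: the character just before `end` is an unknown
        # one-character token.
        count[end] = count[end - 1] + 1
        back[end] = end - 1
        # Try every dictionary word of length >= 2 that ends at `end`.
        for start in range(max(0, end - max_len), end - 1):
            if text[start:end] in dictionary and count[start] + 1 < count[end]:
                count[end] = count[start] + 1
                back[end] = start
    # Follow the back-pointers from the end of the text to recover tokens.
    tokens = []
    end = n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]


words = {"ฉัน", "กิน", "ข้าว"}
print(maximal_matching("ฉันกินข้าว", words))  # ['ฉัน', 'กิน', 'ข้าว']
```

Unlike a greedy longest-match scan, this search can prefer a shorter first word when that leads to fewer tokens overall: with the toy dictionary `{"abc", "ab", "cde", "d"}`, the input `"abcde"` is split into `ab` and `cde` rather than `abc`, `d`, `e`.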

## Dictionary file

- In the interest of library size, nlpO3 does not assume which dictionary the developer would like to use.
It does not come with a dictionary. A dictionary is needed for the dictionary-based word tokenizer.
- In the interest of library size, nlpO3 does not assume which dictionary the
user would like to use, and it does not come with a dictionary.
- A dictionary is needed for the dictionary-based word tokenizer.
- For a tokenization dictionary, try
- [words_th.txt](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt) from [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp/) - around 62,000 words (CC0)
- [word break dictionary](https://github.com/tlwg/libthai/tree/master/data) from [libthai](https://github.com/tlwg/libthai/) - consists of dictionaries in different categories, with make script (LGPL-2.1)
- [words_th.txt][dict-pythainlp] from [PyThaiNLP][pythainlp]
- ~62,000 words
- CC0-1.0
- [word break dictionary][dict-libthai] from [libthai][libthai]
- consists of dictionaries in different categories, with a make script
- LGPL-2.1

[pythainlp]: https://github.com/PyThaiNLP/pythainlp
[libthai]: https://github.com/tlwg/libthai/
[dict-pythainlp]: https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt
[dict-libthai]: https://github.com/tlwg/libthai/tree/master/data
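
As a rough illustration of the one-word-per-line dictionary format, the sketch below parses such text into a set of words in Python. The function name `parse_dictionary` is hypothetical, and stripping surrounding whitespace and skipping blank lines are assumptions of this sketch, not documented nlpO3 behavior.

```python
def parse_dictionary(text):
    """Return the set of words found in one-word-per-line dictionary text."""
    words = set()
    for line in text.splitlines():
        word = line.strip()  # assumption: surrounding whitespace is not part of a word
        if word:             # assumption: blank lines are ignored
            words.add(word)
    return words


sample = "ฉัน\nกิน\n\nข้าว\n"
words = parse_dictionary(sample)
print(len(words))  # 3
```

For a real dictionary file, the same function can be fed the file's contents read as UTF-8 text.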

## Usage

### Command-line interface
### Node.js binding

- [nlpo3-cli](nlpo3-cli/) <a href="https://crates.io/crates/nlpo3-cli/"><img alt="crates.io" src="https://img.shields.io/crates/v/nlpo3-cli.svg"/></a>
#### Source code

```bash
echo "ฉันกินข้าว" | nlpo3 segment
```
See the [nlpo3-nodejs](./nlpo3-nodejs/) directory.

### Python binding

[![PyPI](https://img.shields.io/pypi/v/nlpo3.svg "PyPI")](https://pypi.python.org/pypi/nlpo3)

### Bindings
#### Install

```shell
pip install nlpo3
```

- [Node.js](nlpo3-nodejs/)
- [Python](nlpo3-python/) <a href="https://pypi.python.org/pypi/nlpo3"><img alt="pypi" src="https://img.shields.io/pypi/v/nlpo3.svg"/></a>
#### Example

```python
from nlpo3 import load_dict, segment
@@ -45,9 +79,19 @@ load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")
```

### As Rust library
#### Source code

<a href="https://crates.io/crates/nlpo3/"><img alt="crates.io" src="https://img.shields.io/crates/v/nlpo3.svg"/></a>
See the [nlpo3-python](./nlpo3-python/) directory.

### Rust library

[![crates.io](https://img.shields.io/crates/v/nlpo3.svg "crates.io")](https://crates.io/crates/nlpo3/)

#### Install

```shell
cargo add nlpo3
```

In `Cargo.toml`:

@@ -57,6 +101,8 @@ In `Cargo.toml`:
nlpo3 = "1.3.2"
```

#### Example

Create a tokenizer using a dictionary from file,
then use it to tokenize a string (safe mode = true, and parallel mode = false):

@@ -87,6 +133,36 @@ Remove words from an existing tokenizer:
tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
```

#### Source code

See the [root](/) directory.

### Command-line interface

[![Crates](https://img.shields.io/crates/v/nlpo3-cli.svg "Crates")](https://crates.io/crates/nlpo3-cli/)

#### Install

```shell
cargo install nlpo3-cli
```

#### Usage

```shell
nlpo3 help
```

#### Example

```bash
echo "ฉันกินข้าว" | nlpo3 segment
```

#### Source code

See the [nlpo3-cli](./nlpo3-cli/) directory.

## Build

### Requirements
@@ -115,10 +191,17 @@ cargo build --release

Check `target/` for build artifacts.

## Development documents
## Development

Development document:

- [Notes on custom string](src/NOTE_ON_STRING.md)

## Issues
Issues:

- Please report issues at <https://github.com/PyThaiNLP/nlpo3/issues>

## License

Please report issues at <https://github.com/PyThaiNLP/nlpo3/issues>
nlpO3 is copyrighted by its authors and licensed under the terms of the Apache
License 2.0 (Apache-2.0); see the [LICENSE](./LICENSE) file for details.
