From ee7e308901265927cf4077e2db6c5d282b3ab0c2 Mon Sep 17 00:00:00 2001
From: Arthit Suriyawongkul
Date: Sat, 9 Nov 2024 18:51:28 +0000
Subject: [PATCH] Add TOC to README

---
 README.md | 123 +++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 103 insertions(+), 20 deletions(-)

diff --git a/README.md b/README.md
index 131c6f6..071f545 100644
--- a/README.md
+++ b/README.md
@@ -8,35 +8,69 @@ SPDX-License-Identifier: Apache-2.0
 Thai natural language processing library in Rust,
 with Python and Node bindings. Formerly oxidized-thainlp.
 
+## Table of contents
+
+- [Features](#features)
+- [Dictionary file](#dictionary-file)
+- [Usage](#usage)
+  - [Python binding](#python-binding)
+  - [Node.js binding](#nodejs-binding)
+  - [Rust library](#rust-library)
+  - [Command-line interface](#command-line-interface)
+- [Build](#build)
+- [Development](#development)
+- [License](#license)
+
 ## Features
 
 - Thai word tokenizer
-  - Use maximal-matching dictionary-based tokenization algorithm and honor Thai Character Cluster boundaries
-  - [2.5x faster](https://github.com/PyThaiNLP/nlpo3/blob/main/nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb) than similar pure Python implementation (PyThaiNLP's newmm)
-  - Load a dictionary from a plain text file (one word per line) or from `Vec`
+  - Use maximal-matching dictionary-based tokenization algorithm
+    and honor [Thai Character Cluster][tcc] boundaries
+  - [2.5x faster][benchmark]
+    than similar pure Python implementation (PyThaiNLP's newmm)
+  - Load a dictionary from a plain text file (one word per line)
+    or from `Vec`
+
+[tcc]: https://dl.acm.org/doi/10.1145/355214.355225
+[benchmark]: https://github.com/PyThaiNLP/nlpo3/blob/main/nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb
 
 ## Dictionary file
 
-- For the interest of library size, nlpO3 does not assume what dictionary the developer would like to use.
-  It does not come with a dictionary. A dictionary is needed for the dictionary-based word tokenizer.
+- For the interest of library size, nlpO3 does not assume what dictionary the
+  user would like to use, and it does not come with a dictionary.
+- A dictionary is needed for the dictionary-based word tokenizer.
 - For tokenization dictionary, try
-  - [words_th.tx](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt) from [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp/) - around 62,000 words (CC0)
-  - [word break dictionary](https://github.com/tlwg/libthai/tree/master/data) from [libthai](https://github.com/tlwg/libthai/) - consists of dictionaries in different categories, with make script (LGPL-2.1)
+  - [words_th.tx][dict-pythainlp] from [PyThaiNLP][pythainlp]
+    - ~62,000 words
+    - CC0-1.0
+  - [word break dictionary][dict-libthai] from [libthai][libthai]
+    - consists of dictionaries in different categories, with a make script
+    - LGPL-2.1
+
+[pythainlp]: https://github.com/PyThaiNLP/pythainlp
+[libthai]: https://github.com/tlwg/libthai/
+[dict-pythainlp]: https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt
+[dict-libthai]: https://github.com/tlwg/libthai/tree/master/data
 
 ## Usage
 
-### Command-line interface
+### Node.js binding
 
-- [nlpo3-cli](nlpo3-cli/) crates.io
+#### Source code
 
-```bash
-echo "ฉันกินข้าว" | nlpo3 segment
-```
+See [nlpo3-nodejs](./nlpo3-nodejs/) directory.
+
+### Python binding
+
+[![PyPI](https://img.shields.io/pypi/v/nlpo3.svg "PyPI")](https://pypi.python.org/pypi/nlpo3)
 
-### Bindings
+#### Install
+
+```shell
+pip install nlpo3
+```
 
-- [Node.js](nlpo3-nodejs/)
-- [Python](nlpo3-python/) pypi
+#### Example
 
 ```python
 from nlpo3 import load_dict, segment
@@ -45,9 +79,19 @@ load_dict("path/to/dict.file", "dict_name")
 segment("สวัสดีครับ", "dict_name")
 ```
 
-### As Rust library
+#### Source code
 
-crates.io
+See [nlpo3-python](./nlpo3-python/) directory.
+
+### Rust library
+
+[![crates.io](https://img.shields.io/crates/v/nlpo3.svg "crates.io")](https://crates.io/crates/nlpo3/)
+
+#### Install
+
+```shell
+cargo install nlpo3
+```
 
 In `Cargo.toml`:
 
@@ -57,6 +101,8 @@ In `Cargo.toml`:
 nlpo3 = "1.3.2"
 ```
 
+#### Example
+
 Create a tokenizer using a dictionary from file,
 then use it to tokenize a string (safe mode = true, and parallel mode = false):
 
@@ -87,6 +133,36 @@ Remove words from an existing tokenizer:
 tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
 ```
 
+#### Source code
+
+See the [root](/) directory.
+
+### Command-line interface
+
+[![Crates](https://img.shields.io/crates/v/nlpo3-cli.svg "Crates")](https://crates.io/crates/nlpo3-cli/)
+
+#### Install
+
+```shell
+cargo install nlpo3-cli
+```
+
+#### Usage
+
+```shell
+nlpo3 help
+```
+
+#### Example
+
+```bash
+echo "ฉันกินข้าว" | nlpo3 segment
+```
+
+#### Source code
+
+See [nlpo3-cli](./nlpo3-cli/) directory.
+
 ## Build
 
 ### Requirements
@@ -115,10 +191,17 @@ cargo build --release
 
 Check `target/` for build artifacts.
 
-## Development documents
+## Development
+
+Development document:
 
 - [Notes on custom string](src/NOTE_ON_STRING.md)
 
-## Issues
+Issues:
+
+- Please report issues at
+
+## License
 
-Please report issues at
+nlpO3 is copyrighted by its authors and licensed under terms of the Apache
+Software License 2.0 (Apache-2.0) - see file [LICENSE](./LICENSE) for details.