Merge pull request #65 from bact/add-license-spdx
Add SPDX header and CITATION
bact authored Nov 9, 2024
2 parents 0225e5c + fbc02c4 commit 9b3ff64
Showing 16 changed files with 147 additions and 46 deletions.
30 changes: 30 additions & 0 deletions CITATION.cff
@@ -0,0 +1,30 @@
cff-version: "1.2.0"
title: "nlpO3"
message: >-
  If you use this software, please cite it using these
  metadata.
type: software
authors:
  - family-names: Suntorntip
    given-names: Thanathip
repository-code: "https://github.com/PyThaiNLP/nlpo3/"
repository: "https://github.com/PyThaiNLP/nlpo3/"
url: "https://github.com/PyThaiNLP/nlpo3/"
abstract: "Thai natural language processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp."
keywords:
- "tokenizer"
- "tokenization"
- "Thai"
- "natural language processing"
- "NLP"
- "Rust"
- "Node.js"
- "Node"
- "Python"
- "text processing"
- "word segmentation"
- "Thai language"
- "Thai NLP"
license: Apache-2.0
version: v1.3.2
date-released: "2023-04-14"
25 changes: 18 additions & 7 deletions README.md
@@ -1,15 +1,19 @@
---
SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0
---

# nlpO3

Thai Natural Language Processing library in Rust,
Thai natural language processing library in Rust,
with Python and Node bindings. Formerly oxidized-thainlp.

## Features

- Thai word tokenizer
- use maximal-matching dictionary-based tokenization algorithm and honor Thai Character Cluster boundaries
- Use maximal-matching dictionary-based tokenization algorithm and honor Thai Character Cluster boundaries
- [2.5x faster](https://github.com/PyThaiNLP/nlpo3/blob/main/nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb) than similar pure Python implementation (PyThaiNLP's newmm)
- load a dictionary from a plain text file (one word per line) or from `Vec<String>`

- Load a dictionary from a plain text file (one word per line) or from `Vec<String>`

## Dictionary file

@@ -19,7 +23,6 @@ with Python and Node bindings. Formerly oxidized-thainlp.
- [words_th.tx](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt) from [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp/) - around 62,000 words (CC0)
- [word break dictionary](https://github.com/tlwg/libthai/tree/master/data) from [libthai](https://github.com/tlwg/libthai/) - consists of dictionaries in different categories, with make script (LGPL-2.1)


## Usage

### Command-line interface
@@ -31,6 +34,7 @@ echo "ฉันกินข้าว" | nlpo3 segment
```

### Bindings

- [Node.js](nlpo3-nodejs/)
- [Python](nlpo3-python/) <a href="https://pypi.python.org/pypi/nlpo3"><img alt="pypi" src="https://img.shields.io/pypi/v/nlpo3.svg"/></a>

@@ -42,6 +46,7 @@ segment("สวัสดีครับ", "dict_name")
```

### As Rust library

<a href="https://crates.io/crates/nlpo3/"><img alt="crates.io" src="https://img.shields.io/crates/v/nlpo3.svg"/></a>

In `Cargo.toml`:
@@ -54,6 +59,7 @@ nlpo3 = "1.3.2"

Create a tokenizer using a dictionary from file,
then use it to tokenize a string (safe mode = true, and parallel mode = false):

```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;
@@ -63,17 +69,20 @@ let tokens = tokenizer.segment("ห้องสมุดประชาชน",
```

Create a tokenizer using a dictionary from a vector of Strings:

```rust
let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
let tokenizer = NewmmTokenizer::from_word_list(words);
```

Add words to an existing tokenizer:

```rust
tokenizer.add_word(&["มิวเซียม"]);
```

Remove words from an existing tokenizer:

```rust
tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
```
@@ -87,27 +96,29 @@ tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
### Steps

Generic test:

```bash
cargo test
```

Build API document and open it to check:

```bash
cargo doc --open
```

Build (remove `--release` to keep debug information):

```bash
cargo build --release
```

Check `target/` for build artifacts.


## Development documents

- [Notes on custom string](src/NOTE_ON_STRING.md)

## Issues

Please report issues at https://github.com/PyThaiNLP/nlpo3/issues
Please report issues at <https://github.com/PyThaiNLP/nlpo3/issues>
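
Taken together, the Rust fragments in this README assemble into a short end-to-end program. A minimal sketch, assuming the `NewmmTokenizer::new(path)` constructor and the `segment(text, safe, parallel)` call shown in the fragments above, with a placeholder dictionary path:

```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

fn main() {
    // Placeholder path: any one-word-per-line dictionary file works,
    // such as the words_th.txt linked in the "Dictionary file" section.
    let tokenizer = NewmmTokenizer::new("path/to/words_th.txt");
    // safe mode = true, parallel mode = false, as in the README example.
    let tokens = tokenizer
        .segment("ห้องสมุดประชาชน", true, false)
        .expect("tokenization failed");
    println!("{:?}", tokens);
}
```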
19 changes: 12 additions & 7 deletions src/NOTE_ON_STRING.md
@@ -1,24 +1,29 @@
---
SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0
---

# Why Use Handroll Bytes Slice As "CustomString" Instead of Rust String?

Rust String (and &str) is actually a slice of valid UTF-8 bytes which is
Rust `String` (and `&str`) is actually a slice of valid UTF-8 bytes which is
variable-length. It has no way of accessing a random index UTF-8 "character"
with O(1) time complexity.
with O(1) time complexity.

This means any algorithm with operations based on "character" index position
will be horribly slow on Rust String.

Hence, "fixed_bytes_str" which is transformed from a slice of valid UTF-8
Hence, `fixed_bytes_str` which is transformed from a slice of valid UTF-8
bytes into a slice of 4-bytes length - padded left with 0.

Consequently, regular expressions must be padded with \x00 for each unicode
Consequently, regular expressions must be padded with `\x00` for each Unicode
character to have 4 bytes.

Thai characters are 3-bytes length, so every Thai char in regex is padded
with \x00 one time.

For "space" in regex, it is padded with \x00\x00\x00.
with `\x00` one time.

For "space" in regex, it is padded with `\x00\x00\x00`.

## References

- [Rust String indexing and internal representation](https://doc.rust-lang.org/book/ch08-02-strings.html#indexing-into-strings)
- Read more about [UTF-8](https://en.wikipedia.org/wiki/UTF-8) at Wikipedia.
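
The padding scheme this note describes fits in a few lines of Rust. A standalone sketch (the helper name `to_four_byte_padded` is hypothetical, not part of nlpO3's API):

```rust
/// Illustrative helper: every UTF-8 character becomes exactly
/// 4 bytes, left-padded with 0x00.
fn to_four_byte_padded(s: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(s.chars().count() * 4);
    for ch in s.chars() {
        let mut cell = [0u8; 4]; // zero left-padding
        let len = ch.len_utf8();
        ch.encode_utf8(&mut cell[4 - len..]);
        out.extend_from_slice(&cell);
    }
    out
}

fn main() {
    // "ก" is 3 bytes in UTF-8 (one 0x00 pad); a space is 1 byte (three pads).
    let padded = to_four_byte_padded("ก ");
    assert_eq!(padded, [0x00, 0xE0, 0xB8, 0x81, 0x00, 0x00, 0x00, b' ']);
    // Character i now lives at byte offset i * 4, giving O(1) indexing.
    println!("{:02x?}", padded);
}
```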
3 changes: 3 additions & 0 deletions src/four_bytes_str.rs
@@ -1,2 +1,5 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

pub mod custom_regex;
pub mod custom_string;
12 changes: 9 additions & 3 deletions src/four_bytes_str/custom_regex.rs
@@ -1,7 +1,13 @@
// This is a result of an attempt to create a formatter
// which translates normal, human readable thai regex
// into 4-bytes zero-left-pad bytes regex pattern string
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

/**
* Regex for a custom four-byte string.
*
* This is a result of an attempt to create a formatter
* which translates normal, human readable thai regex
* into 4-bytes zero-left-pad bytes regex pattern string
*/
use anyhow::{Error as AnyError, Result};
use regex_syntax::{
hir::{Anchor, Class, Group, Literal as LiteralEnum, Repetition},
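
To make the header comment concrete: in the 4-byte representation, a pattern for the Thai character "เ" (3 bytes in UTF-8) must also match its single pad byte. A minimal sketch using the `regex` crate this module already imports (illustrative only; nlpO3 generates such patterns through `regex_pattern_to_custom_pattern`):

```rust
use regex::bytes::Regex;

fn main() {
    // "เ" (U+0E40) is E0 B9 80 in UTF-8, so its 4-byte cell carries one
    // 0x00 pad, and the pattern spells the pad out. (?-u) makes the
    // pattern match raw bytes rather than Unicode codepoints.
    let re = Regex::new(r"(?-u)\x00\xE0\xB9\x80").unwrap();
    let cell: &[u8] = &[0x00, 0xE0, 0xB9, 0x80];
    assert!(re.is_match(cell));
}
```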
9 changes: 7 additions & 2 deletions src/four_bytes_str/custom_string.rs
@@ -1,5 +1,10 @@
/// Functions dealing with a custom four-byte string.
/// For more details, see src/NOTE_ON_STRING.md
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

/**
* Functions dealing with a custom four-byte string.
* For more details, see src/NOTE_ON_STRING.md
*/
use std::{
error::{self, Error},
fmt::Display,
3 changes: 3 additions & 0 deletions src/lib.rs
@@ -1,2 +1,5 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

mod four_bytes_str;
pub mod tokenizer;
3 changes: 3 additions & 0 deletions src/tokenizer.rs
@@ -1,3 +1,6 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

mod dict_reader;
pub mod newmm;
pub(crate) mod tcc;
6 changes: 6 additions & 0 deletions src/tokenizer/dict_reader.rs
@@ -1,3 +1,9 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

/**
* Dictionary reader.
*/
use crate::four_bytes_str::custom_string::CustomString;

use super::trie_char::TrieChar as Trie;
27 changes: 15 additions & 12 deletions src/tokenizer/newmm.rs
@@ -1,15 +1,18 @@
/**
Dictionary-based maximal matching word segmentation, constrained with
Thai Character Cluster (TCC) boundaries.
The code is based on the notebooks created by Korakot Chaovavanich,
with heuristic graph size limit added to avoid exponential wait time.
:See Also:
* \
https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/newmm.py
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

Rust implementation: ["Thanathip Suntorntip"]
/**
* Dictionary-based maximal matching word segmentation, constrained with
* Thai Character Cluster (TCC) boundaries.
*
* The code is based on the notebooks created by Korakot Chaovavanich,
* with heuristic graph size limit added to avoid exponential wait time.
*
* :See Also:
* * \
* https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/newmm.py
*
* Rust implementation: ["Thanathip Suntorntip"]
*/
use std::{collections::VecDeque, error::Error, fmt::Display, path::PathBuf};

@@ -167,7 +170,7 @@ impl NewmmTokenizer {

fn one_cut<'a>(
input: &'a CustomString,
custom_dict: & Trie,
custom_dict: &Trie,
) -> AnyResult<Vec<&'a CustomStringBytesSlice>> {
let text = input;
let input_char_len = text.chars_len();
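
As background for the header above: "maximal matching" picks, among all dictionary-consistent segmentations, one with the fewest words. A toy dynamic-programming sketch of that idea (it omits the TCC-boundary constraint and the graph-size limit the real module adds):

```rust
use std::collections::HashSet;

// Toy maximal matching: fewest dictionary words covering the text.
fn segment(text: &[char], dict: &HashSet<String>, max_len: usize) -> Option<Vec<String>> {
    let n = text.len();
    // best[i] = Some((word count, previous cut)) for text[..i]
    let mut best: Vec<Option<(usize, usize)>> = vec![None; n + 1];
    best[0] = Some((0, 0));
    for i in 1..=n {
        for j in i.saturating_sub(max_len)..i {
            if let Some((count, _)) = best[j] {
                let word: String = text[j..i].iter().collect();
                if dict.contains(&word) && best[i].map_or(true, |(c, _)| count + 1 < c) {
                    best[i] = Some((count + 1, j));
                }
            }
        }
    }
    // Walk the recorded cut points back to recover the tokens.
    let (mut cuts, mut i) = (vec![n], n);
    while i > 0 {
        let (_, j) = best[i]?; // None: no segmentation exists
        cuts.push(j);
        i = j;
    }
    cuts.reverse();
    Some(cuts.windows(2).map(|w| text[w[0]..w[1]].iter().collect::<String>()).collect())
}

fn main() {
    let dict: HashSet<String> = ["ตา", "กลม", "ตาก", "ลม"].iter().map(|s| s.to_string()).collect();
    let text: Vec<char> = "ตากลม".chars().collect();
    // "ตากลม" splits as ตา|กลม or ตาก|ลม; both use two words.
    println!("{:?}", segment(&text, &dict, 3));
}
```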
5 changes: 4 additions & 1 deletion src/tokenizer/tcc.rs
@@ -1,2 +1,5 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

pub(crate) mod tcc_rules;
pub(crate) mod tcc_tokenizer;
pub(crate) mod tcc_rules;
8 changes: 7 additions & 1 deletion src/tokenizer/tcc/tcc_rules.rs
@@ -1,3 +1,9 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

/**
* Rules for TCC (Thai Character Cluster) tokenization.
*/
use crate::four_bytes_str::custom_regex::regex_pattern_to_custom_pattern;
use lazy_static::lazy_static;
use regex::bytes::Regex;
@@ -132,7 +138,7 @@ fn tcc_regex_test_cases() {
let case_20 = replace_tcc_symbol("^แccc์");
let case_21 = replace_tcc_symbol("^โctะ");
let case_22 = replace_tcc_symbol("^[เ-ไ]ct");

// This is the only Karan case.
assert_eq!(
regex_pattern_to_custom_pattern(&case_1).unwrap(),
7 changes: 6 additions & 1 deletion src/tokenizer/tcc/tcc_tokenizer.rs
@@ -1,3 +1,9 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

/**
* TCC (Thai Character Cluster) tokenizer.
*/
use super::tcc_rules::{LOOKAHEAD_TCC, NON_LOOKAHEAD_TCC};

use crate::four_bytes_str::custom_string::{
@@ -17,7 +23,6 @@ Credits:
* Rust Code Translation: Thanathip Suntorntip
*/


/// Returns a set of "character" indice at the end of each token
pub fn tcc_pos(custom_text_type: &CustomStringBytesSlice) -> HashSet<usize> {
let mut set: HashSet<usize> = HashSet::default();
3 changes: 3 additions & 0 deletions src/tokenizer/tokenizer_trait.rs
@@ -1,3 +1,6 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

use anyhow::Result as AnyResult;

pub trait Tokenizer {
27 changes: 15 additions & 12 deletions src/tokenizer/trie_char.rs
@@ -1,15 +1,18 @@
///This module is meant to be a direct implementation of Dict Trie in PythaiNLP.
///
///Many functions are implemented as a recursive function because of the limits imposed by
///Rust Borrow Checker and this author's (Thanathip) little experience.
///
///Rust Code: Thanathip Suntorntip (Gorlph)
///
/// For basic information of trie, visit this wikipedia page https://en.wikipedia.org/wiki/Trie



// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

/**
* This module is meant to be a direct implementation of Dict Trie in PyThaiNLP.
*
* Many functions are implemented as a recursive function
* because of the limits imposed by Rust Borrow Checker and
* this author's (Thanathip) little experience.
*
* Rust Code: Thanathip Suntorntip (Gorlph)
*
* For basic information of trie, visit this wikipedia page
* https://en.wikipedia.org/wiki/Trie
*/
use crate::four_bytes_str::custom_string::{
CustomString, CustomStringBytesSlice, CustomStringBytesVec, FixedCharsLengthByteSlice,
};
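
For readers following the module header: the shape of a dictionary trie, and the prefix query a maximal-matching tokenizer runs at every text position, look roughly like this. A self-contained sketch over Rust `char`s (nlpO3's `TrieChar` works on its custom 4-byte characters instead):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct TrieNode {
    children: HashMap<char, TrieNode>,
    is_word: bool,
}

impl TrieNode {
    fn insert(&mut self, word: &str) {
        let mut node = self;
        for ch in word.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.is_word = true;
    }

    // Lengths (in chars) of all dictionary words that are prefixes of
    // `text`; this is the query asked at every position during segmentation.
    fn prefix_lengths(&self, text: &str) -> Vec<usize> {
        let (mut node, mut lengths) = (self, Vec::new());
        for (i, ch) in text.chars().enumerate() {
            match node.children.get(&ch) {
                Some(next) => {
                    node = next;
                    if node.is_word {
                        lengths.push(i + 1);
                    }
                }
                None => break,
            }
        }
        lengths
    }
}

fn main() {
    let mut trie = TrieNode::default();
    trie.insert("ตา");
    trie.insert("ตาก");
    println!("{:?}", trie.prefix_lengths("ตากลม")); // [2, 3]
}
```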
6 changes: 6 additions & 0 deletions tests/test_tokenizer.rs
@@ -1,3 +1,9 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

/**
* Test the NewmmTokenizer with the default dictionary.
*/
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

