Merge pull request #65 from bact/add-license-spdx
Add SPDX header and CITATION
Showing 16 changed files with 147 additions and 46 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
cff-version: "1.2.0" | ||
title: "nlpO3" | ||
message: >- | ||
If you use this software, please cite it using these | ||
metadata. | ||
type: software | ||
authors: | ||
- family-names: Suntorntip | ||
given-names: Thanathip | ||
repository-code: "https://github.com/PyThaiNLP/nlpo3/" | ||
repository: "https://github.com/PyThaiNLP/nlpo3/" | ||
url: "https://github.com/PyThaiNLP/nlpo3/" | ||
abstract: "Thai natural language processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp." | ||
keywords: | ||
- "tokenizer" | ||
- "tokenization" | ||
- "Thai" | ||
- "natural language processing" | ||
- "NLP" | ||
- "Rust" | ||
- "Node.js" | ||
- "Node" | ||
- "Python" | ||
- "text processing" | ||
- "word segmentation" | ||
- "Thai language" | ||
- "Thai NLP" | ||
license: Apache-2.0 | ||
version: v1.3.2 | ||
date-released: "2023-04-14" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,24 +1,29 @@ | ||
--- | ||
SPDX-FileCopyrightText: 2024 PyThaiNLP Project | ||
SPDX-License-Identifier: Apache-2.0 | ||
--- | ||
|
||
# Why Use Handroll Bytes Slice As "CustomString" Instead of Rust String? | ||
|
||
Rust String (and &str) is actually a slice of valid UTF-8 bytes which is | ||
Rust `String` (and `&str`) is actually a slice of valid UTF-8 bytes which is | ||
variable-length. It has no way of accessing a random index UTF-8 "character" | ||
with O(1) time complexity. | ||
with O(1) time complexity. | ||
|
||
This means any algorithm with operations based on "character" index position | ||
will be horribly slow on Rust String. | ||
|
||
Hence, "fixed_bytes_str" which is transformed from a slice of valid UTF-8 | ||
Hence, `fixed_bytes_str` which is transformed from a slice of valid UTF-8 | ||
bytes into a slice of 4-bytes length - padded left with 0. | ||
|
||
Consequently, regular expressions must be padded with \x00 for each unicode | ||
Consequently, regular expressions must be padded with `\x00` for each Unicode | ||
character to have 4 bytes. | ||
|
||
Thai characters are 3-bytes length, so every Thai char in regex is padded | ||
with \x00 one time. | ||
|
||
For "space" in regex, it is padded with \x00\x00\x00. | ||
with `\x00` one time. | ||
|
||
For "space" in regex, it is padded with `\x00\x00\x00`. | ||
|
||
## References | ||
|
||
- [Rust String indexing and internal representation](https://doc.rust-lang.org/book/ch08-02-strings.html#indexing-into-strings) | ||
- Read more about [UTF-8](https://en.wikipedia.org/wiki/UTF-8) at Wikipedia. |
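
The fixed-width layout described in the note above can be illustrated with a minimal sketch. It assumes each character is stored as its UTF-8 bytes, right-aligned in a 4-byte cell and left-padded with `0x00`, as the note states. The names `BYTES_PER_CHAR`, `to_four_bytes`, `char_at`, and `padded_regex_literal` are hypothetical and chosen for illustration only; they are not taken from the nlpO3 source.

```rust
/// Width of every encoded character, in bytes (assumed from the note above).
const BYTES_PER_CHAR: usize = 4;

/// Encode `text` so that every character occupies exactly four bytes:
/// its UTF-8 bytes, right-aligned, with 0x00 padding on the left.
fn to_four_bytes(text: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(text.chars().count() * BYTES_PER_CHAR);
    let mut utf8 = [0u8; 4];
    for ch in text.chars() {
        let encoded = ch.encode_utf8(&mut utf8).as_bytes();
        // Left-pad with zeros up to four bytes, then append the UTF-8 bytes.
        out.extend(std::iter::repeat(0u8).take(BYTES_PER_CHAR - encoded.len()));
        out.extend_from_slice(encoded);
    }
    out
}

/// Return the `index`-th character in O(1) by slicing at a fixed offset.
fn char_at(buf: &[u8], index: usize) -> Option<char> {
    let start = index.checked_mul(BYTES_PER_CHAR)?;
    let end = start.checked_add(BYTES_PER_CHAR)?;
    let cell = buf.get(start..end)?;
    // Skip the zero padding, then decode the remaining UTF-8 bytes.
    // A cell of four zeros is the NUL character itself.
    let first = cell
        .iter()
        .position(|&b| b != 0)
        .unwrap_or(BYTES_PER_CHAR - 1);
    std::str::from_utf8(&cell[first..]).ok()?.chars().next()
}

/// Spell out one character as a byte-level regex literal with `\x00` padding
/// (hypothetical helper; shows how many pad bytes each character needs).
fn padded_regex_literal(ch: char) -> String {
    let mut utf8 = [0u8; 4];
    let cell = to_four_bytes(ch.encode_utf8(&mut utf8));
    cell.iter().map(|b| format!("\\x{:02X}", b)).collect()
}
```

Under these assumptions, `to_four_bytes("ก a")` yields 12 bytes, `char_at(&buf, 2)` jumps straight to byte offset 8 without scanning, `padded_regex_literal('ก')` carries one `\x00` pad byte, and `padded_regex_literal(' ')` carries three. The trade-off is memory: ASCII text grows fourfold in exchange for constant-time indexing.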
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,5 @@ | ||
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project | ||
// SPDX-License-Identifier: Apache-2.0 | ||
|
||
pub mod custom_regex; | ||
pub mod custom_string; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,5 @@ | ||
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project | ||
// SPDX-License-Identifier: Apache-2.0 | ||
|
||
mod four_bytes_str; | ||
pub mod tokenizer; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,5 @@ | ||
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project | ||
// SPDX-License-Identifier: Apache-2.0 | ||
|
||
pub(crate) mod tcc_rules; | ||
pub(crate) mod tcc_tokenizer; | ||
pub(crate) mod tcc_rules; |