diff --git a/README.md b/README.md
index 79d88bc..76c4238 100644
--- a/README.md
+++ b/README.md
@@ -8,10 +8,10 @@
split-lang
-Splitting sentences by concatenating over-split substrings based on their language
+Splitting sentences by language by concatenating over-split substrings based on their detected language
powered by
-1. splitting: [`wtpsplit`](https://github.com/segment-any-text/wtpsplit) and [`budoux`](https://github.com/google/budoux)
-2. language detection: [`fast-langdetect`](https://github.com/LlmKira/fast-langdetect) and [`langdetect`](https://github.com/Mimino666/langdetect)
+1. splitting: [`budoux`](https://github.com/google/budoux) and rule-based splitting
+2. language detection: [`fast-langdetect`](https://github.com/LlmKira/fast-langdetect) and [`lingua-py`](https://github.com/pemistahl/lingua-py)
@@ -41,11 +41,17 @@ powered by
**Stage 1**: rule-based split using punctuation
- `hello, how are you` -> `hello` | `,` | `how are you`
-**Stage 2**: then, over-split text to substrings by `wtpsplit`
+**Stage 2**: then over-split the text into substrings using `budoux`, spaces (` `), and regex
- `你喜欢看アニメ吗` -> `你` | `喜欢` | `看` | `アニメ` | `吗`
+- `昨天見た映画はとても感動的でした` -> `昨天` | `見た` | `映画` | `はとても` | `感動的` | `でした`
+- `我朋友是日本人彼はとても優しいです` -> `我` | `朋友` | `是` | `日本人` | `彼は` | `とても` | `優しいです`
+- `how are you` -> `how ` | `are ` | `you`
**Stage 3**: concatenate substrings based on their languages using `fast-langdetect` and `langdetect`
- `你` | `喜欢` | `看` | `アニメ` | `吗` -> `你喜欢看` | `アニメ` | `吗`
+- `昨天` | `見た` | `映画` | `はとても` | `感動的` | `でした` -> `昨天` | `見た映画はとても感動的でした`
+- `我` | `朋友` | `是` | `日本人` | `彼は` | `とても` | `優しいです` -> `我朋友是日本人` | `彼はとても優しいです`
+- `how ` | `are ` | `you` -> `how are you`
# 2. Motivation
1. TTS (Text-To-Speech) model often fails on multi-language sentence, separate sentence based on language will bring better result
@@ -63,9 +69,9 @@ Vielen Dank merci beaucoup for your help.
- [3.1. Installation](#31-installation)
- [3.2. Basic](#32-basic)
- [3.2.1. `split_by_lang`](#321-split_by_lang)
+ - [3.2.2. `merge_across_digit`](#322-merge_across_digit)
- [3.3. Advanced](#33-advanced)
- - [3.3.1. `TextSplitter` and `threshold`](#331-textsplitter-and-threshold)
- - [3.3.2. usage of `lang_map` and `default_lang` (for better result)](#332-usage-of-lang_map-and-default_lang-for-better-result)
+ - [3.3.1. usage of `lang_map` and `default_lang` (for better result)](#331-usage-of-lang_map-and-default_lang-for-better-result)
- [4. Acknowledgement](#4-acknowledgement)
@@ -80,17 +86,18 @@ pip install split-lang
```
-
+****
## 3.2. Basic
### 3.2.1. `split_by_lang`
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DoodleBears/split-lang/blob/main/split-lang-demo.ipynb)
```python
-from split_lang import split_by_lang
-text = "你喜欢看アニメ吗我也喜欢看"
+from split_lang import LangSplitter
+lang_splitter = LangSplitter()
+text = "你喜欢看アニメ吗"
-substr = split_by_lang(
+substr = lang_splitter.split_by_lang(
text=text,
)
for index, item in enumerate(substr):
@@ -100,106 +107,78 @@ for index, item in enumerate(substr):
```
0|zh:你喜欢看
1|ja:アニメ
-2|zh:吗我也喜欢看
+2|zh:吗
```
```python
-from split_lang import split_by_lang
+from split_lang import LangSplitter
+lang_splitter = LangSplitter(merge_across_punctuation=True)
import time
texts = [
- "你喜欢看アニメ吗我也喜欢看",
+ "你喜欢看アニメ吗?我也喜欢看",
"Please star this project on GitHub, Thanks you. I love you请加星这个项目,谢谢你。我爱你この項目をスターしてください、ありがとうございます!愛してる",
]
time1 = time.time()
for text in texts:
- substr = split_by_lang(
+ substr = lang_splitter.split_by_lang(
text=text,
- threshold=4.9e-5,
- merge_across_punctuation=True,
)
for index, item in enumerate(substr):
print(f"{index}|{item.lang}:{item.text}")
print("----------------------")
time2 = time.time()
-
-for text in texts:
- substr = split_by_lang(
- text=text,
- threshold=4.9e-5,
- merge_across_punctuation=False,
- merge_across_digit=False,
- )
- for index, item in enumerate(substr):
- print(f"{index}|{item.lang}:{item.text}")
- print("----------------------")
-time3 = time.time()
-
print(time2 - time1)
-print(time3 - time2)
```
-
```
0|zh:你喜欢看
1|ja:アニメ
-2|zh:吗我也喜欢看
+2|zh:吗?我也喜欢看
----------------------
0|en:Please star this project on GitHub, Thanks you. I love you
1|zh:请加星这个项目,谢谢你。我爱你
2|ja:この項目をスターしてください、ありがとうございます!愛してる
----------------------
-0|zh:你喜欢看
-1|ja:アニメ
-2|zh:吗我也喜欢看
-----------------------
-0|en:Please star this project on GitHub
-1|punctuation:,
-2|en:Thanks you
-3|punctuation:.
-4|en:I love you
-5|zh:请加星这个项目
-6|punctuation:,
-7|zh:谢谢你
-8|punctuation:。
-9|zh:我爱你
-10|ja:この項目をスターしてください
-11|punctuation:、
-12|ja:ありがとうございます
-13|punctuation:!
-14|ja:愛してる
-----------------------
-0.15833711624145508
-0.1587212085723877
+0.007998466491699219
```
-## 3.3. Advanced
-
-### 3.3.1. `TextSplitter` and `threshold`
+### 3.2.2. `merge_across_digit`
-`TextSplitter` is a class which implement `split()` method to split the text after splitting with rule-based logic ([Idea-Stage 2](#1-idea)).
-
-By default, it using `WtP` model from `wtpsplit`. (since `WtP` is faster and more accurate in SHORT TEXT situation, switch to `SaT` model for long paragraph).
-
-the `threshold` is used for `WtP` and `SaT` models, default to `1e-4`, the smaller the more substring you will get in `wtpsplit` stage.
+```python
+lang_splitter.merge_across_digit = False
+texts = [
+ "衬衫的价格是9.15便士",
+]
+for text in texts:
+ substr = lang_splitter.split_by_lang(
+ text=text,
+ )
+ for index, item in enumerate(substr):
+ print(f"{index}|{item.lang}:{item.text}")
+```
-> [!NOTE]
-> Check GitHub Repo `tests/split_acc.py` to find best threshold for your use case
+```
+0|zh:衬衫的价格是
+1|digit:9.15
+2|zh:便士
+```
+## 3.3. Advanced
-### 3.3.2. usage of `lang_map` and `default_lang` (for better result)
+### 3.3.1. usage of `lang_map` and `default_lang` (for better result)
> [!IMPORTANT]
> Add lang code for your usecase if other languages are needed
- default `lang_map` looks like below
- - if `langdetect` or `fasttext` or any other language detector detect the language that is NOT included in `lang_map` will be set to `default_lang`
+  - if `lingua-py`, `fasttext`, or any other language detector detects a language that is NOT included in `lang_map`, the substring's language will be set to `default_lang`
- if you set `default_lang` or `value` of `key:value` in `lang_map` to `x`, this substring will be merged to the near substring
- `zh` | `x` | `jp` -> `zh` | `jp` (`x` been merged to one side)
- In example below, `zh-tw` is set to `x` because character in `zh` and `jp` sometimes been detected as Traditional Chinese
- default `default_lang` is `x`
```python
-LANG_MAP = {
+DEFAULT_LANG_MAP = {
"zh": "zh",
"yue": "zh", # 粤语
"wuu": "zh", # 吴语
@@ -210,12 +189,14 @@ LANG_MAP = {
"de": "de",
"fr": "fr",
"en": "en",
+ "hr": "en",
}
DEFAULT_LANG = "x"
+
```
# 4. Acknowledgement
- Inspired by [LlmKira/fast-langdetect](https://github.com/LlmKira/fast-langdetect)
-- Text segmentation depends on [segment-any-text/wtpsplit](https://github.com/segment-any-text/wtpsplit) and [google/budoux](https://github.com/google/budoux)
-- Language detection depends on [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect) and [Mimino666/langdetect](https://github.com/Mimino666/langdetect) (fix miss detecting Chinese as Korean in [DoodleBears/langdetect](https://github.com/DoodleBears/langdetect))
+- Text segmentation depends on [google/budoux](https://github.com/google/budoux)
+- Language detection depends on [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect) and [lingua-py](https://github.com/pemistahl/lingua-py)
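The Stage 3 description and the `lang_map` / `default_lang` notes in the README hunks above can be illustrated with a minimal sketch. This is not the package's implementation: `LANG_MAP` here is a small hypothetical subset of the mapping shown in the README, `merge_by_lang` is an illustrative helper name, and the choice to merge an `x` substring into its left neighbour (or into the right one when it comes first) is an assumption, since the README only says it is merged "to one side".

```python
from typing import Dict, List, Tuple

# Hypothetical subset of the README's default mapping (assumption, for illustration only).
LANG_MAP: Dict[str, str] = {
    "zh": "zh",
    "zh-cn": "zh",
    "zh-tw": "x",
    "ja": "ja",
    "en": "en",
}
DEFAULT_LANG = "x"


def normalize(lang: str) -> str:
    """Map a detector's language code through LANG_MAP, falling back to DEFAULT_LANG."""
    return LANG_MAP.get(lang, DEFAULT_LANG)


def merge_by_lang(pieces: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Concatenate adjacent (lang, text) pieces that share a language.

    Pieces whose normalized language is "x" are absorbed by a neighbour:
    the left one if it exists, otherwise the next piece on the right.
    """
    merged: List[Tuple[str, str]] = []
    for raw_lang, text in pieces:
        lang = normalize(raw_lang)
        if merged and (lang == merged[-1][0] or lang == "x"):
            # Same language, or unknown language: append to the left neighbour.
            merged[-1] = (merged[-1][0], merged[-1][1] + text)
        elif merged and merged[-1][0] == "x":
            # The left neighbour is unknown: it adopts the current language.
            merged[-1] = (lang, merged[-1][1] + text)
        else:
            merged.append((lang, text))
    return merged


pieces = [("zh", "你"), ("zh-cn", "喜欢"), ("zh-tw", "看"), ("ja", "アニメ"), ("zh", "吗")]
for lang, text in merge_by_lang(pieces):
    print(f"{lang}:{text}")
```

Mapping `zh-tw` to `x` means a short Han-character fragment misdetected as Traditional Chinese does not start a new segment; it simply joins whichever neighbour it is merged into.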
diff --git a/event_log.txt b/event_log.txt
new file mode 100644
index 0000000..e69de29
diff --git a/setup.py b/setup.py
index 3a8ed41..b38c9b7 100644
--- a/setup.py
+++ b/setup.py
@@ -13,7 +13,7 @@ def read(*relpath):
setup(
name="split_lang",
- version="1.2.0",
+ version="1.3.0",
description="A package for splitting sentences by language (concatenating over-split substrings based on their language)",
long_description=read("README.md"),
long_description_content_type="text/markdown",
@@ -23,9 +23,8 @@ def read(*relpath):
license="MIT",
packages=find_packages(),
install_requires=[
- "langdetect-py",
"fast_langdetect",
- "wtpsplit",
+ "lingua-language-detector",
"pydantic",
"budoux",
],
diff --git a/split-lang-benchmark.ipynb b/split-lang-benchmark.ipynb
new file mode 100644
index 0000000..ae49b8a
--- /dev/null
+++ b/split-lang-benchmark.ipynb
@@ -0,0 +1,440 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Import Language Detection Package"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 272,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import polyglot\n",
+ "import langdetect\n",
+ "import fast_langdetect\n",
+ "from lingua import Language, LanguageDetectorBuilder\n",
+ "\n",
+ "detector = LanguageDetectorBuilder.from_all_languages().build()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Import Text Split Package"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 273,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\admin\\.conda\\envs\\melotts\\lib\\site-packages\\wtpsplit\\__init__.py:45: DeprecationWarning: You are using WtP, the old sentence segmentation model. It is highly encouraged to use SaT instead due to strongly improved performance and efficiency. See https://github.com/segment-any-text/wtpsplit for more info. To ignore this warning, set ignore_legacy_warning=True.\n",
+ " warnings.warn(\n",
+ "c:\\Users\\admin\\.conda\\envs\\melotts\\lib\\site-packages\\sklearn\\base.py:376: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.2.2 when using version 1.5.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
+ "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "from wtpsplit import SaT, WtP\n",
+ "sat = SaT(\"sat-1l-sm\")\n",
+ "sat.half().to(\"cuda\")\n",
+ "wtp = WtP(\"wtp-bert-mini\")\n",
+ "import budoux"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 274,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "text = \"你喜欢看アニメ吗\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 275,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 1/1 [00:00<00:00, 110.31it/s]\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "['你', '喜欢看', 'アニメ', '吗']"
+ ]
+ },
+ "execution_count": 275,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "wtp.split(text_or_texts=text, threshold=4e-5, verbose=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 290,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "texts_with_digit = [\n",
+ " \"你喜欢看アニメ吗?\",\n",
+ " \"衬衫的价格是9.15便士\",\n",
+ " \"衬衫的价格是233亿元\",\n",
+ " \"衬衫的价格是233亿元人民币\",\n",
+ "]\n",
+ "\n",
+ "texts_zh_jp_ko_en = [\n",
+ " \"我是 VGroupChatBot,一个旨在支持多人通信的助手,通过可视化消息来帮助团队成员更好地交流。我可以帮助团队成员更好地整理和共享信息,特别是在讨论、会议和Brainstorming等情况下。你好我的名字是西野くまですmy name is bob很高兴认识你どうぞよろしくお願いいたします「こんにちは」是什么意思。\",\n",
+ " \"你好,我的名字是西野くまです。I am from Tokyo, 日本の首都。今天的天气非常好,sky is clear and sunny。おはようございます、皆さん!我们一起来学习吧。Learning languages can be fun and exciting。昨日はとても忙しかったので、今日は少しリラックスしたいです。Let's take a break and enjoy some coffee。中文、日本語、and English are three distinct languages, each with its own unique charm。希望我们能一起进步,一起成长。Let's keep studying and improving our language skills together. ありがとう!\",\n",
+ " \"你好,今日はどこへ行きますか?\",\n",
+ " \"你好今日はどこへ行きますか?\",\n",
+ " \"我的名字是田中さんです。\",\n",
+ " \"我喜欢吃寿司和拉面おいしいです。\",\n",
+ " \"今天の天気はとてもいいですね。\",\n",
+ " \"我在学习日本語少し難しいです。\",\n",
+ " \"日语真是おもしろい啊\",\n",
+ " \"你喜欢看アニメ吗?\",\n",
+ " \"我想去日本旅行、特に京都に行きたいです。\",\n",
+ " \"昨天見た映画はとても感動的でした。我朋友是日本人彼はとても優しいです。\",\n",
+ " \"我们一起去カラオケ吧、楽しそうです。\",\n",
+ " \"我的家在北京、でも、仕事で東京に住んでいます。\",\n",
+ " \"我在学做日本料理、日本料理を作るのを習っています。\",\n",
+ " \"你会说几种语言、何ヶ国語話せますか?\",\n",
+ " \"我昨天看了一本书、その本はとても面白かったです。\",\n",
+ " \"你最近好吗、最近どうですか?\",\n",
+ " \"我在学做日本料理와 한국 요리、日本料理を作るのを習っています。\",\n",
+ " \"你会说几种语言、何ヶ国語話せますか?몇 개 언어를 할 수 있어요?\",\n",
+ " \"我昨天看了一本书、その本はとても面白かったです。어제 책을 읽었는데, 정말 재미있었어요。\",\n",
+ " \"我们一起去逛街와 쇼핑、買い物に行きましょう。쇼핑하러 가요。\",\n",
+ " \"你最近好吗、最近どうですか?요즘 어떻게 지내요?\",\n",
+ "]\n",
+ "\n",
+ "texts_zh_jp = [\n",
+ " \"你好今日はどこへ行きますか\",\n",
+ " \"我的名字是田中さんです\",\n",
+ " \"我喜欢吃寿司和拉面おいしいです\",\n",
+ " \"今天の天気はとてもいいですね\",\n",
+ " \"我在学习日本語少し難しいです\",\n",
+ " \"日语真是おもしろい啊\",\n",
+ " \"你喜欢看アニメ吗\",\n",
+ " \"我想去日本旅行特に京都に行きたいです\",\n",
+ " \"昨天見た映画はとても感動的でした\",\n",
+ " \"我朋友是日本人彼はとても優しいです\",\n",
+ " \"我们一起去カラオケ吧\",\n",
+ " \"我的家在北京でも仕事で東京に住んでいます\",\n",
+ " \"我的名字是西野くまです\",\n",
+ " \"我的名字是西野くまですよろしくお願いいたします\",\n",
+ " \"好吃美味しい上手い\",\n",
+ " \"我给你送的手紙\",\n",
+ " \"真是面白い\",\n",
+ " \"春の花香り\",\n",
+ " \"何ヶ国語話せますか\",\n",
+ "]\n",
+ "\n",
+ "texts_de_fr_en = [\n",
+ " \"Ich liebe Paris, c'est une belle ville, and the food is amazing!\",\n",
+ " \"Berlin ist wunderbar, je veux y retourner, and explore more.\",\n",
+ " \"Bonjour, wie geht's dir today?\",\n",
+ " \"Die Musik hier ist fantastisch, la musique est superbe, and I enjoy it a lot.\",\n",
+ " \"Guten Morgen, je t'aime, have a great day!\",\n",
+ " \"Das Wetter ist heute schön, il fait beau aujourd'hui, and it's perfect for a walk.\",\n",
+ " \"Ich mag dieses Buch, ce livre est intéressant, and it has a great story.\",\n",
+ " \"Vielen Dank, merci beaucoup, for your help.\",\n",
+ " \"Wir reisen nach Deutschland, nous voyageons en Allemagne, and we are excited.\",\n",
+ " \"Ich bin müde, je suis fatigué, and I need some rest.\",\n",
+ " \"Ich liebe Paris c'est une belle ville and the food is amazing!\",\n",
+ " \"Berlin ist wunderbar je veux y retourner and explore more.\",\n",
+ " \"Bonjour wie geht's dir today?\",\n",
+ " \"Die Musik hier ist fantastisch la musique est superbe and I enjoy it a lot.\",\n",
+ " \"Guten Morgen je t'aime have a great day!\",\n",
+ " \"Das Wetter ist heute schön il fait beau aujourd'hui and it's perfect for a walk.\",\n",
+ " \"Ich mag dieses Buch ce livre est intéressant and it has a great story.\",\n",
+ " \"Vielen Dank merci beaucoup for your help.\",\n",
+ " \"Wir reisen nach Deutschland nous voyageons en Allemagne and we are excited.\",\n",
+ " \"Ich bin müde je suis fatigué and I need some rest.\",\n",
+ "]\n",
+ "\n",
+ "texts = texts_zh_jp_ko_en + texts_de_fr_en + texts_with_digit"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 291,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import re\n",
+ "\n",
+ "chinese_char_pattern = re.compile(r\"[\\u4e00-\\u9fff]\")\n",
+ "hangul_pattern = re.compile(r\"[\\uac00-\\ud7af]\")\n",
+ "hiragana_pattern = re.compile(r\"[\\u3040-\\u309f]\")\n",
+ "katakana_pattern = re.compile(r\"[\\u30a0-\\u30ff]\")\n",
+ "\n",
+ "\n",
+ "def contains_chinese_char(text: str):\n",
+ " return bool(chinese_char_pattern.search(text))\n",
+ "\n",
+ "\n",
+ "def _contains_hiragana(text: str):\n",
+ " return bool(hiragana_pattern.search(text))\n",
+ "\n",
+ "\n",
+ "def _contains_katakana(text: str):\n",
+ " return bool(katakana_pattern.search(text))\n",
+ "\n",
+ "\n",
+ "def contains_ja(text):\n",
+ " if (\n",
+ " _contains_hiragana(text)\n",
+ " or _contains_katakana(text)\n",
+ " ):\n",
+ " return True\n",
+ " return False"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 292,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "----------jp_budoux_parser\n",
+ "[['你好', '今日は', 'どこへ', '行きますか'], ['我的名字是田中さんです'], ['我喜欢吃寿司和拉面', 'おいしいです'], ['今天の', '天気は', 'とても', 'いいですね'], ['我在学习日本語少し', '難しいです'], ['日语真是おもしろい', '啊'], ['你喜欢看アニメ吗'], ['我想去日本旅行特に', '京都に', '行きたいです'], ['昨天見た', '映画は', 'とても', '感動的でした'], ['我朋友是日本人', '彼は', 'とても', '優しいです'], ['我们一起去カラオケ吧'], ['我的家在北京でも', '仕事で', '東京に', '住んでいます'], ['我的名字是西野くまです'], ['我的名字是西野くまですよろしく', 'お願い', 'いたします'], ['好吃美味しい', '上手い'], ['我给你送的手紙'], ['真是面白い'], ['春の', '花香り'], ['何ヶ国語話せますか']]\n",
+ "----------jp_budoux_parser+zh_budoux_parser\n",
+ "['你', '好', '今日', 'は', 'どこへ', '行きますか']\n",
+ "['我', '的', '名字', '是', '田', '中', 'さんです']\n",
+ "['我', '喜欢', '吃', '寿司', '和', '拉面', 'おいしいです']\n",
+ "['今天', 'の', '天', '気は', 'とても', 'いいですね']\n",
+ "['我', '在', '学习', '日本', '語少', 'し', '難しいです']\n",
+ "['日', '语真', '是', 'おも', 'しろい', '啊']\n",
+ "['你', '喜欢', '看', 'アニメ', '吗']\n",
+ "['我', '想', '去', '日本', '旅行', '特に', '京', '都', 'に', '行きたいです']\n",
+ "['昨天', '見た', '映', '画', 'は', 'とても', '感動', '的', 'でし', 'た']\n",
+ "['我', '朋友', '是', '日本', '人', '彼は', 'とても', '優しいです']\n",
+ "['我们', '一起', '去', 'カラオケ', '吧']\n",
+ "['我', '的', '家', '在', '北京', 'でも', '仕事で', '東京', 'に', '住ん', 'でいます']\n",
+ "['我', '的', '名字', '是', '西野', 'くまです']\n",
+ "['我', '的', '名字', '是', '西野', 'くまですよろしく', 'お願い', 'いたします']\n",
+ "['好', '吃', '美味', 'しい', '上', '手い']\n",
+ "['我', '给', '你', '送', '的', '手紙']\n",
+ "['真', '是', '面白い']\n",
+ "['春の', '花香り']\n",
+ "['何', 'ヶ国', '語話', 'せますか']\n",
+ "[['你', '好', '今日', 'は', 'どこへ', '行きますか'], ['我', '的', '名字', '是', '田', '中', 'さんです'], ['我', '喜欢', '吃', '寿司', '和', '拉面', 'おいしいです'], ['今天', 'の', '天', '気は', 'とても', 'いいですね'], ['我', '在', '学习', '日本', '語少', 'し', '難しいです'], ['日', '语真', '是', 'おも', 'しろい', '啊'], ['你', '喜欢', '看', 'アニメ', '吗'], ['我', '想', '去', '日本', '旅行', '特に', '京', '都', 'に', '行きたいです'], ['昨天', '見た', '映', '画', 'は', 'とても', '感動', '的', 'でし', 'た'], ['我', '朋友', '是', '日本', '人', '彼は', 'とても', '優しいです'], ['我们', '一起', '去', 'カラオケ', '吧'], ['我', '的', '家', '在', '北京', 'でも', '仕事で', '東京', 'に', '住ん', 'でいます'], ['我', '的', '名字', '是', '西野', 'くまです'], ['我', '的', '名字', '是', '西野', 'くまですよろしく', 'お願い', 'いたします'], ['好', '吃', '美味', 'しい', '上', '手い'], ['我', '给', '你', '送', '的', '手紙'], ['真', '是', '面白い'], ['春の', '花香り'], ['何', 'ヶ国', '語話', 'せますか']]\n",
+ "----------jp_budoux_parser+zh_budoux_parser+combine single to left\n",
+ "['你好', '今日', 'はどこへ行きますか']\n",
+ "['我的', '名字是田中', 'さんです']\n",
+ "['我喜欢吃', '寿司和', '拉面', 'おいしいです']\n",
+ "['今天', 'の', '天', '気はとてもいいですね']\n",
+ "['我在', '学习', '日本', '語少', 'し難しいです']\n",
+ "['日语真是', 'おもしろい', '啊']\n",
+ "['你喜欢看', 'アニメ', '吗']\n",
+ "['我想去', '日本', '旅行', '特に', '京都', 'に行きたいです']\n",
+ "['昨天', '見た', '映画', 'はとても', '感動的', 'でした']\n",
+ "['我朋友是', '日本人', '彼はとても優しいです']\n",
+ "['我们', '一起去', 'カラオケ', '吧']\n",
+ "['我的家在', '北京', 'でも仕事で', '東京', 'に住んでいます']\n",
+ "['我的', '名字是', '西野', 'くまです']\n",
+ "['我的', '名字是', '西野', 'くまですよろしくお願いいたします']\n",
+ "['好吃', '美味', 'しい', '上', '手い']\n",
+ "['我给你送的', '手紙']\n",
+ "['真是', '面白い']\n",
+ "['春の花香り']\n",
+ "['何ヶ国', '語話', 'せますか']\n",
+ "[['你好', '今日', 'はどこへ行きますか'], ['我的', '名字是田中', 'さんです'], ['我喜欢吃', '寿司和', '拉面', 'おいしいです'], ['今天', 'の', '天', '気はとてもいいですね'], ['我在', '学习', '日本', '語少', 'し難しいです'], ['日语真是', 'おもしろい', '啊'], ['你喜欢看', 'アニメ', '吗'], ['我想去', '日本', '旅行', '特に', '京都', 'に行きたいです'], ['昨天', '見た', '映画', 'はとても', '感動的', 'でした'], ['我朋友是', '日本人', '彼はとても優しいです'], ['我们', '一起去', 'カラオケ', '吧'], ['我的家在', '北京', 'でも仕事で', '東京', 'に住んでいます'], ['我的', '名字是', '西野', 'くまです'], ['我的', '名字是', '西野', 'くまですよろしくお願いいたします'], ['好吃', '美味', 'しい', '上', '手い'], ['我给你送的', '手紙'], ['真是', '面白い'], ['春の花香り'], ['何ヶ国', '語話', 'せますか']]\n"
+ ]
+ }
+ ],
+ "source": [
+ "from typing import List\n",
+ "\n",
+ "\n",
+ "zh_budoux_parser = budoux.load_default_simplified_chinese_parser()\n",
+ "zh_tc_budoux_parser = budoux.load_default_traditional_chinese_parser()\n",
+ "jp_budoux_parser = budoux.load_default_japanese_parser()\n",
+ "\n",
+ "# print(\"zh_budoux_parser----------\")\n",
+ "# for text in texts_zh_jp:\n",
+ "# print(zh_budoux_parser.parse(text))\n",
+ " \n",
+ "# print(\"zh_tc_budoux_parser----------\")\n",
+ "# for text in texts_zh_jp:\n",
+ "# print(zh_tc_budoux_parser.parse(text))\n",
+ " \n",
+ "# print(\"jp_budoux_parser----------\")\n",
+ "# for text in texts_zh_jp:\n",
+ "# print(jp_budoux_parser.parse(text))\n",
+ "\n",
+ "\n",
+ "\n",
+ "# print(\"----------wtp\")\n",
+ "# for text in texts_zh_jp:\n",
+ "# print(wtp.split(text_or_texts=text, threshold=5e-4, verbose=False))\n",
+ "print(\"----------jp_budoux_parser\")\n",
+ "\n",
+ "splitted_texts_jp = []\n",
+ "for text in texts_zh_jp:\n",
+ " jp_split_text = jp_budoux_parser.parse(text)\n",
+ " splitted_texts_jp.append(jp_split_text)\n",
+ "print(splitted_texts_jp)\n",
+ "print(\"----------jp_budoux_parser+zh_budoux_parser\")\n",
+ "\n",
+ "splitted_texts_zh_jp = []\n",
+ "for substrings in splitted_texts_jp:\n",
+ " words = []\n",
+ " for substring in substrings:\n",
+ " words.extend(zh_budoux_parser.parse(substring))\n",
+ " print(words)\n",
+ " splitted_texts_zh_jp.append(words)\n",
+ "print(splitted_texts_zh_jp)\n",
+ "\n",
+ "print(\"----------jp_budoux_parser+zh_budoux_parser+combine single to left\")\n",
+ "pre_split_texts:List[List[str]] = []\n",
+ "for words in splitted_texts_zh_jp:\n",
+ " new_words = [words[0]]\n",
+ " for sub_text in words[1:]:\n",
+ " is_left_ja = contains_ja(new_words[-1])\n",
+ " is_cur_ja = contains_ja(sub_text)\n",
+ " is_both_same_lang = is_left_ja == is_cur_ja\n",
+ " is_both_ja = is_left_ja == True and is_both_same_lang\n",
+ " is_both_zh = is_left_ja == False and is_both_same_lang\n",
+ " if is_both_ja: # both substring is katakana or hiragana, then concat\n",
+ " new_words[-1] += sub_text\n",
+ " elif is_both_zh and len(sub_text) == 1:\n",
+    "            # NOTE: neither substring contains kana, and the current one is a single character\n",
+    "            # NOTE: about 90% of the time this is because we first use ja_parser and then zh_parser (from `budoux`)\n",
+    "            # NOTE: kanji in Japanese rarely appear on their own, so a single character is most likely CAUSED BY zh_parser\n",
+    "            # NOTE: so we concatenate the single character to the left substring when neither substring contains kana\n",
+ " new_words[-1] += sub_text\n",
+ " else:\n",
+ " new_words.append(sub_text)\n",
+ " \n",
+ " if len(new_words) >= 2 and len(new_words[0]) == 1:\n",
+ " new_words[1] = new_words[0] + new_words[1]\n",
+ " new_words = new_words[1:]\n",
+ " pre_split_texts.append(new_words) \n",
+ " print(new_words) \n",
+ "print(pre_split_texts)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 293,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from lingua import Language, LanguageDetectorBuilder\n",
+ "all_detector = LanguageDetectorBuilder.from_all_languages().build()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 294,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "zh:你好|ja:今日|ja:はどこへ行きますか|\n",
+ "zh:我的|zh:名字是田中|ja:さんです|\n",
+ "zh:我喜欢吃|ja:寿司和|zh:拉面|ja:おいしいです|\n",
+ "zh:今天|ja:の|zh:天|ja:気はとてもいいですね|\n",
+ "zh:我在|zh:学习|ja:日本|zh:語少|ja:し難しいです|\n",
+ "zh:日语真是|ja:おもしろい|zh:啊|\n",
+ "zh:你喜欢看|ja:アニメ|zh:吗|\n",
+ "zh:我想去|ja:日本|ja:旅行|ja:特に|ja:京都|ja:に行きたいです|\n",
+ "zh:昨天|ja:見た|ja:映画|ja:はとても|zh:感動的|ja:でした|\n",
+ "zh:我朋友是|zh:日本人|ja:彼はとても優しいです|\n",
+ "zh:我们|zh:一起去|ja:カラオケ|zh:吧|\n",
+ "zh:我的家在|ja:北京|ja:でも仕事で|ja:東京|ja:に住んでいます|\n",
+ "zh:我的|zh:名字是|ja:西野|ja:くまです|\n",
+ "zh:我的|zh:名字是|ja:西野|ja:くまですよろしくお願いいたします|\n",
+ "zh:好吃|zh:美味|ja:しい|ja:上|ja:手い|\n",
+ "zh:我给你送的|ja:手紙|\n",
+ "zh:真是|ja:面白い|\n",
+ "ja:春の花香り|\n",
+ "zh:何ヶ国|ja:語話|ja:せますか|\n"
+ ]
+ }
+ ],
+ "source": [
+ "from langdetect.lang_detect_exception import LangDetectException\n",
+ "\n",
+ "def lingua_lang_detect_all(text: str) -> str:\n",
+ " language: Language | None = all_detector.detect_language_of(text=text)\n",
+ " if language is None:\n",
+ " return \"x\"\n",
+ " return language.iso_code_639_1.name.lower()\n",
+ "\n",
+ "def fast_lang_detect(text: str) -> str:\n",
+ " result = str(fast_langdetect.detect(text, low_memory=False)[\"lang\"])\n",
+ " result = result.lower()\n",
+ " return result\n",
+ "\n",
+ "def lang_detect(text: str) -> str:\n",
+ " try:\n",
+ " result = str(langdetect.detect(text))\n",
+ " result = result.lower()\n",
+ " return result\n",
+ " except LangDetectException as e:\n",
+ " return \"zh\"\n",
+ " except Exception as e:\n",
+ " pass\n",
+ " return \"x\"\n",
+ "\n",
+ "\n",
+ "for substrings in pre_split_texts:\n",
+ " for substring in substrings:\n",
+ " # lang = lingua_lang_detect_all(substring)\n",
+ " lang = fast_lang_detect(substring)\n",
+ " \n",
+ " print(f\"{lang}:{substring}\",end='|')\n",
+ " print()"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "melotts",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.14"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
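The "combine single to left" cell in the notebook above can be restated as a standalone function, which makes it possible to check it against the output recorded in the notebook. This is a sketch under assumptions: `contains_kana` and `combine_single_to_left` are illustrative names, and the single regex range simply joins the notebook's hiragana and katakana ranges.

```python
import re
from typing import List

# Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) are adjacent blocks,
# so one range covers both patterns used in the notebook.
KANA_PATTERN = re.compile(r"[\u3040-\u30ff]")


def contains_kana(text: str) -> bool:
    return bool(KANA_PATTERN.search(text))


def combine_single_to_left(words: List[str]) -> List[str]:
    """Re-join over-split pieces produced by running two budoux parsers in sequence."""
    if not words:
        return []
    merged = [words[0]]
    for current in words[1:]:
        left_has_kana = contains_kana(merged[-1])
        current_has_kana = contains_kana(current)
        if left_has_kana and current_has_kana:
            # Both pieces contain kana: treat them as one Japanese run.
            merged[-1] += current
        elif not left_has_kana and not current_has_kana and len(current) == 1:
            # A lone Han character with no kana on either side joins the left piece.
            merged[-1] += current
        else:
            merged.append(current)
    # A leading single character is attached to the piece that follows it.
    if len(merged) >= 2 and len(merged[0]) == 1:
        merged[1] = merged[0] + merged[1]
        merged = merged[1:]
    return merged


print(combine_single_to_left(["你", "喜欢", "看", "アニメ", "吗"]))
```

The two inputs used in the checks are taken from the notebook's `jp_budoux_parser+zh_budoux_parser` output, and the expected results from its "combine single to left" output.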
diff --git a/split_lang/__init__.py b/split_lang/__init__.py
index fb61196..26d990b 100644
--- a/split_lang/__init__.py
+++ b/split_lang/__init__.py
@@ -1,3 +1,3 @@
-from .detect_lang.detector import DEFAULT_LANG, LANG_MAP
-from .split.model import SubString, SubStringSection
-from .split.splitter import TextSplitter, split, split_by_lang
+from .split.splitter import LangSplitter
+from .model import SubString, SubStringSection, LangSectionType
+from .config import DEFAULT_LANG, DEFAULT_LANG_MAP
diff --git a/split_lang/config.py b/split_lang/config.py
new file mode 100644
index 0000000..cdf312a
--- /dev/null
+++ b/split_lang/config.py
@@ -0,0 +1,14 @@
+DEFAULT_LANG_MAP = {
+ "zh": "zh",
+    "yue": "zh",  # Cantonese (粤语)
+    "wuu": "zh",  # Wu Chinese (吴语)
+ "zh-cn": "zh",
+ "zh-tw": "x",
+ "ko": "ko",
+ "ja": "ja",
+ "de": "de",
+ "fr": "fr",
+ "en": "en",
+ "hr": "en",
+}
+DEFAULT_LANG = "x"
diff --git a/split_lang/detect_lang/__init__.py b/split_lang/detect_lang/__init__.py
index f5847cc..b8618d6 100644
--- a/split_lang/detect_lang/__init__.py
+++ b/split_lang/detect_lang/__init__.py
@@ -1 +1 @@
-from .detector import LANG_MAP, DEFAULT_LANG, detect_lang, detect_lang_combined
+from .detector import detect_lang_combined, possible_detection_list
diff --git a/split_lang/detect_lang/detector.py b/split_lang/detect_lang/detector.py
index 496377d..c1f27a1 100644
--- a/split_lang/detect_lang/detector.py
+++ b/split_lang/detect_lang/detector.py
@@ -1,44 +1,20 @@
import logging
-import langdetect
+from typing import List
import fast_langdetect
-from langdetect.lang_detect_exception import LangDetectException
+from lingua import LanguageDetectorBuilder
+
+from ..model import LangSectionType
+from ..split.utils import contains_ja
-logger = logging.getLogger(__name__)
-LANG_MAP = {
- "zh": "zh",
- "yue": "zh", # 粤语
- "wuu": "zh", # 吴语
- "zh-cn": "zh",
- "zh-tw": "x",
- "ko": "ko",
- "ja": "ja",
- "de": "de",
- "fr": "fr",
- "en": "en",
-}
-DEFAULT_LANG = "x"
-
-
-def detect_lang(text: str) -> str:
- try:
- result = str(langdetect.detect(text))
- result = result.lower()
- return result
- except LangDetectException as e:
- logger.debug(
- "Language detection of `%s` using `langdetect.detect(text)` failed: %s",
- text,
- e,
- )
- return "zh"
- except Exception as e:
- logger.debug(
- "An unexpected error occurred of `%s` using `langdetect.detect(text)`: %s",
- text,
- e,
- )
- return "x"
+all_detector = (
+ LanguageDetectorBuilder.from_all_languages()
+ .with_preloaded_language_models()
+ .build()
+)
+
+
+logger = logging.getLogger(__name__)
def fast_lang_detect(text: str) -> str:
@@ -47,8 +23,24 @@ def fast_lang_detect(text: str) -> str:
return result
+def lingua_lang_detect_all(text: str) -> str:
+ language = all_detector.detect_language_of(text=text)
+ if language is None:
+ return "x"
+ return language.iso_code_639_1.name.lower()
+
+
# For example '衬衫' cannot be detected by `langdetect`, and `fast_langdetect` will detect it as 'en'
-def detect_lang_combined(text: str, text_len_threshold=3) -> str:
- if len(text) <= text_len_threshold:
- return detect_lang(text)
- return fast_lang_detect(text=text)
+def detect_lang_combined(text: str, lang_section_type: LangSectionType) -> str:
+ if lang_section_type is LangSectionType.ZH_JA:
+ if contains_ja(text):
+ return "ja"
+ return fast_lang_detect(text)
+ return lingua_lang_detect_all(text)
+
+
+def possible_detection_list(text) -> List[str]:
+ languages = []
+ languages.append(fast_lang_detect(text))
+ languages.append(lingua_lang_detect_all(text))
+ return languages
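The dispatch in the new `detect_lang_combined` can be exercised without loading any language models by injecting stub detectors. The sketch below is an assumption-laden stand-in rather than the package's code: `SectionType` mimics `LangSectionType` (the `"zh_ja"` value is a guess), and `fast_detect` / `fallback_detect` are hypothetical callables standing in for `fast_lang_detect` and `lingua_lang_detect_all`.

```python
import re
from enum import Enum
from typing import Callable, List


class SectionType(Enum):
    # Stand-in for LangSectionType; the "zh_ja" value is an assumption.
    ZH_JA = "zh_ja"
    OTHERS = "others"


KANA_PATTERN = re.compile(r"[\u3040-\u309f\u30a0-\u30ff]")


def detect_combined(
    text: str,
    section_type: SectionType,
    fast_detect: Callable[[str], str],
    fallback_detect: Callable[[str], str],
) -> str:
    """Mirror the branching in the diff: kana check first, then a detector per section type."""
    if section_type is SectionType.ZH_JA:
        if KANA_PATTERN.search(text):
            # Any hiragana or katakana settles the question without calling a model.
            return "ja"
        return fast_detect(text)
    return fallback_detect(text)


calls: List[str] = []


def stub_fast(text: str) -> str:
    calls.append("fast")
    return "zh"


def stub_fallback(text: str) -> str:
    calls.append("fallback")
    return "en"


print(detect_combined("アニメ", SectionType.ZH_JA, stub_fast, stub_fallback))
```

Passing the detectors in as arguments is only a testing convenience for this sketch; the diff itself calls module-level functions directly.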
diff --git a/split_lang/split/model.py b/split_lang/model.py
similarity index 98%
rename from split_lang/split/model.py
rename to split_lang/model.py
index 58ae21e..034ef49 100644
--- a/split_lang/split/model.py
+++ b/split_lang/model.py
@@ -10,6 +10,7 @@ class LangSectionType(Enum):
PUNCTUATION = "punctuation"
DIGIT = "digit"
OTHERS = "others"
+ ALL = "all"
class SubString(BaseModel):
diff --git a/split_lang/split/__init__.py b/split_lang/split/__init__.py
index 5a7f481..8210e04 100644
--- a/split_lang/split/__init__.py
+++ b/split_lang/split/__init__.py
@@ -1,3 +1,2 @@
-from .model import SubString, SubStringSection, LangSectionType
-from .splitter import split_by_lang, split, TextSplitter
-from .utils import PUNCTUATION, DEFAULT_THRESHOLD, contains_hangul, contains_zh_ja
+from .splitter import LangSplitter
+from .utils import PUNCTUATION, contains_hangul, contains_zh_ja, contains_ja
diff --git a/split_lang/split/splitter.py b/split_lang/split/splitter.py
index 19de6ba..2d77dd5 100644
--- a/split_lang/split/splitter.py
+++ b/split_lang/split/splitter.py
@@ -2,13 +2,14 @@
from typing import Dict, List
import budoux
-from wtpsplit import SaT, WtP
-budoux_parser = budoux.load_default_simplified_chinese_parser()
+zh_budoux_parser = budoux.load_default_simplified_chinese_parser()
+jp_budoux_parser = budoux.load_default_japanese_parser()
-from ..detect_lang.detector import DEFAULT_LANG, LANG_MAP, detect_lang_combined
-from .model import LangSectionType, SubString, SubStringSection
-from .utils import DEFAULT_THRESHOLD, PUNCTUATION, contains_hangul, contains_zh_ja
+from ..config import DEFAULT_LANG, DEFAULT_LANG_MAP
+from ..detect_lang.detector import detect_lang_combined, possible_detection_list
+from ..model import LangSectionType, SubString, SubStringSection
+from .utils import PUNCTUATION, contains_hangul, contains_zh_ja, contains_ja
logging.basicConfig(
level=logging.WARNING,
@@ -17,562 +18,570 @@
logger = logging.getLogger(__name__)
-class TextSplitter:
- """
- Base class for splitting text into substrings.
-
- This class provides a default implementation using a WtP model.
- Users can override the `split` method to implement their own custom splitter.
-
- Attributes:
- wtp_split_model (WtP | SaT): The model used for splitting text.
- """
+class LangSplitter:
+ def __init__(
+ self,
+ lang_map: Dict = None,
+ default_lang: str = DEFAULT_LANG,
+ debug: bool = False,
+ merge_across_punctuation: bool = True,
+ merge_across_digit: bool = True,
+ ) -> None:
+ self.lang_map = lang_map if lang_map is not None else DEFAULT_LANG_MAP
+ self.default_lang = default_lang
+ self.debug = debug
+ self.merge_across_punctuation = merge_across_punctuation
+ self.merge_across_digit = merge_across_digit
+
+ def split_by_lang(
+ self,
+ text: str,
+ ) -> List[SubString]:
+
+ sections = self.split(text=text)
+ substrings: List[SubString] = []
+ for section in sections:
+ substrings.extend(section.substrings)
+
+ substrings = self._merge_digit(substrings=substrings)
+
+ if self.merge_across_digit:
+ substrings = self._merge_substring_across_digit(substrings=substrings)
+
+ if self.merge_across_punctuation:
+ substrings = self._merge_substrings_across_punctuation(
+ substrings=substrings,
+ )
- def __init__(self, wtp_split_model: WtP | SaT = WtP("wtp-bert-mini")):
- self.wtp_split_model = wtp_split_model
+ return substrings
def split(
- self, text: str, threshold: float = DEFAULT_THRESHOLD, verbose=False
- ) -> List[str]:
- """
- Split the given text into substrings.
-
- Args:
- text (str): The text to be split.
- threshold (float, optional): The threshold for splitting. Defaults to `DEFAULT_THRESHOLD`.
- verbose (bool, optional): If True, provides verbose output. Defaults to False.
-
- Returns:
- List[str]: A list of substrings.
-
- Note:
- Override this method to implement a custom splitter.
- """
- if text in PUNCTUATION or text.isdigit():
- return text
- return self.wtp_split_model.split(
- text_or_texts=text, threshold=threshold, verbose=verbose
- )
-
-
-def split_by_lang(
- text: str,
- threshold: float = DEFAULT_THRESHOLD,
- lang_map: Dict = None,
- default_lang: str = DEFAULT_LANG,
- verbose=False,
- splitter: TextSplitter = TextSplitter(),
- merge_across_punctuation: bool = False,
- merge_across_digit: bool = True,
-) -> List[SubString]:
- """_summary_
-
- Args:
- text (str): _description_
- threshold (float, optional): _description_. Defaults to DEFAULT_THRESHOLD.
- lang_map (Dict, optional): _description_. Defaults to None.
- default_lang (str, optional): default language to fallback. Defaults to `DEFAULT_LANG`.
- verbose (bool, optional): _description_. Defaults to False.
- splitter (TextSplitter, optional): _description_. Defaults to default_sentence_splitter.
-
- Returns:
- List[SubString]: substring with .lang and .text
- """
- if merge_across_digit is None:
- merge_across_digit = merge_across_punctuation
-
- sections = split(
- text=text,
- threshold=threshold,
- lang_map=lang_map,
- default_lang=default_lang,
- verbose=verbose,
- splitter=splitter,
- )
- substrings: List[SubString] = []
- for section in sections:
- substrings.extend(section.substrings)
-
- substrings = _merge_digit(substrings=substrings)
-
- if merge_across_digit:
- substrings = _merge_substring_across_digit(substrings=substrings)
-
- if merge_across_punctuation:
- substrings = _merge_substrings_across_punctuation(
- substrings=substrings,
- )
-
- return substrings
-
-
-def split(
- text: str,
- threshold: float = DEFAULT_THRESHOLD,
- lang_map: Dict = None,
- default_lang: str = DEFAULT_LANG,
- verbose=False,
- splitter: TextSplitter = TextSplitter(),
-) -> List[SubStringSection]:
- """using
- 1. `wtpsplit` to split sentences into 'small' substring
- 2. concat substring based on language using `fasttext` and `langdetect`
-
- Args:
- text (str): text to split
- threshold (float, optional): the lower the more separated (more) substring will return. Defaults to DEFAULT_THRESHOLD (if your text contains no Chinese, Japanese, Korean, 4.9e-4 is suggested)
- lang_map (_type_, optional): mapping different language to same language for better result, if you know the range of your target languages. Defaults to None.
- default_lang (str, optional): default language to fallback. Defaults to `DEFAULT_LANG`.
- verbose (bool, optional): print the process. Defaults to False.
- splitter (TextSplitter, optional): sentence splitter. Defaults to default_sentence_splitter.
-
- Returns:
- List[SubStringSection]: Multiple sections (separate by punctuation), each section contains substring with .lang and .text
- """
- # MARK: pre split by languages
- # NOTE: since some language share some characters (e.g. Chinese and Japanese)
- # NOTE: while Korean has their own characters,
- # NOTE: For Cyrillic alphabet, Latin alphabet, a lot of languages share the alphabet
- pre_split_section = _pre_split(text=text)
- if verbose:
- logger.info("---------pre_split_section:")
+ self,
+ text: str,
+ ) -> List[SubStringSection]:
+ # MARK: pre-split by languages
+ # NOTE: some languages share characters (e.g. Chinese and Japanese),
+ # NOTE: while Korean has its own characters (Hangul);
+ # NOTE: many languages share the Cyrillic or the Latin alphabet
+ pre_split_section = self._pre_split(text=text)
+ if self.debug:
+ logger.info("---------pre_split_section:")
+ for section in pre_split_section:
+ logger.info(section)
+
+ section_index = 0
for section in pre_split_section:
- logger.info(section)
-
- # MARK: wtpsplit / budoux
- section_index = 0
- for section in pre_split_section:
- section_len = len(section.text)
- if section.lang_section_type is LangSectionType.PUNCTUATION:
- section.substrings.append(
- SubString(
- is_digit=False,
- is_punctuation=True,
- text=section.text,
- lang="punctuation",
- index=section_index,
- length=section_len,
- )
- )
- elif section.lang_section_type is LangSectionType.DIGIT:
- section.substrings.append(
- SubString(
- is_digit=True,
- is_punctuation=False,
- text=section.text,
- lang="digit",
- index=section_index,
- length=section_len,
+ section_len = len(section.text)
+ if section.lang_section_type is LangSectionType.PUNCTUATION:
+ section.substrings.append(
+ SubString(
+ is_digit=False,
+ is_punctuation=True,
+ text=section.text,
+ lang="punctuation",
+ index=section_index,
+ length=section_len,
+ )
)
- )
- else:
- substrings: List[str] = []
- if section.lang_section_type is LangSectionType.OTHERS:
- substrings = splitter.split(
- text=section.text, threshold=threshold, verbose=verbose
+ else:
+ substrings: List[str] = []
+ lang_section_type = LangSectionType.OTHERS
+ if section.lang_section_type is LangSectionType.OTHERS:
+ substrings = self._parse_without_zh_ja(section.text)
+ elif section.lang_section_type is LangSectionType.KO:
+ substrings = [section.text]
+ lang_section_type = LangSectionType.KO
+ elif section.lang_section_type is LangSectionType.ZH_JA:
+ substrings = self._parse_zh_ja(section.text)
+ lang_section_type = LangSectionType.ZH_JA
+
+ substrings_with_lang = self._init_substr_lang(
+ texts=substrings,
+ lang_section_type=lang_section_type,
)
- elif section.lang_section_type is LangSectionType.KO:
- substrings = [section.text]
- elif section.lang_section_type is LangSectionType.ZH_JA:
- substrings = budoux_parser.parse(section.text)
-
- # MARK: initialize language detect
- substrings_with_lang = _init_substr_lang(
- texts=substrings, lang_map=lang_map
- )
- for substr in substrings_with_lang:
- substr.index += section_index
- section.substrings = substrings_with_lang
+ for substr in substrings_with_lang:
+ substr.index += section_index
+ section.substrings = substrings_with_lang
- section_index += section_len
+ section_index += section_len
- if verbose:
- logger.info("---------_init_substr_lang")
- for section in pre_split_section:
- logger.info(section)
-
- # MARK: smart merge substring together
- wtpsplit_section = pre_split_section
- for section in wtpsplit_section:
- if section.lang_section_type is LangSectionType.PUNCTUATION:
- continue
- smart_concat_result = _smart_merge(
- substr_list=section.substrings,
- lang_map=lang_map,
- default_lang=default_lang,
- )
- section.substrings.clear()
- section.substrings = smart_concat_result
- if verbose:
- logger.info("---------smart_concat_result")
- for section in wtpsplit_section:
- logger.info(section)
- return wtpsplit_section
+ if self.debug:
+ logger.info("---------_init_substr_lang")
+ for section in pre_split_section:
+ logger.info(section)
+ # MARK: smart merge substring together
+ wtpsplit_section = pre_split_section
+ for section in wtpsplit_section:
+ if section.lang_section_type is LangSectionType.PUNCTUATION:
+ continue
+ smart_concat_result = self._smart_merge(
+ substr_list=section.substrings,
+ lang_section_type=section.lang_section_type,
+ )
+ section.substrings.clear()
+ section.substrings = smart_concat_result
+ if self.debug:
+ logger.info("---------smart_concat_result")
+ for section in wtpsplit_section:
+ logger.info(section)
+ return wtpsplit_section
+
+ def _parse_without_zh_ja(self, text: str):
+ words: List[str] = []
+ exist_space = False
+ chars = []
+ for char in text:
+ if char.isspace() is False:
+ if exist_space:
+ words.append("".join(chars))
+ chars.clear()
+ exist_space = False
+ chars.append(char)
+ else:
+ exist_space = True
+ chars.append(char)
+ if len(chars) > 0:
+ words.append("".join(chars))
+
+ return words
+
+ def _parse_zh_ja(self, text):
+ splitted_texts_jp = []
+ splitted_texts_jp = jp_budoux_parser.parse(text)
+
+ splitted_texts_zh_jp = []
+ for substring in splitted_texts_jp:
+ splitted_texts_zh_jp.extend(zh_budoux_parser.parse(substring))
+
+ new_substrings = [splitted_texts_zh_jp[0]]
+ for substring in splitted_texts_zh_jp[1:]:
+ is_left_ja = contains_ja(new_substrings[-1])
+ is_cur_ja = contains_ja(substring)
+ is_both_same_lang = is_left_ja == is_cur_ja
+ is_both_ja = is_left_ja is True and is_both_same_lang
+ is_both_zh = is_left_ja is False and is_both_same_lang
+ if is_both_ja: # both substrings contain kana (hiragana or katakana), so concatenate them
+ new_substrings[-1] += substring
+ elif is_both_zh and len(substring) == 1:
+ # NOTE: both substrings consist of Chinese characters only, and the current one is a single character
+ # NOTE: this is most likely because we run ja_parser first and then zh_parser (both from `budoux`)
+ # NOTE: kanji in Japanese rarely appear on their own, so a single character is most likely CAUSED BY zh_parser
+ # NOTE: so we concatenate the single character to the previous substring when neither contains kana
+ new_substrings[-1] += substring
+ else:
+ new_substrings.append(substring)
-def _pre_split(text: str) -> List[SubStringSection]:
- """
- 1. split Chinese, Japanese and Korean substring and other languages
- 2. split punctuation
+ if len(new_substrings) >= 2 and len(new_substrings[0]) == 1:
+ new_substrings[1] = new_substrings[0] + new_substrings[1]
+ new_substrings = new_substrings[1:]
- Args:
- text (str): input text
+ return new_substrings
- Returns:
- List[str]: list of substring
- """
- sections = []
- current_lang: LangSectionType = LangSectionType.OTHERS
- current_text = []
+ def _pre_split(self, text: str) -> List[SubStringSection]:
+ sections = []
+ current_lang: LangSectionType = LangSectionType.OTHERS
+ current_text = []
- def add_substring(lang_section_type: LangSectionType):
- if current_text:
- concat_text = "".join(current_text)
+ def add_substring(lang_section_type: LangSectionType):
+ if current_text:
+ concat_text = "".join(current_text)
- sections.append(
- SubStringSection(
- lang_section_type=lang_section_type,
- text=concat_text,
- substrings=[],
+ sections.append(
+ SubStringSection(
+ lang_section_type=lang_section_type,
+ text=concat_text,
+ substrings=[],
+ )
)
+ current_text.clear()
+
+ for char in text:
+ if char.isspace() is False:
+ if contains_zh_ja(char):
+ if current_lang != LangSectionType.ZH_JA:
+ add_substring(current_lang)
+ current_lang = LangSectionType.ZH_JA
+ elif contains_hangul(char):
+ if current_lang != LangSectionType.KO:
+ add_substring(current_lang)
+ current_lang = LangSectionType.KO
+ elif char in PUNCTUATION:
+ add_substring(current_lang)
+ current_lang = LangSectionType.PUNCTUATION
+ else:
+ if current_lang != LangSectionType.OTHERS:
+ add_substring(current_lang)
+ current_lang = LangSectionType.OTHERS
+ current_text.append(char)
+
+ add_substring(current_lang)
+ return sections
+
+ def _smart_merge(
+ self,
+ substr_list: List[SubString],
+ lang_section_type: LangSectionType,
+ ):
+ is_concat_complete = False
+ while is_concat_complete is False:
+ substr_list = self._smart_concat_logic(
+ substr_list,
+ lang_section_type=lang_section_type,
)
- # substrings.append("".join(current_text))
- current_text.clear()
+ is_concat_complete = True
- for char in text:
- if char.isspace() is False:
- if contains_zh_ja(char):
- if current_lang != LangSectionType.ZH_JA:
- add_substring(current_lang)
- current_lang = LangSectionType.ZH_JA
- elif contains_hangul(char):
- if current_lang != LangSectionType.KO:
- add_substring(current_lang)
- current_lang = LangSectionType.KO
- elif char in PUNCTUATION:
- add_substring(current_lang)
- current_lang = LangSectionType.PUNCTUATION
- else:
- if current_lang != LangSectionType.OTHERS:
- add_substring(current_lang)
- current_lang = LangSectionType.OTHERS
- current_text.append(char)
-
- add_substring(current_lang)
- return sections
-
-
-def _smart_merge(
- substr_list: List[SubString],
- lang_map: Dict = None,
- default_lang: str = DEFAULT_LANG,
-):
- if lang_map is None:
- lang_map = LANG_MAP
- is_concat_complete = False
- while is_concat_complete is False:
- substr_list = _smart_concat_logic(
- substr_list, lang_map=lang_map, default_lang=default_lang
- )
- is_concat_complete = True
-
- for index, block in enumerate(substr_list):
- if block.lang == "x":
- is_concat_complete = False
- break
- if index < len(substr_list) - 1:
- if substr_list[index].lang == substr_list[index + 1].lang:
+ for index, block in enumerate(substr_list):
+ if block.lang == "x":
is_concat_complete = False
break
- return substr_list
-
-
-# MARK: _init_substr_lang
-def _init_substr_lang(texts: List[str], lang_map: Dict = None) -> List[SubString]:
- substrings = []
- if lang_map is None:
- lang_map = LANG_MAP
-
- substring_index = 0
- for text in texts:
- length = len(text)
- if text in PUNCTUATION:
- substrings.append(
- SubString(
- is_punctuation=True,
- is_digit=False,
- lang="punctuation",
- text=text,
- length=length,
- index=substring_index,
+ if index < len(substr_list) - 1:
+ if substr_list[index].lang == substr_list[index + 1].lang:
+ is_concat_complete = False
+ break
+ return substr_list
+
+ # MARK: _init_substr_lang
+ def _init_substr_lang(
+ self,
+ texts: List[str],
+ lang_section_type: LangSectionType,
+ ) -> List[SubString]:
+ substrings = []
+
+ substring_index = 0
+ for text in texts:
+ length = len(text)
+ if text in PUNCTUATION:
+ substrings.append(
+ SubString(
+ is_punctuation=True,
+ is_digit=False,
+ lang="punctuation",
+ text=text,
+ length=length,
+ index=substring_index,
+ )
)
- )
- elif text.strip().isdigit():
- substrings.append(
- SubString(
- is_punctuation=False,
- is_digit=True,
- lang="digit",
- text=text,
- length=length,
- index=substring_index,
+ elif text.strip().isdigit():
+ substrings.append(
+ SubString(
+ is_punctuation=False,
+ is_digit=True,
+ lang="digit",
+ text=text,
+ length=length,
+ index=substring_index,
+ )
)
- )
- else:
- cur_lang = detect_lang_combined(text)
- cur_lang = lang_map.get(cur_lang, "x")
- substrings.append(
- SubString(
- is_digit=False,
- is_punctuation=False,
- lang=cur_lang,
- text=text,
- length=length,
- index=substring_index,
+ else:
+ cur_lang = detect_lang_combined(
+ text, lang_section_type=lang_section_type
+ )
+ cur_lang = self.lang_map.get(cur_lang, self.default_lang)
+ substrings.append(
+ SubString(
+ is_digit=False,
+ is_punctuation=False,
+ lang=cur_lang,
+ text=text,
+ length=length,
+ index=substring_index,
+ )
)
- )
-
- substring_index += length
- return substrings
-
-def _merge_middle_substr_to_two_side(substrings: List[SubString]):
- substr_len = len(substrings)
- if substr_len <= 2:
+ substring_index += length
return substrings
- for index in range(substr_len - 2):
- left_block = substrings[index]
- middle_block = substrings[index + 1]
- right_block = substrings[index + 2]
- if left_block.lang == right_block.lang and left_block.lang != "x":
- if len(middle_block.text) <= 1 or middle_block.lang == "x":
- substrings[index + 1].lang = left_block.lang
- return substrings
+ def _is_middle_short_and_two_side_long(
+ self, left: SubString, middle: SubString, right: SubString
+ ):
+ return middle.length <= 3 and left.length + right.length >= 6
+ def _is_cur_short_and_near_long(self, cur: SubString, near: SubString):
+ return cur.length <= 2 and near.length >= 6 and near.lang == "zh"
-def _merge_two_side_substr_to_near(substrings: List[SubString]):
- # Left
- is_lang_x = substrings[0].lang == "x"
- is_digit = substrings[0].is_digit
- is_too_short = len(substrings[0].text) <= 1
+ # MARK: _merge_middle_substr_to_two_side
+ def _merge_middle_substr_to_two_side(self, substrings: List[SubString]):
+ substr_len = len(substrings)
+ if substr_len <= 2:
+ return substrings
+ for index in range(substr_len - 2):
+ left_block = substrings[index]
+ middle_block = substrings[index + 1]
+ right_block = substrings[index + 2]
- is_need_merge_to_right = is_lang_x or is_digit or is_too_short
+ if left_block.lang == right_block.lang and left_block.lang != "x":
+ # if any detector's result for the middle block includes the neighboring blocks' language, merge it
+
+ if (
+ left_block.lang in possible_detection_list(middle_block.text)
+ or self._is_middle_short_and_two_side_long(
+ left_block, middle_block, right_block
+ )
+ or middle_block.lang == "x"
+ ):
+ substrings[index + 1].lang = left_block.lang
+ return substrings
- if is_need_merge_to_right:
- substrings[0].lang = _find_nearest_lang_with_direction(
- substrings, 0, search_left=False
- )
- # Right
- is_lang_x = substrings[-1].lang == "x"
- is_digit = substrings[-1].is_digit
- is_too_short = len(substrings[-1].text) <= 1
+ # MARK: _merge_side_substr_to_near
+ def _merge_side_substr_to_near(self, substrings: List[SubString]):
+ # NOTE: Merge the leftmost substring
+ is_lang_x = substrings[0].lang == "x"
+ is_cur_short_and_near_long = False
+ is_possible_same_lang_with_near = False
+ if len(substrings) >= 2:
+ is_cur_short_and_near_long = self._is_cur_short_and_near_long(
+ substrings[0], substrings[1]
+ )
- is_need_merge_to_left = is_lang_x or is_digit or is_too_short
+ is_possible_same_lang_with_near = substrings[
+ 1
+ ].lang in possible_detection_list(substrings[0].text)
- if is_need_merge_to_left:
- substrings[-1].lang = _find_nearest_lang_with_direction(
- substrings, len(substrings) - 1, search_left=True
+ is_need_merge_to_right = (
+ is_lang_x or is_cur_short_and_near_long or is_possible_same_lang_with_near
)
- return substrings
+ if is_need_merge_to_right:
+ substrings[0].lang = self._get_nearest_lang_with_direction(
+ substrings, 0, search_left=False
+ )
+ # NOTE: Merge the rightmost substring
+ is_lang_x = substrings[-1].lang == "x"
+ is_cur_short_and_near_long = False
+ is_possible_same_lang_with_near = False
+ if len(substrings) >= 2:
+ is_cur_short_and_near_long = self._is_cur_short_and_near_long(
+ substrings[-1], substrings[-2]
+ )
+ is_possible_same_lang_with_near = substrings[
+ -2
+ ].lang in possible_detection_list(substrings[-1].text)
-def _fill_missing_languages(substrings: List[SubString]):
- for index, substr in enumerate(substrings):
- if substr.lang == "x":
- if index == 0:
- # For head substring, find right substring
- substrings[index].lang = _find_nearest_lang_with_direction(
- substrings, index, search_left=False
- )
- elif index == len(substrings) - 1:
- # For tail substring, find left substring
- substrings[index].lang = _find_nearest_lang_with_direction(
- substrings, index, search_left=True
- )
- else:
- # For body (middle) substring, find based on rule
- is_left = _get_find_direction(substrings, index)
- substrings[index].lang = _find_nearest_lang_with_direction(
- substrings, index, is_left
- )
- return substrings
-
-
-def _find_nearest_lang_with_direction(
- substrings: List[SubString], index: int, search_left: bool
-) -> str:
- if search_left:
- for i in range(1, len(substrings)):
- left_i_index = index - i
- if (
- left_i_index >= 0
- and substrings[left_i_index].lang != "x"
- and substrings[left_i_index].is_digit is False
- ):
- return substrings[left_i_index].lang
- else:
- for i in range(1, len(substrings)):
- right_i_index = index + i
- if (
- right_i_index < len(substrings)
- and substrings[right_i_index].lang != "x"
- and substrings[right_i_index].is_digit is False
- ):
- return substrings[right_i_index].lang
- return substrings[index].lang
-
-
-def _get_find_direction(substrings: List[SubString], index: int) -> bool:
- is_left = False
- if index == 0:
- is_left = False
- return is_left
- elif index == len(substrings) - 1:
- is_left = True
- return is_left
- left_block = substrings[index - 1]
- right_block = substrings[index + 1]
- if len(left_block.text) < len(right_block.text) or right_block.lang not in [
- "ja",
- "zh",
- ]:
- is_left = True
- else:
- is_left = False
- return is_left
-
+ is_need_merge_to_left = is_lang_x or is_cur_short_and_near_long
-def _merge_substrings(substrings: List[SubString]):
- smart_concat_result: List[SubString] = []
- lang = ""
- for block in substrings:
- cur_lang = block.lang
- if cur_lang != lang:
- smart_concat_result.append(block)
- else:
- smart_concat_result[-1].text += block.text
- smart_concat_result[-1].length += block.length
- lang = cur_lang
- return smart_concat_result
+ if is_need_merge_to_left:
+ substrings[-1].lang = self._get_nearest_lang_with_direction(
+ substrings, len(substrings) - 1, search_left=True
+ )
+ return substrings
+ # MARK: _fill_unknown_language
+ def _fill_unknown_language(
+ self,
+ substrings: List[SubString],
+ ):
+ for index, substr in enumerate(substrings):
+ if substr.lang == "x":
+ if index == 0:
+ # For head substring, find right substring
+ substrings[index].lang = self._get_nearest_lang_with_direction(
+ substrings, index, search_left=False
+ )
+ elif index == len(substrings) - 1:
+ # For tail substring, find left substring
+ substrings[index].lang = self._get_nearest_lang_with_direction(
+ substrings, index, search_left=True
+ )
+ else:
+ # For body (middle) substring, find based on rule
+ substrings[index].lang = self._get_nearest_lang_with_direction(
+ substrings, index, self._get_merge_direction(substrings, index)
+ )
+ return substrings
-def _merge_digit(substrings: List[SubString]) -> List[SubString]:
- new_substrings: List[SubString] = []
+ # MARK: _get_nearest_lang_with_direction
+ def _get_nearest_lang_with_direction(
+ self, substrings: List[SubString], index: int, search_left: bool
+ ) -> str:
+ if search_left:
+ for i in range(1, len(substrings)):
+ left_i_index = index - i
+ if (
+ left_i_index >= 0
+ and substrings[left_i_index].lang != "x"
+ and substrings[left_i_index].is_digit is False
+ ):
+ return substrings[left_i_index].lang
+ else:
+ for i in range(1, len(substrings)):
+ right_i_index = index + i
+ if (
+ right_i_index < len(substrings)
+ and substrings[right_i_index].lang != "x"
+ and substrings[right_i_index].is_digit is False
+ ):
+ return substrings[right_i_index].lang
+ return substrings[index].lang
+
+ # MARK: _get_merge_direction
+ def _get_merge_direction(self, substrings: List[SubString], index: int) -> bool:
+ is_left = False
+ if index == 0:
+ is_left = False
+ return is_left
+ elif index == len(substrings) - 1:
+ is_left = True
+ return is_left
+ left_block = substrings[index - 1]
+ right_block = substrings[index + 1]
+ if len(left_block.text) >= len(right_block.text):
+ is_left = True
+ else:
+ is_left = False
+ return is_left
- substr_len = len(substrings)
- if substr_len >= 3:
- for index in range(substr_len - 2):
- left_block = substrings[index]
- middle_block = substrings[index + 1]
- right_block = substrings[index + 2]
+ # MARK: _merge_substrings
+ def _merge_substrings(
+ self,
+ substrings: List[SubString],
+ ):
+ smart_concat_result: List[SubString] = []
+ lang = ""
+ for block in substrings:
+ cur_lang = block.lang
+ if cur_lang != lang:
+ smart_concat_result.append(block)
+ else:
+ smart_concat_result[-1].text += block.text
+ smart_concat_result[-1].length += block.length
+ lang = cur_lang
+ return smart_concat_result
+
+ # MARK: _merge_digit
+ def _merge_digit(
+ self,
+ substrings: List[SubString],
+ ) -> List[SubString]:
+ new_substrings: List[SubString] = []
+
+ substr_len = len(substrings)
+ if substr_len >= 3:
+ for index in range(substr_len - 2):
+ left_block = substrings[index]
+ middle_block = substrings[index + 1]
+ right_block = substrings[index + 2]
+
+ if (
+ left_block.lang == right_block.lang
+ and left_block.is_digit
+ and middle_block.is_punctuation
+ ):
+ substrings[index + 1].lang = left_block.lang
+ new_substrings = self._merge_substrings(substrings=substrings)
+ return new_substrings
+
+ # MARK: _merge_substring_across_digit
+ def _merge_substring_across_digit(
+ self,
+ substrings: List[SubString],
+ ) -> List[SubString]:
+ new_substrings: List[SubString] = []
+ left_digit_index = 0
+ is_left_has_digit = False
+
+ for index, substring in enumerate(substrings):
+ if substring.is_digit:
+ if index == 0:
+ is_left_has_digit = True
+ if new_substrings:
+ new_substrings[-1].text += substring.text
+ new_substrings[-1].length += substring.length
+ else:
+ if left_digit_index == 0:
+ left_digit_index = index
+ new_substrings.append(substring)
- if (
- left_block.lang == right_block.lang
- and left_block.is_digit
- and middle_block.is_punctuation
- ):
- substrings[index + 1].lang = left_block.lang
- new_substrings = _merge_substrings(substrings=substrings)
- return new_substrings
-
-
-def _merge_substring_across_digit(substrings: List[SubString]) -> List[SubString]:
- new_substrings: List[SubString] = []
- left_digit_index = 0
- is_left_has_digit = False
-
- for index, substring in enumerate(substrings):
- if substring.is_digit:
- if index == 0:
- is_left_has_digit = True
+ if is_left_has_digit:
+ left_digit_text = "".join(
+ substr.text for substr in substrings[0:left_digit_index]
+ )
if new_substrings:
- new_substrings[-1].text += substring.text
- new_substrings[-1].length += substring.length
- else:
- if left_digit_index == 0:
- left_digit_index = index
- new_substrings.append(substring)
-
- if is_left_has_digit:
- left_digit_text = "".join(
- substr.text for substr in substrings[0:left_digit_index]
- )
- if new_substrings:
- new_substrings[0].text = left_digit_text + new_substrings[0].text
- new_substrings[0].length = len(left_digit_text) + new_substrings[0].length
- else:
- new_substrings.append(
- SubString(
- is_digit=True,
- is_punctuation=False,
- text=left_digit_text,
- length=len(left_digit_text),
- lang="digit",
- index=0,
+ new_substrings[0].text = left_digit_text + new_substrings[0].text
+ new_substrings[0].length = (
+ len(left_digit_text) + new_substrings[0].length
)
- )
- new_substrings = _merge_substrings(substrings=new_substrings)
- return new_substrings
-
-
-def _merge_substrings_across_punctuation(
- substrings: List[SubString],
-) -> List[SubString]:
- new_substrings: List[SubString] = []
- lang = ""
- for substring in substrings:
- if substring.is_punctuation:
- if new_substrings and new_substrings[-1].lang == lang:
- new_substrings[-1].text += substring.text
- new_substrings[-1].length += substring.length
else:
- new_substrings.append(substring)
- else:
- if substring.lang != lang:
- new_substrings.append(substring)
+ new_substrings.append(
+ SubString(
+ is_digit=True,
+ is_punctuation=False,
+ text=left_digit_text,
+ length=len(left_digit_text),
+ lang="digit",
+ index=0,
+ )
+ )
+ new_substrings = self._merge_substrings(substrings=new_substrings)
+ return new_substrings
+
+ # MARK: _merge_substrings_across_punctuation
+ def _merge_substrings_across_punctuation(
+ self,
+ substrings: List[SubString],
+ ) -> List[SubString]:
+ new_substrings: List[SubString] = []
+ lang = ""
+ for substring in substrings:
+ if substring.is_punctuation:
+ if new_substrings and new_substrings[-1].lang == lang:
+ new_substrings[-1].text += substring.text
+ new_substrings[-1].length += substring.length
+ else:
+ new_substrings.append(substring)
else:
- new_substrings[-1].text += substring.text
- new_substrings[-1].length += substring.length
- lang = substring.lang if substring.lang != "punctuation" else lang
- return new_substrings
-
-
-# MARK: _get_languages
-def _get_languages(
- lang_text_list: List[SubString],
- lang_map: Dict = None,
- default_lang: str = DEFAULT_LANG,
-):
- if lang_map is None:
- lang_map = LANG_MAP
-
- for _, substr in enumerate(lang_text_list):
- if substr.is_punctuation or substr.is_digit:
- continue
- cur_lang = detect_lang_combined(substr.text)
- cur_lang = lang_map.get(cur_lang, default_lang)
-
- if cur_lang != "x":
- substr.lang = cur_lang
- return lang_text_list
-
-
-def _smart_concat_logic(
- lang_text_list: List[SubString], lang_map: Dict = None, default_lang: str = None
-):
-
- lang_text_list = _merge_middle_substr_to_two_side(lang_text_list)
- lang_text_list = _merge_substrings(lang_text_list)
- lang_text_list = _get_languages(
- lang_text_list=lang_text_list, lang_map=lang_map, default_lang="x"
- )
- lang_text_list = _merge_middle_substr_to_two_side(lang_text_list)
- lang_text_list = _fill_missing_languages(lang_text_list)
- lang_text_list = _merge_two_side_substr_to_near(lang_text_list)
- lang_text_list = _merge_substrings(lang_text_list)
- lang_text_list = _get_languages(
- lang_text_list=lang_text_list, lang_map=lang_map, default_lang=default_lang
- )
-
- return lang_text_list
+ if substring.lang != lang:
+ new_substrings.append(substring)
+ else:
+ new_substrings[-1].text += substring.text
+ new_substrings[-1].length += substring.length
+ lang = substring.lang if substring.lang != "punctuation" else lang
+ return new_substrings
+
+ # MARK: _get_languages
+ def _get_languages(
+ self,
+ lang_text_list: List[SubString],
+ lang_section_type: LangSectionType,
+ ):
+
+ if lang_section_type in [
+ LangSectionType.DIGIT,
+ LangSectionType.KO,
+ LangSectionType.PUNCTUATION,
+ ]:
+ return lang_text_list
+
+ for _, substr in enumerate(lang_text_list):
+ cur_lang = detect_lang_combined(
+ text=substr.text, lang_section_type=lang_section_type
+ )
+ cur_lang = self.lang_map.get(cur_lang, self.default_lang)
+
+ if cur_lang != "x":
+ substr.lang = cur_lang
+ return lang_text_list
+
+ def _smart_concat_logic(
+ self,
+ lang_text_list: List[SubString],
+ lang_section_type: LangSectionType,
+ ):
+
+ lang_text_list = self._merge_middle_substr_to_two_side(lang_text_list)
+ lang_text_list = self._merge_substrings(lang_text_list)
+ lang_text_list = self._get_languages(
+ lang_text_list=lang_text_list,
+ lang_section_type=lang_section_type,
+ )
+ lang_text_list = self._merge_middle_substr_to_two_side(lang_text_list)
+ lang_text_list = self._fill_unknown_language(lang_text_list)
+ lang_text_list = self._merge_side_substr_to_near(lang_text_list)
+ lang_text_list = self._merge_substrings(lang_text_list)
+ lang_text_list = self._get_languages(
+ lang_text_list=lang_text_list,
+ lang_section_type=lang_section_type,
+ )
+
+ return lang_text_list
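Note on the whitespace over-split: `_parse_without_zh_ja` in the diff above splits non-CJK text into words while keeping each run of whitespace attached to the word before it, so concatenating the pieces gives back the input. The sketch below restates that loop as a standalone function so it can be run in isolation. The name `split_keep_trailing_space` is hypothetical and is not part of the package.

```python
from typing import List


def split_keep_trailing_space(text: str) -> List[str]:
    """Over-split text into words; whitespace stays attached to the preceding word.

    Hypothetical standalone restatement of the loop in `_parse_without_zh_ja`.
    """
    words: List[str] = []
    chars: List[str] = []
    seen_space = False
    for char in text:
        if char.isspace():
            # Remember that the current word has ended; keep the whitespace with it.
            seen_space = True
        elif seen_space:
            # First non-space character after whitespace: flush the previous word.
            words.append("".join(chars))
            chars.clear()
            seen_space = False
        chars.append(char)
    if chars:
        words.append("".join(chars))
    return words


print(split_keep_trailing_space("how are you"))
```

Because no character is dropped, `"".join(split_keep_trailing_space(text)) == text` holds for any input, and this is what keeps the `index` and `length` fields of the later `SubString` objects consistent with the original text.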
diff --git a/split_lang/split/utils.py b/split_lang/split/utils.py
index 582ce20..3128381 100644
--- a/split_lang/split/utils.py
+++ b/split_lang/split/utils.py
@@ -1,35 +1,24 @@
import re
PUNCTUATION = r""",.;:!?,。!?;:、·([{<(【《〈「『“‘)]}>)】》〉」』”’"""
-DEFAULT_THRESHOLD = 1e-4
chinese_char_pattern = re.compile(r"[\u4e00-\u9fff]")
hangul_pattern = re.compile(r"[\uac00-\ud7af]")
-hiragana_pattern = re.compile(r"[\u3040-\u309f]")
-katakana_pattern = re.compile(r"[\u30a0-\u30ff]")
+hiragana_katakana_pattern = re.compile(r"[\u3040-\u30ff]")
+zh_ja_pattern = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff]")
-def _contains_chinese_char(text: str):
+def contains_chinese_char(text: str) -> bool:
return bool(chinese_char_pattern.search(text))
-def contains_hangul(text: str):
+def contains_hangul(text: str) -> bool:
return bool(hangul_pattern.search(text))
-def _contains_hiragana(text: str):
- return bool(hiragana_pattern.search(text))
+def contains_ja(text: str) -> bool:
+ return bool(hiragana_katakana_pattern.search(text))
-def _contains_katakana(text: str):
- return bool(katakana_pattern.search(text))
-
-
-def contains_zh_ja(text):
- if (
- _contains_chinese_char(text)
- or _contains_hiragana(text)
- or _contains_katakana(text)
- ):
- return True
- return False
+def contains_zh_ja(text: str) -> bool:
+ return bool(zh_ja_pattern.search(text))
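Note on the merged Unicode ranges: after this change `contains_ja` matches only kana (U+3040 to U+30FF), while `contains_zh_ja` also matches CJK ideographs (U+4E00 to U+9FFF). A short sketch of what that implies, with the two patterns copied from the diff above and sample strings chosen here for illustration:

```python
import re

# Patterns copied from the new split_lang/split/utils.py in this diff.
hiragana_katakana_pattern = re.compile(r"[\u3040-\u30ff]")
zh_ja_pattern = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff]")


def contains_ja(text: str) -> bool:
    # True only when the text contains at least one hiragana or katakana character.
    return bool(hiragana_katakana_pattern.search(text))


def contains_zh_ja(text: str) -> bool:
    # True when the text contains kana or a CJK ideograph.
    return bool(zh_ja_pattern.search(text))


for sample in ["アニメ", "日本人", "彼は", "hello"]:
    print(sample, contains_ja(sample), contains_zh_ja(sample))
```

A kanji-only string such as `日本人` is therefore reported as "not ja" by `contains_ja`, even though it may be Japanese. `_parse_zh_ja` depends on this: it treats kana-free substrings as candidates for the Chinese-side merge and leaves the final language decision to the detectors.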
diff --git a/tests/__init__.py b/tests/__init__.py
index 3649fdf..7255755 100644
--- a/tests/__init__.py
+++ b/tests/__init__.py
@@ -1 +1 @@
-from .data.test_data import texts_zh_jp_ko_en, texts_de_fr_en, TestData
+from .data.test_data import texts_zh_jp_ko_en, texts_de_fr_en
diff --git a/tests/data/correct_split_merge_punc.txt b/tests/data/correct_split_merge_punc.txt
index 7191df1..c48ee01 100644
--- a/tests/data/correct_split_merge_punc.txt
+++ b/tests/data/correct_split_merge_punc.txt
@@ -1,40 +1,43 @@
-我是 |VGroupChatBot,|一个旨在支持多人通信的助手,通过可视化消息来帮助团队成员更好地交流。我可以帮助团队成员更好地整理和共享信息,特别是在讨论、会议和|Brainstorming|等情况下。你好我的名字是|西野くまです|my name is bob|很高兴认识你|どうぞよろしくお願いいたします「こんにちは」|是什么意思。
-我的名字是|西野くまです。|I am from Tokyo, |日本の首都。|今天的天气非常好
-你好,|今日はどこへ行きますか?
-你好|今日はどこへ行きますか?
+你喜欢看|アニメ|吗?
我的名字是|田中さんです。
-我喜欢吃寿司和拉面|おいしいです。
+我的名字是田中|さんです。
+日语真是|おもしろい|啊
+衬衫的价格是9.15便士。
+衬衫的价格是233亿元。
+衬衫的价格是233亿元人民币。
+你最近好吗|最近どうですか?
+你最近好吗最近|どうですか?
+你最近好吗、|最近どうですか?
+你最近好吗、最近|どうですか?
+你好|今日はどこへ行きますか?
+你好今日|はどこへ行きますか?
今天|の天気はとてもいいですね。
我在学习|日本語少し難しいです。
我在学习日本語|少し難しいです。
-日语真是|おもしろい|啊
-你喜欢看|アニメ|吗?
-我想去日本旅行、|特に京都に行きたいです。
-昨天|見た映画はとても感動的でした。|我朋友是日本人|彼はとても優しいです。
-昨天|見た|映画|はとても感動的でした。|我朋友是日本人|彼はとても優しいです。
+我喜欢吃寿司和拉面|おいしいです。
+你会说几种语言、|何ヶ国語話せますか?
我们一起去|カラオケ|吧、|楽しそうです。
+我想去日本旅行、|特に京都に行きたいです。
+我想去|日本旅行、特に京都に行きたいです。
我的家在北京、|でも、仕事で東京に住んでいます。
+我昨天看了一本书、|その本はとても面白かったです。
我在学做日本料理、|日本料理を作るのを習っています。
我在学做|日本料理、日本料理を作るのを習っています。
-你会说几种语言、|何ヶ国語話せますか?
-我昨天看了一本书、|その本はとても面白かったです。
-你最近好吗、|最近どうですか?
-你最近好吗、最近|どうですか?
-你最近好吗|最近どうですか?
-你最近好吗最近|どうですか?
+你最近好吗、|最近どうですか|요즘 어떻게 지내요?
+你最近好吗、最近|どうですか|요즘 어떻게 지내요?
+我们一起去逛街|와 쇼핑、|買い物に行きましょう。|쇼핑하러 가요。
我在学做日本料理|와 한국 요리、|日本料理を作るのを習っています。
+我在学做|日本料理|와 한국 요리、|日本料理を作るのを習っています。
你会说几种语言、|何ヶ国語話せますか?|몇 개 언어를 할 수 있어요?
我昨天看了一本书、|その本はとても面白かったです。|어제 책을 읽었는데, 정말 재미있었어요。
-我们一起去逛街|와 쇼핑、|買い物に行きましょう。|쇼핑하러 가요。
-你最近好吗、|最近どうですか?|요즘 어떻게 지내요?
-你最近好吗、最近|どうですか?|요즘 어떻게 지내요?
+昨天|見た映画はとても感動的でした。|我朋友是日本人|彼はとても優しいです。
Bonjour, |wie geht's dir |today?
Vielen Dank |merci beaucoup |for your help.
Ich bin müde |je suis fatigué |and I need some rest.
Ich mag dieses Buch |ce livre est intéressant |and it has a great story.
Ich mag dieses Buch, |ce livre est intéressant, |and it has a great story.
-衬衫的价格是9.15便士。
-衬衫的价格是233亿元。
-衬衫的价格是233亿元人民币。
The shirt is 9.15 dollars.
-The shirt is 233 dollars.
\ No newline at end of file
+The shirt is 233 dollars.
+我是 |VGroupChatBot,|一个旨在支持多人通信的助手,通过可视化消息来帮助团队成员更好地交流。我可以帮助团队成员更好地整理和共享信息,特别是在讨论、会议和|Brainstorming|等情况下。你好我的名字是|西野くまです|my name is bob|很高兴认识你|どうぞよろしくお願いいたします「こんにちは」|是什么意思。
+我的名字是|西野くまです。|I am from Tokyo, |日本の首都。|今天的天气非常好
+我给你送的|手紙|你读了吗?
\ No newline at end of file
diff --git a/tests/data/de_fr_en.json b/tests/data/de_fr_en.json
deleted file mode 100644
index 97d2fb9..0000000
--- a/tests/data/de_fr_en.json
+++ /dev/null
@@ -1,1620 +0,0 @@
-{
- "Ich liebe Paris, c'est une belle ville, and the food is amazing!": [
- {
- "text": "Ich liebe Paris",
- "substrings": [
- {
- "lang": "de",
- "text": "Ich liebe Paris",
- "index": 0,
- "length": 15,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 15,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "c'est une belle ville",
- "substrings": [
- {
- "lang": "fr",
- "text": "c'est une belle ville",
- "index": 17,
- "length": 21,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 38,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "and the food is amazing",
- "substrings": [
- {
- "lang": "en",
- "text": "and the food is amazing",
- "index": 40,
- "length": 23,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "!",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "!",
- "index": 63,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Berlin ist wunderbar, je veux y retourner, and explore more.": [
- {
- "text": "Berlin ist wunderbar",
- "substrings": [
- {
- "lang": "de",
- "text": "Berlin ist wunderbar",
- "index": 0,
- "length": 20,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 20,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "je veux y retourner",
- "substrings": [
- {
- "lang": "hr",
- "text": "je ",
- "index": 22,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "veux ",
- "index": 25,
- "length": 5,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "cy",
- "text": "y ",
- "index": 30,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "retourner",
- "index": 32,
- "length": 9,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 41,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "and explore more",
- "substrings": [
- {
- "lang": "en",
- "text": "and ",
- "index": 43,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ro",
- "text": "explore ",
- "index": 47,
- "length": 8,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "more",
- "index": 55,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 59,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Bonjour, wie geht's dir today?": [
- {
- "text": "Bonjour",
- "substrings": [
- {
- "lang": "fr",
- "text": "Bonjour",
- "index": 0,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 7,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "wie geht's dir today",
- "substrings": [
- {
- "lang": "de",
- "text": "wie geht's ",
- "index": 9,
- "length": 11,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "tr",
- "text": "dir ",
- "index": 20,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "today",
- "index": 24,
- "length": 5,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "?",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "?",
- "index": 29,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Die Musik hier ist fantastisch, la musique est superbe, and I enjoy it a lot.": [
- {
- "text": "Die Musik hier ist fantastisch",
- "substrings": [
- {
- "lang": "de",
- "text": "Die Musik hier ist fantastisch",
- "index": 0,
- "length": 30,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 30,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "la musique est superbe",
- "substrings": [
- {
- "lang": "es",
- "text": "la ",
- "index": 32,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "musique est superbe",
- "index": 35,
- "length": 19,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 54,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "and I enjoy it a lot",
- "substrings": [
- {
- "lang": "en",
- "text": "and ",
- "index": 56,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "it",
- "text": "I ",
- "index": 60,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "enjoy it a lot",
- "index": 62,
- "length": 14,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 76,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Guten Morgen, je t'aime, have a great day!": [
- {
- "text": "Guten Morgen",
- "substrings": [
- {
- "lang": "de",
- "text": "Guten Morgen",
- "index": 0,
- "length": 12,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 12,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "je t'aime",
- "substrings": [
- {
- "lang": "hr",
- "text": "je ",
- "index": 14,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "t'aime",
- "index": 17,
- "length": 6,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 23,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "have a great day",
- "substrings": [
- {
- "lang": "en",
- "text": "have ",
- "index": 25,
- "length": 5,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "hu",
- "text": "a ",
- "index": 30,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "great ",
- "index": 32,
- "length": 6,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "so",
- "text": "day",
- "index": 38,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "!",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "!",
- "index": 41,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Das Wetter ist heute schön, il fait beau aujourd'hui, and it's perfect for a walk.": [
- {
- "text": "Das Wetter ist heute schön",
- "substrings": [
- {
- "lang": "de",
- "text": "Das Wetter ist heute schön",
- "index": 0,
- "length": 26,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 26,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "il fait beau aujourd'hui",
- "substrings": [
- {
- "lang": "it",
- "text": "il ",
- "index": 28,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "fait beau aujourd'",
- "index": 31,
- "length": 18,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "sw",
- "text": "hui",
- "index": 49,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 52,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "and it's perfect for a walk",
- "substrings": [
- {
- "lang": "en",
- "text": "and it's perfect for a walk",
- "index": 54,
- "length": 27,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 81,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Ich mag dieses Buch, ce livre est intéressant, and it has a great story.": [
- {
- "text": "Ich mag dieses Buch",
- "substrings": [
- {
- "lang": "de",
- "text": "Ich ",
- "index": 0,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "tl",
- "text": "mag ",
- "index": 4,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "de",
- "text": "dieses Buch",
- "index": 8,
- "length": 11,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 19,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "ce livre est intéressant",
- "substrings": [
- {
- "lang": "fr",
- "text": "ce livre est intéressant",
- "index": 21,
- "length": 24,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 45,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "and it has a great story",
- "substrings": [
- {
- "lang": "en",
- "text": "and ",
- "index": 47,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "sq",
- "text": "it ",
- "index": 51,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "has ",
- "index": 54,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "hu",
- "text": "a ",
- "index": 58,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "great story",
- "index": 60,
- "length": 11,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 71,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Vielen Dank, merci beaucoup, for your help.": [
- {
- "text": "Vielen Dank",
- "substrings": [
- {
- "lang": "de",
- "text": "Vielen ",
- "index": 0,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "so",
- "text": "Dank",
- "index": 7,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 11,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "merci beaucoup",
- "substrings": [
- {
- "lang": "fr",
- "text": "merci beaucoup",
- "index": 13,
- "length": 14,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 27,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "for your help",
- "substrings": [
- {
- "lang": "en",
- "text": "for your help",
- "index": 29,
- "length": 13,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 42,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Wir reisen nach Deutschland, nous voyageons en Allemagne, and we are excited.": [
- {
- "text": "Wir reisen nach Deutschland",
- "substrings": [
- {
- "lang": "de",
- "text": "Wir reisen nach Deutschland",
- "index": 0,
- "length": 27,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 27,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "nous voyageons en Allemagne",
- "substrings": [
- {
- "lang": "fr",
- "text": "nous voyageons en Allemagne",
- "index": 29,
- "length": 27,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 56,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "and we are excited",
- "substrings": [
- {
- "lang": "en",
- "text": "and we are excited",
- "index": 58,
- "length": 18,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 76,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Ich bin müde, je suis fatigué, and I need some rest.": [
- {
- "text": "Ich bin müde",
- "substrings": [
- {
- "lang": "de",
- "text": "Ich bin müde",
- "index": 0,
- "length": 12,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 12,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "je suis fatigué",
- "substrings": [
- {
- "lang": "hr",
- "text": "je ",
- "index": 14,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "suis fatigué",
- "index": 17,
- "length": 12,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 29,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "and I need some rest",
- "substrings": [
- {
- "lang": "en",
- "text": "and ",
- "index": 31,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "it",
- "text": "I ",
- "index": 35,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "need some rest",
- "index": 37,
- "length": 14,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 51,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Ich liebe Paris c'est une belle ville and the food is amazing!": [
- {
- "text": "Ich liebe Paris c'est une belle ville and the food is amazing",
- "substrings": [
- {
- "lang": "de",
- "text": "Ich liebe Paris ",
- "index": 0,
- "length": 16,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "c'est une belle ville and the food is ",
- "index": 16,
- "length": 38,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "amazing",
- "index": 54,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "!",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "!",
- "index": 61,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Berlin ist wunderbar je veux y retourner and explore more.": [
- {
- "text": "Berlin ist wunderbar je veux y retourner and explore more",
- "substrings": [
- {
- "lang": "de",
- "text": "Berlin ist wunderbar ",
- "index": 0,
- "length": 21,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "je veux y ",
- "index": 21,
- "length": 10,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "retourner and explore more",
- "index": 31,
- "length": 26,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 57,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Bonjour wie geht's dir today?": [
- {
- "text": "Bonjour wie geht's dir today",
- "substrings": [
- {
- "lang": "fr",
- "text": "Bonjour ",
- "index": 0,
- "length": 8,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "de",
- "text": "wie geht's dir ",
- "index": 8,
- "length": 15,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "today",
- "index": 23,
- "length": 5,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "?",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "?",
- "index": 28,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Die Musik hier ist fantastisch la musique est superbe and I enjoy it a lot.": [
- {
- "text": "Die Musik hier ist fantastisch la musique est superbe and I enjoy it a lot",
- "substrings": [
- {
- "lang": "de",
- "text": "Die Musik hier ist fantastisch ",
- "index": 0,
- "length": 31,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "la musique est superbe ",
- "index": 31,
- "length": 23,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "and ",
- "index": 54,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "it",
- "text": "I ",
- "index": 58,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "enjoy it a lot",
- "index": 60,
- "length": 14,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 74,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Guten Morgen je t'aime have a great day!": [
- {
- "text": "Guten Morgen je t'aime have a great day",
- "substrings": [
- {
- "lang": "de",
- "text": "Guten Morgen ",
- "index": 0,
- "length": 13,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "je t'aime ",
- "index": 13,
- "length": 10,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "have a great ",
- "index": 23,
- "length": 13,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "so",
- "text": "day",
- "index": 36,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "!",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "!",
- "index": 39,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Das Wetter ist heute schön il fait beau aujourd'hui and it's perfect for a walk.": [
- {
- "text": "Das Wetter ist heute schön il fait beau aujourd'hui and it's perfect for a walk",
- "substrings": [
- {
- "lang": "de",
- "text": "Das Wetter ist heute schön ",
- "index": 0,
- "length": 27,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "il fait beau aujourd'",
- "index": 27,
- "length": 21,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "de",
- "text": "hui ",
- "index": 48,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "and it's perfect for a walk",
- "index": 52,
- "length": 27,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 79,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Ich mag dieses Buch ce livre est intéressant and it has a great story.": [
- {
- "text": "Ich mag dieses Buch ce livre est intéressant and it has a great story",
- "substrings": [
- {
- "lang": "de",
- "text": "Ich mag dieses Buch ",
- "index": 0,
- "length": 20,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "ce livre est intéressant and it has a great story",
- "index": 20,
- "length": 49,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 69,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Vielen Dank merci beaucoup for your help.": [
- {
- "text": "Vielen Dank merci beaucoup for your help",
- "substrings": [
- {
- "lang": "de",
- "text": "Vielen ",
- "index": 0,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "af",
- "text": "Dank ",
- "index": 7,
- "length": 5,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "merci beaucoup for your help",
- "index": 12,
- "length": 28,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 40,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Wir reisen nach Deutschland nous voyageons en Allemagne and we are excited.": [
- {
- "text": "Wir reisen nach Deutschland nous voyageons en Allemagne and we are excited",
- "substrings": [
- {
- "lang": "de",
- "text": "Wir reisen nach Deutschland ",
- "index": 0,
- "length": 28,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "nous voyageons en Allemagne ",
- "index": 28,
- "length": 28,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "and we are excited",
- "index": 56,
- "length": 18,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 74,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "Ich bin müde je suis fatigué and I need some rest.": [
- {
- "text": "Ich bin müde je suis fatigué and I need some rest",
- "substrings": [
- {
- "lang": "de",
- "text": "Ich bin müde ",
- "index": 0,
- "length": 13,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "fr",
- "text": "je suis fatigué ",
- "index": 13,
- "length": 16,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "and ",
- "index": 29,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "it",
- "text": "I ",
- "index": 33,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "need some rest",
- "index": 35,
- "length": 14,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ".",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ".",
- "index": 49,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ]
-}
\ No newline at end of file
diff --git a/tests/data/generate_test_json.py b/tests/data/generate_test_json.py
deleted file mode 100644
index fbd0594..0000000
--- a/tests/data/generate_test_json.py
+++ /dev/null
@@ -1,83 +0,0 @@
-import json
-import os
-from typing import Dict, List
-
-from split_lang import split
-from split_lang.split.splitter import SubStringSection, TextSplitter
-from split_lang.split.utils import DEFAULT_THRESHOLD
-from tests.data.test_data import TestData, texts_de_fr_en, texts_zh_jp_ko_en
-from tests.test_config import TEST_DATA_FOLDER
-
-
-def generate_test_data(data: TestData):
-    # Create the tests/data directory if it doesn't exist
-    os.makedirs(TEST_DATA_FOLDER, exist_ok=True)
-
-    result: List[Dict[str, List[SubStringSection]]] = []
-    for text in data.texts:
-        section = split(
-            text=text,
-            threshold=data.threshold,
-            splitter=data.splitter,
-            lang_map=data.lang_map,
-            default_lang=data.default_lang,
-            verbose=True,
-        )
-        result.append({text: section})
-
-    # Convert result to JSON serializable format using pydantic
-    json_result = result
-
-    # Write the result to a JSON file using pydantic's .json() method
-    with open(
-        os.path.join(TEST_DATA_FOLDER, f"{data.filename}.json"), "w", encoding="utf-8"
-    ) as f:
-        f.write(
-            json.dumps(
-                {
-                    list(sections.keys())[0]: [
-                        section.model_dump()
-                        for section in sections[list(sections.keys())[0]]
-                    ]
-                    for sections in json_result
-                },
-                ensure_ascii=False,
-                indent=4,
-            )
-        )
-
-
-def main():
-    splitter = TextSplitter()
-    zh_jp_ko_en_lang_map = {
-        "zh": "zh",
-        "zh-cn": "zh",
-        "zh-tw": "x",
-        "ko": "ko",
-        "ja": "ja",
-    }
-
-    data = TestData(
-        filename="zh_jp_ko_en",
-        texts=texts_zh_jp_ko_en,
-        threshold=DEFAULT_THRESHOLD,
-        splitter=splitter,
-        lang_map=zh_jp_ko_en_lang_map,
-        default_lang="en",
-    )
-    generate_test_data(data=data)
-
-    # data = TestData(
-    # filename="de_fr_en",
-    # texts=texts_de_fr_en,
-    # threshold=4.9e-4,
-    # splitter=splitter,
-    # lang_map=None,
-    # default_lang="x",
-    # )
-    # generate_test_data(data=data)
-    return
-
-
-if __name__ == "__main__":
-    main()
diff --git a/tests/data/test_data.py b/tests/data/test_data.py
index 4ec89fd..9435e3f 100644
--- a/tests/data/test_data.py
+++ b/tests/data/test_data.py
@@ -1,21 +1,3 @@
-from pydantic import BaseModel
-from typing import Dict, List, Optional
-from split_lang.split.splitter import TextSplitter
-from split_lang.detect_lang.detector import DEFAULT_LANG
-
-
-class TestData(BaseModel):
-    filename: str
-    texts: List[str]
-    threshold: float
-    splitter: TextSplitter
-    lang_map: Optional[Dict]
-    default_lang: str = DEFAULT_LANG
-
-    class Config:
-        arbitrary_types_allowed = True
-
-
texts_with_digit = [
     "你喜欢看アニメ吗?",
     "衬衫的价格是9.15便士",
diff --git a/tests/data/zh_jp_ko_en.json b/tests/data/zh_jp_ko_en.json
deleted file mode 100644
index c4b94ee..0000000
--- a/tests/data/zh_jp_ko_en.json
+++ /dev/null
@@ -1,2420 +0,0 @@
-{
- "我是 VGroupChatBot,一个旨在支持多人通信的助手,通过可视化消息来帮助团队成员更好地交流。我可以帮助团队成员更好地整理和共享信息,特别是在讨论、会议和Brainstorming等情况下。你好我的名字是西野くまですmy name is bob很高兴认识你どうぞよろしくお願いいたします「こんにちは」是什么意思。": [
- {
- "text": "我是 ",
- "substrings": [
- {
- "lang": "zh",
- "text": "我是 ",
- "index": 0,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "VGroupChatBot",
- "substrings": [
- {
- "lang": "en",
- "text": "VGroupChatBot",
- "index": 3,
- "length": 13,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ",",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ",",
- "index": 16,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "一个旨在支持多人通信的助手",
- "substrings": [
- {
- "lang": "zh",
- "text": "一个旨在支持多人通信的助手",
- "index": 17,
- "length": 13,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ",",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ",",
- "index": 30,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "通过可视化消息来帮助团队成员更好地交流",
- "substrings": [
- {
- "lang": "zh",
- "text": "通过可视化消息来帮助团队成员更好地交流",
- "index": 31,
- "length": 19,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 50,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "我可以帮助团队成员更好地整理和共享信息",
- "substrings": [
- {
- "lang": "zh",
- "text": "我可以帮助团队成员更好地整理和共享信息",
- "index": 51,
- "length": 19,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ",",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ",",
- "index": 70,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "特别是在讨论",
- "substrings": [
- {
- "lang": "zh",
- "text": "特别是在讨论",
- "index": 71,
- "length": 6,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 77,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "会议和",
- "substrings": [
- {
- "lang": "zh",
- "text": "会议和",
- "index": 78,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "Brainstorming",
- "substrings": [
- {
- "lang": "en",
- "text": "Brainstorming",
- "index": 81,
- "length": 13,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "等情况下",
- "substrings": [
- {
- "lang": "zh",
- "text": "等情况下",
- "index": 94,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 98,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "你好我的名字是西野くまです",
- "substrings": [
- {
- "lang": "zh",
- "text": "你好我的",
- "index": 99,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ja",
- "text": "名字是西野くまです",
- "index": 103,
- "length": 9,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "my name is bob",
- "substrings": [
- {
- "lang": "en",
- "text": "my name is bob",
- "index": 112,
- "length": 14,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "很高兴认识你どうぞよろしくお願いいたします",
- "substrings": [
- {
- "lang": "zh",
- "text": "很高兴认识你",
- "index": 126,
- "length": 6,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ja",
- "text": "どうぞよろしくお願いいたします",
- "index": 132,
- "length": 15,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "「",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "「",
- "index": 147,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "こんにちは",
- "substrings": [
- {
- "lang": "ja",
- "text": "こんにちは",
- "index": 148,
- "length": 5,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "」",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "」",
- "index": 153,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "是什么意思",
- "substrings": [
- {
- "lang": "zh",
- "text": "是什么意思",
- "index": 154,
- "length": 5,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 159,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "你好,我的名字是西野くまです。I am from Tokyo, 日本の首都。今天的天气非常好,sky is clear and sunny。おはようございます、皆さん!我们一起来学习吧。Learning languages can be fun and exciting。昨日はとても忙しかったので、今日は少しリラックスしたいです。Let's take a break and enjoy some coffee。中文、日本語、and English are three distinct languages, each with its own unique charm。希望我们能一起进步,一起成长。Let's keep studying and improving our language skills together. ありがとう!": [
- {
- "text": "你好",
- "substrings": [
- {
- "lang": "zh",
- "text": "你好",
- "index": 0,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ",",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ",",
- "index": 2,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "我的名字是西野くまです",
- "substrings": [
- {
- "lang": "zh",
- "text": "我的",
- "index": 3,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ja",
- "text": "名字是西野くまです",
- "index": 5,
- "length": 9,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 14,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "I am from Tokyo",
- "substrings": [
- {
- "lang": "en",
- "text": "I am from Tokyo",
- "index": 15,
- "length": 15,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 30,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "日本の首都",
- "substrings": [
- {
- "lang": "ja",
- "text": "日本の",
- "index": 32,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "zh",
- "text": "首都",
- "index": 35,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 37,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "今天的天气非常好",
- "substrings": [
- {
- "lang": "zh",
- "text": "今天的天气非常好",
- "index": 38,
- "length": 8,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ",",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ",",
- "index": 46,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "sky is clear and sunny",
- "substrings": [
- {
- "lang": "ja",
- "text": "sky ",
- "index": 47,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "en",
- "text": "is clear and sunny",
- "index": 51,
- "length": 18,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 69,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "おはようございます",
- "substrings": [
- {
- "lang": "ja",
- "text": "おはようございます",
- "index": 70,
- "length": 9,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 79,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "皆さん",
- "substrings": [
- {
- "lang": "ja",
- "text": "皆さん",
- "index": 80,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "!",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "!",
- "index": 83,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "我们一起来学习吧",
- "substrings": [
- {
- "lang": "zh",
- "text": "我们一起来学习吧",
- "index": 84,
- "length": 8,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 92,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "Learning languages can be fun and exciting",
- "substrings": [
- {
- "lang": "en",
- "text": "Learning languages can be fun and exciting",
- "index": 93,
- "length": 42,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 135,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "昨日はとても忙しかったので",
- "substrings": [
- {
- "lang": "ja",
- "text": "昨日はとても忙しかったので",
- "index": 136,
- "length": 13,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 149,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "今日は少しリラックスしたいです",
- "substrings": [
- {
- "lang": "ja",
- "text": "今日は少しリラックスしたいです",
- "index": 150,
- "length": 15,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 165,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "Let's take a break and enjoy some coffee",
- "substrings": [
- {
- "lang": "en",
- "text": "Let's take a break and enjoy some coffee",
- "index": 166,
- "length": 40,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 206,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "中文",
- "substrings": [
- {
- "lang": "zh",
- "text": "中文",
- "index": 207,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 209,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "日本語",
- "substrings": [
- {
- "lang": "zh-tw",
- "text": "日本語",
- "index": 210,
- "length": 3,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 213,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "and English are three distinct languages",
- "substrings": [
- {
- "lang": "en",
- "text": "and English are three distinct languages",
- "index": 214,
- "length": 40,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 254,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "each with its own unique charm",
- "substrings": [
- {
- "lang": "en",
- "text": "each with its own unique charm",
- "index": 256,
- "length": 30,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 286,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "希望我们能一起进步",
- "substrings": [
- {
- "lang": "zh",
- "text": "希望我们能一起进步",
- "index": 287,
- "length": 9,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ",",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ",",
- "index": 296,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "一起成长",
- "substrings": [
- {
- "lang": "zh",
- "text": "一起成长",
- "index": 297,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 301,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "Let's keep studying and improving our language skills together",
- "substrings": [
- {
- "lang": "en",
- "text": "Let's keep studying and improving our language skills together",
- "index": 302,
- "length": 62,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ". ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ". ",
- "index": 364,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "ありがとう",
- "substrings": [
- {
- "lang": "ja",
- "text": "ありがとう",
- "index": 366,
- "length": 5,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "!",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "!",
- "index": 371,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "你好,今日はどこへ行きますか?": [
- {
- "text": "你好",
- "substrings": [
- {
- "lang": "zh",
- "text": "你好",
- "index": 0,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ",",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ",",
- "index": 2,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "今日はどこへ行きますか",
- "substrings": [
- {
- "lang": "ja",
- "text": "今日はどこへ行きますか",
- "index": 3,
- "length": 11,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "?",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "?",
- "index": 14,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "你好今日はどこへ行きますか?": [
- {
- "text": "你好今日はどこへ行きますか",
- "substrings": [
- {
- "lang": "zh",
- "text": "你好",
- "index": 0,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ja",
- "text": "今日はどこへ行きますか",
- "index": 2,
- "length": 11,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "?",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "?",
- "index": 13,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "我的名字是田中さんです。": [
- {
- "text": "我的名字是田中さんです",
- "substrings": [
- {
- "lang": "zh",
- "text": "我的",
- "index": 0,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ja",
- "text": "名字是田中さんです",
- "index": 2,
- "length": 9,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 11,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "我喜欢吃寿司和拉面おいしいです。": [
- {
- "text": "我喜欢吃寿司和拉面おいしいです",
- "substrings": [
- {
- "lang": "en",
- "text": "我喜",
- "index": 0,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "zh",
- "text": "欢吃寿司和拉面",
- "index": 2,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ja",
- "text": "おいしいです",
- "index": 9,
- "length": 6,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 15,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "今天の天気はとてもいいですね。": [
- {
- "text": "今天の天気はとてもいいですね",
- "substrings": [
- {
- "lang": "ja",
- "text": "今天の天気はとてもいいですね",
- "index": 0,
- "length": 14,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 14,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "我在学习日本語少し難しいです。": [
- {
- "text": "我在学习日本語少し難しいです",
- "substrings": [
- {
- "lang": "zh",
- "text": "我在学习",
- "index": 0,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ja",
- "text": "日",
- "index": 4,
- "length": 1,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "zh-tw",
- "text": "本語",
- "index": 5,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ja",
- "text": "少し難しいです",
- "index": 7,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 14,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "日语真是おもしろい啊": [
- {
- "text": "日语真是おもしろい啊",
- "substrings": [
- {
- "lang": "zh",
- "text": "日语真是",
- "index": 0,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ja",
- "text": "おもしろい啊",
- "index": 4,
- "length": 6,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "你喜欢看アニメ吗?": [
- {
- "text": "你喜欢看アニメ吗",
- "substrings": [
- {
- "lang": "zh",
- "text": "你喜欢看アニメ吗",
- "index": 0,
- "length": 8,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "?",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "?",
- "index": 8,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "我想去日本旅行、特に京都に行きたいです。": [
- {
- "text": "我想去日本旅行",
- "substrings": [
- {
- "lang": "zh",
- "text": "我想去日本旅行",
- "index": 0,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 7,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "特に京都に行きたいです",
- "substrings": [
- {
- "lang": "ja",
- "text": "特に京都に行きたいです",
- "index": 8,
- "length": 11,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 19,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "昨天見た映画はとても感動的でした。我朋友是日本人彼はとても優しいです。": [
- {
- "text": "昨天見た映画はとても感動的でした",
- "substrings": [
- {
- "lang": "zh",
- "text": "昨天",
- "index": 0,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "zh-tw",
- "text": "見",
- "index": 2,
- "length": 1,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ja",
- "text": "た",
- "index": 3,
- "length": 1,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "zh",
- "text": "映画",
- "index": 4,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ja",
- "text": "はとても感動的でした",
- "index": 6,
- "length": 10,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 16,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "我朋友是日本人彼はとても優しいです",
- "substrings": [
- {
- "lang": "en",
- "text": "我朋",
- "index": 17,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "zh",
- "text": "友是日本人",
- "index": 19,
- "length": 5,
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "lang": "ja",
- "text": "彼はとても優しいです",
- "index": 24,
- "length": 10,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 34,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "我们一起去カラオケ吧、楽しそうです。": [
- {
- "text": "我们一起去カラオケ吧",
- "substrings": [
- {
- "lang": "zh",
- "text": "我们一起去カラオケ吧",
- "index": 0,
- "length": 10,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 10,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "楽しそうです",
- "substrings": [
- {
- "lang": "ja",
- "text": "楽しそうです",
- "index": 11,
- "length": 6,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 17,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "我的家在北京、でも、仕事で東京に住んでいます。": [
- {
- "text": "我的家在北京",
- "substrings": [
- {
- "lang": "zh",
- "text": "我的家在北京",
- "index": 0,
- "length": 6,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 6,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "でも",
- "substrings": [
- {
- "lang": "ja",
- "text": "でも",
- "index": 7,
- "length": 2,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 9,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "仕事で東京に住んでいます",
- "substrings": [
- {
- "lang": "ja",
- "text": "仕事で東京に住んでいます",
- "index": 10,
- "length": 12,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 22,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "我在学做日本料理、日本料理を作るのを習っています。": [
- {
- "text": "我在学做日本料理",
- "substrings": [
- {
- "lang": "zh",
- "text": "我在学做日本料理",
- "index": 0,
- "length": 8,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 8,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "日本料理を作るのを習っています",
- "substrings": [
- {
- "lang": "ja",
- "text": "日本料理を作るのを習っています",
- "index": 9,
- "length": 15,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 24,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "你会说几种语言、何ヶ国語話せますか?": [
- {
- "text": "你会说几种语言",
- "substrings": [
- {
- "lang": "zh",
- "text": "你会说几种语言",
- "index": 0,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 7,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "何ヶ国語話せますか",
- "substrings": [
- {
- "lang": "ja",
- "text": "何ヶ国語話せますか",
- "index": 8,
- "length": 9,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "?",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "?",
- "index": 17,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "我昨天看了一本书、その本はとても面白かったです。": [
- {
- "text": "我昨天看了一本书",
- "substrings": [
- {
- "lang": "zh",
- "text": "我昨天看了一本书",
- "index": 0,
- "length": 8,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 8,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "その本はとても面白かったです",
- "substrings": [
- {
- "lang": "ja",
- "text": "その本はとても面白かったです",
- "index": 9,
- "length": 14,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 23,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "你最近好吗、最近どうですか?": [
- {
- "text": "你最近好吗",
- "substrings": [
- {
- "lang": "zh",
- "text": "你最近好吗",
- "index": 0,
- "length": 5,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 5,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "最近どうですか",
- "substrings": [
- {
- "lang": "ja",
- "text": "最近どうですか",
- "index": 6,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "?",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "?",
- "index": 13,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "我在学做日本料理와 한국 요리、日本料理を作るのを習っています。": [
- {
- "text": "我在学做日本料理",
- "substrings": [
- {
- "lang": "zh",
- "text": "我在学做日本料理",
- "index": 0,
- "length": 8,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "와 한국 요리",
- "substrings": [
- {
- "lang": "ko",
- "text": "와 한국 요리",
- "index": 8,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 15,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "日本料理を作るのを習っています",
- "substrings": [
- {
- "lang": "ja",
- "text": "日本料理を作るのを習っています",
- "index": 16,
- "length": 15,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 31,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "你会说几种语言、何ヶ国語話せますか?몇 개 언어를 할 수 있어요?": [
- {
- "text": "你会说几种语言",
- "substrings": [
- {
- "lang": "zh",
- "text": "你会说几种语言",
- "index": 0,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 7,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "何ヶ国語話せますか",
- "substrings": [
- {
- "lang": "ja",
- "text": "何ヶ国語話せますか",
- "index": 8,
- "length": 9,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "?",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "?",
- "index": 17,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "몇 개 언어를 할 수 있어요",
- "substrings": [
- {
- "lang": "ko",
- "text": "몇 개 언어를 할 수 있어요",
- "index": 18,
- "length": 15,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "?",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "?",
- "index": 33,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "我昨天看了一本书、その本はとても面白かったです。어제 책을 읽었는데, 정말 재미있었어요。": [
- {
- "text": "我昨天看了一本书",
- "substrings": [
- {
- "lang": "zh",
- "text": "我昨天看了一本书",
- "index": 0,
- "length": 8,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 8,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "その本はとても面白かったです",
- "substrings": [
- {
- "lang": "ja",
- "text": "その本はとても面白かったです",
- "index": 9,
- "length": 14,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 23,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "어제 책을 읽었는데",
- "substrings": [
- {
- "lang": "ko",
- "text": "어제 책을 읽었는데",
- "index": 24,
- "length": 10,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": ", ",
- "substrings": [
- {
- "lang": "punctuation",
- "text": ", ",
- "index": 34,
- "length": 2,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "정말 재미있었어요",
- "substrings": [
- {
- "lang": "ko",
- "text": "정말 재미있었어요",
- "index": 36,
- "length": 9,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 45,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "我们一起去逛街와 쇼핑、買い物に行きましょう。쇼핑하러 가요。": [
- {
- "text": "我们一起去逛街",
- "substrings": [
- {
- "lang": "zh",
- "text": "我们一起去逛街",
- "index": 0,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "와 쇼핑",
- "substrings": [
- {
- "lang": "ko",
- "text": "와 쇼핑",
- "index": 7,
- "length": 4,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 11,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "買い物に行きましょう",
- "substrings": [
- {
- "lang": "ja",
- "text": "買い物に行きましょう",
- "index": 12,
- "length": 10,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 22,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "쇼핑하러 가요",
- "substrings": [
- {
- "lang": "ko",
- "text": "쇼핑하러 가요",
- "index": 23,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "。",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "。",
- "index": 30,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "你最近好吗、最近どうですか?요즘 어떻게 지내요?": [
- {
- "text": "你最近好吗",
- "substrings": [
- {
- "lang": "zh",
- "text": "你最近好吗",
- "index": 0,
- "length": 5,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "、",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "、",
- "index": 5,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "最近どうですか",
- "substrings": [
- {
- "lang": "ja",
- "text": "最近どうですか",
- "index": 6,
- "length": 7,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "?",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "?",
- "index": 13,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- },
- {
- "text": "요즘 어떻게 지내요",
- "substrings": [
- {
- "lang": "ko",
- "text": "요즘 어떻게 지내요",
- "index": 14,
- "length": 10,
- "is_punctuation": false,
- "is_digit": false
- }
- ],
- "is_punctuation": false,
- "is_digit": false
- },
- {
- "text": "?",
- "substrings": [
- {
- "lang": "punctuation",
- "text": "?",
- "index": 24,
- "length": 1,
- "is_punctuation": true,
- "is_digit": false
- }
- ],
- "is_punctuation": true,
- "is_digit": false
- }
- ]
-}
\ No newline at end of file
diff --git a/tests/split_acc.py b/tests/split_acc.py
index eeef300..5251258 100644
--- a/tests/split_acc.py
+++ b/tests/split_acc.py
@@ -1,16 +1,19 @@
+import time
from typing import List
-from split_lang import split_by_lang
-from split_lang.split.splitter import SubString, TextSplitter, _get_languages
-from split_lang.split.utils import PUNCTUATION, DEFAULT_THRESHOLD
-from split_lang.detect_lang.detector import DEFAULT_LANG
+from split_lang.config import DEFAULT_LANG
+from split_lang.model import LangSectionType
+from split_lang.split.splitter import (
+ SubString,
+ LangSplitter,
+)
+from split_lang.split.utils import PUNCTUATION
from tests.test_config import TEST_DATA_FOLDER
-from wtpsplit import SaT, WtP
-
-import time
-def get_corrected_split_result(text_file_path: str) -> List[List[SubString]]:
+def get_corrected_split_result(
+ splitter: LangSplitter, text_file_path: str
+) -> List[List[SubString]]:
"""
# 1. split by `|`
# 2. convert to SubString, concat to list
@@ -47,28 +50,20 @@ def get_corrected_split_result(text_file_path: str) -> List[List[SubString]]:
)
current_index += len(substring)
- substring_objects = _get_languages(
+ substring_objects = splitter._get_languages(
lang_text_list=substring_objects,
- default_lang="en",
+ lang_section_type=LangSectionType.ALL,
)
corrected_split_result.append(substring_objects)
return corrected_split_result
-# splitter = TextSplitter()
-sat = SaT("sat-1l-sm")
-sat.half().to("cuda")
-wtp = WtP("wtp-bert-mini")
-# wtp.half().to("cuda")
-splitter = TextSplitter(wtp_split_model=wtp)
-
-
-def simple_test(threshold: float, verbose: bool = False):
+def simple_test(splitter: LangSplitter, debug: bool = False):
text_file_name = "correct_split_merge_punc.txt"
correct_split = get_corrected_split_result(
- text_file_path=f"{TEST_DATA_FOLDER}/{text_file_name}"
+ splitter=splitter, text_file_path=f"{TEST_DATA_FOLDER}/{text_file_name}"
)
correct_total_substring_len = 0
test_total_substring_len = 0
@@ -90,11 +85,8 @@ def simple_test(threshold: float, verbose: bool = False):
original_strings.append(original_string)
# print(original_string)
- test_split_substrings = split_by_lang(
+ test_split_substrings = splitter.split_by_lang(
text=original_string,
- splitter=splitter,
- threshold=threshold,
- merge_across_punctuation=True,
)
test_split.append(test_split_substrings)
test_total_substring_len += len(test_split_substrings)
@@ -114,7 +106,7 @@ def simple_test(threshold: float, verbose: bool = False):
correct_split_num += 1
current_correct_num += 1
break
- if verbose:
+ if debug:
print(f"correct_substrings : {correct_substrings_text}")
print(f"test_split_substrings: {test_split_substrings_text}")
print(
@@ -126,7 +118,7 @@ def simple_test(threshold: float, verbose: bool = False):
precision = correct_split_num / correct_total_substring_len
recall = correct_split_num / test_total_substring_len
f1_score = 2 * precision * recall / (precision + recall)
- if verbose:
+ if debug:
print(f"total substring num: {correct_total_substring_len}")
print(f"test total substring num: {test_total_substring_len}")
print(f"text acc num: {correct_split_num}")
@@ -138,15 +130,15 @@ def simple_test(threshold: float, verbose: bool = False):
return precision
-def find_best_threshold():
+def find_best_threshold(splitter: LangSplitter):
best_score = 0
best_threshold = 0
for times in range(5):
for i in range(1, 10):
zeros = "0" * times
threshold = float(f"0.{zeros}{str(i)}")
- score = simple_test(threshold=threshold, verbose=False)
- if score >= best_score:
+ score = simple_test(splitter=splitter, debug=False)
+ if score > best_score:
best_score = score
best_threshold = threshold
print(f"updated: best_f1_score: {best_score}")
@@ -158,8 +150,9 @@ def find_best_threshold():
def main():
- # find_best_threshold()
- simple_test(threshold=DEFAULT_THRESHOLD, verbose=True)
+ splitter = LangSplitter(merge_across_punctuation=True)
+ # find_best_threshold(splitter=splitter)
+ simple_test(splitter=splitter, debug=True)
if __name__ == "__main__":