Add more test text examples including different languages #5

DoodleBears · 2024-07-07T11:28:01Z

Motivation

The current test text examples focus on Chinese, Japanese, Korean, and English.
Since I can only speak Chinese, Japanese, and English, please reply in this thread if your use case hits a splitting error.

fytz282117 · 2024-10-29T09:47:42Z

When I run the example below, I get KeyError: 'digit'

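from split_lang import LangSplitter

lang_splitter = LangSplitter()
# Mixed Chinese/English text that starts with a digit and a period
text = "2.术语和定义 2.Terms and Definitions"
substr = lang_splitter.split_by_lang(
    text=text,
)
# Print each substring with its index and detected language
for index, item in enumerate(substr):
    print(f"{index}|{item.lang}:{item.text}")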

Do I need to add an attribute digit=0 to substring_text_len_by_lang?
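
The question above can be illustrated with a minimal sketch. The internals of split_lang are not shown in this thread, so the code below only borrows the idea of a per-language length counter; the helper count_lengths, the dictionary keys, and the sample sections are assumptions made for illustration. It shows why a counter dictionary without a 'digit' entry raises KeyError: 'digit', and two common ways to avoid that.

```python
from collections import defaultdict

# Hypothetical (lang, text) sections; "digit" stands in for a numeric section.
sections = [("digit", "2."), ("zh", "术语和定义"), ("en", "Terms and Definitions")]


def count_lengths(counter, sections):
    """Add each section's text length to the counter entry for its language."""
    for lang, text in sections:
        counter[lang] += len(text)
    return counter


# A plain dict that lacks the "digit" key fails on the first digit section.
try:
    count_lengths({"zh": 0, "en": 0}, sections)
except KeyError as err:
    error_message = f"KeyError: {err}"

# Option 1: pre-populate the missing key, as the question suggests.
fixed = count_lengths({"zh": 0, "en": 0, "digit": 0}, sections)

# Option 2: let any unknown key start at 0.
robust = count_lengths(defaultdict(int), sections)
```

Pre-populating the key keeps the set of known section types explicit, while defaultdict(int) tolerates section types added later. Which one suits the library depends on code not shown here.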

DoodleBears · 2024-10-30T11:53:10Z

It's a bug. I'll fix it; I've opened a new issue: #27

fytz282117 · 2024-10-31T06:33:18Z

It is not completely fixed.
The following two examples, which start with a mix of digits and symbols, still have problems:

text = "2.2术语和定义 2.2 Terms and Definitions"
text = "(2.2)术语和定义 (2.2)Terms and Definitions"

Also, the test code used for debugging was not commented out, so it prints the following (测试 means "test"):

测试：LangSectionType.ZH_JA
测试：LangSectionType.DIGIT
测试：LangSectionType.OTHERS

DoodleBears · 2024-10-31T10:17:09Z

Thanks for catching the bug. I'll take another look.

DoodleBears · 2024-10-31T10:20:58Z

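@fytz282117 If you have a lot of text for testing the split, you can also open a PR directly or paste it below this comment. The current robustness is honestly poor: the only things being split out are digit and punctuation, and you have just shown that both of them are buggy (🤣)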
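
For anyone collecting test texts, here is a minimal sketch of a round-trip check that could be run over them. It assumes a desirable property that this thread does not state: concatenating the returned substrings should reproduce the input. The names toy_splitter and check_round_trip are hypothetical, and toy_splitter is only a stand-in that separates ASCII from non-ASCII runs; it is not how split_lang works.

```python
from itertools import groupby

# Sample texts taken from this thread.
TEST_TEXTS = [
    "2.术语和定义 2.Terms and Definitions",
    "2.2术语和定义 2.2 Terms and Definitions",
    "(2.2)术语和定义 (2.2)Terms and Definitions",
]


def toy_splitter(text):
    """Stand-in splitter: group consecutive ASCII / non-ASCII characters."""
    return [
        ("ascii" if is_ascii else "non_ascii", "".join(chars))
        for is_ascii, chars in groupby(text, key=lambda ch: ord(ch) < 128)
    ]


def check_round_trip(splitter, texts):
    """Return the texts whose sections do not concatenate back to the input."""
    failures = []
    for text in texts:
        sections = splitter(text)
        rebuilt = "".join(part for _, part in sections)
        if rebuilt != text or any(part == "" for _, part in sections):
            failures.append(text)
    return failures


failures = check_round_trip(toy_splitter, TEST_TEXTS)
```

To try the same check against the real library, one could pass a small wrapper that calls lang_splitter.split_by_lang(text=text) and returns (item.lang, item.text) pairs, as in the example earlier in this thread.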

DoodleBears · 2024-10-31T14:11:16Z

I ran it. With merge_across_punctuation = True, the result is:

0|punctuation:(2.
1|zh:2)术语和定义 (2.2)
2|en:Terms and Definitions

With merge_across_punctuation = False:

0|punctuation:(2.2)
1|zh:术语和定义 
2|punctuation:(2.2)
3|en:Terms and Definitions

Ideally, after merge_across_punctuation merges sections, the merged section should be labeled with the language type of its non-punctuation part.
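
The labeling idea above can be sketched as follows. This is not the library's implementation: merge_punctuation is a hypothetical helper, and the rule it uses (attach punctuation to the previous section, or to the next one when the text starts with punctuation) is an assumption chosen for illustration. The input is the merge_across_punctuation = False result shown above.

```python
def merge_punctuation(sections):
    """Fold punctuation sections into a neighbor and keep the neighbor's language label."""
    merged = []
    pending = ""  # leading punctuation waiting for the first language section
    for lang, text in sections:
        if lang == "punctuation":
            if merged:
                # Attach to the previous section; its label stays unchanged.
                prev_lang, prev_text = merged[-1]
                merged[-1] = (prev_lang, prev_text + text)
            else:
                pending += text
        else:
            merged.append((lang, pending + text))
            pending = ""
    if pending:
        # The input contained nothing but punctuation.
        merged.append(("punctuation", pending))
    return merged


sections = [
    ("punctuation", "(2.2)"),
    ("zh", "术语和定义 "),
    ("punctuation", "(2.2)"),
    ("en", "Terms and Definitions"),
]
result = merge_punctuation(sections)
```

With this rule no merged section keeps the punctuation label unless the whole text is punctuation. Whether a trailing "(2.2)" belongs to the section before or after it is a separate design choice.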

DoodleBears · 2024-10-31T15:43:56Z

Given how varied punctuation is, the current behavior is as follows.
When merge_across_punctuation = True, punctuation is merged with the previous section:

lang_splitter.merge_across_digit = False
lang_splitter.merge_across_punctuation = True

0|digit:(2.2)
1|zh:术语和定义 (
2|digit:2.2)
3|en:Terms and Definitions

When merge_across_digit = True, digits are merged with the nearby symbols:

lang_splitter.merge_across_digit = True
lang_splitter.merge_across_punctuation = False

0|zh:(2.2)术语和定义
1|en:(2.2)Terms and Definitions

When both merge_across_digit = True and merge_across_punctuation = True, merge_across_digit runs first and merge_across_punctuation runs second:

lang_splitter.merge_across_digit = True
lang_splitter.merge_across_punctuation = True

0|zh:(2.2)术语和定义
1|en:(2.2)Terms and Definitions
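
The ordering described above (digit pass first, punctuation pass second) can be sketched as a two-step pipeline. Everything here is a toy under stated assumptions: fold_into_next and merge_sections are hypothetical helpers, the fine-grained input sections are invented for illustration, and the fold-forward rule is not taken from the library's source. Only the two flag names come from this thread.

```python
def fold_into_next(sections, label):
    """Prepend every section tagged `label` to the section that follows it."""
    result = []
    pending = ""
    for lang, text in sections:
        if lang == label:
            pending += text
        else:
            result.append((lang, pending + text))
            pending = ""
    if pending:
        # Nothing follows: attach to the previous section, or keep the label.
        if result:
            prev_lang, prev_text = result[-1]
            result[-1] = (prev_lang, prev_text + pending)
        else:
            result.append((label, pending))
    return result


def merge_sections(sections, merge_across_digit, merge_across_punctuation):
    """Run the digit pass first, then the punctuation pass."""
    if merge_across_digit:
        sections = fold_into_next(sections, "digit")
    if merge_across_punctuation:
        sections = fold_into_next(sections, "punctuation")
    return sections


# Hypothetical fine-grained sections for "(2.2)术语和定义 (2.2)Terms and Definitions".
raw = [
    ("punctuation", "("), ("digit", "2.2"), ("punctuation", ")"),
    ("zh", "术语和定义 "),
    ("punctuation", "("), ("digit", "2.2"), ("punctuation", ")"),
    ("en", "Terms and Definitions"),
]
result = merge_sections(raw, merge_across_digit=True, merge_across_punctuation=True)
for index, (lang, text) in enumerate(result):
    print(f"{index}|{lang}:{text}")
```

In this toy the digit pass leaves "2.2)" tagged as punctuation, and the punctuation pass then folds "(" and "2.2)" into the language section that follows, giving two sections grouped like the result reported above (the toy keeps the space after 定义).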

DoodleBears added the enhancement, help wanted, and good first issue labels on Jul 8, 2024

DoodleBears mentioned this issue Oct 31, 2024

wrong split result when punctuations and digits are near each other #29

Closed
