Add more test text examples including different languages #5
Comments
When I run the example below, I get KeyError: 'digit':
from split_lang import LangSplitter
lang_splitter = LangSplitter()
text = "2.术语和定义 2.Terms and Definitions"
substr = lang_splitter.split_by_lang(
    text=text,
)
for index, item in enumerate(substr):
    print(f"{index}|{item.lang}:{item.text}")
Does substring_text_len_by_lang need an additional digit=0 entry? |
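For readers following along, here is a minimal sketch of how this kind of KeyError can arise and one way to avoid it. It does not reproduce split_lang's internals: `count_len_plain`, `count_len_safe`, and the `sections` list are hypothetical stand-ins, and the real substring_text_len_by_lang may be structured differently.

```python
from collections import Counter

# Hypothetical (label, text) pairs, loosely modelled on the example above.
sections = [
    ("digit", "2."),
    ("zh", "术语和定义"),
    ("digit", "2."),
    ("en", "Terms and Definitions"),
]

def count_len_plain(sections):
    """Sum text length per label using a dict with a fixed set of keys."""
    totals = {"zh": 0, "en": 0}  # assumption: no "digit" entry was declared
    for label, text in sections:
        totals[label] += len(text)  # raises KeyError on an undeclared label
    return totals

def count_len_safe(sections):
    """Same sum, but unseen labels start at 0 instead of raising."""
    totals = Counter()
    for label, text in sections:
        totals[label] += len(text)
    return dict(totals)

try:
    count_len_plain(sections)
except KeyError as exc:
    print("KeyError:", exc)

print(count_len_safe(sections))
```

Adding digit=0 to the table, as suggested above, would fix this one label; a Counter (or collections.defaultdict(int)) also covers any label added later, at the cost of silently accepting misspelled labels.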
It's a bug. I'll fix it; I've opened a new issue for it. |
Not fully fixed. For example:
text = "2.2术语和定义 2.2 Terms and Definitions"
text = "(2.2)术语和定义 (2.2)Terms and Definitions"
Also, the test code used for debugging was not commented out, so it prints:
测试:LangSectionType.ZH_JA
测试:LangSectionType.DIGIT
测试:LangSectionType.OTHERS |
Thanks for catching the bug. I'll take another look. |
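@fytz282117 If you have a lot of texts for testing split, feel free to open a PR directly or paste them under this comment. The robustness is honestly poor right now: it only splits out digit and punctuation, and from what you've reported both of those are buggy (🤣) |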
I ran it.
If
I think that, ideally, |
Given how varied the symbols are, the current behavior is
|
Motivation