Add more test text examples including different languages #5

Open
DoodleBears opened this issue Jul 7, 2024 · 7 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@DoodleBears
Owner

DoodleBears commented Jul 7, 2024

Motivation

  • Current test text examples focus on Chinese, Japanese, Korean and English
  • Since I only speak Chinese, Japanese and English, please reply in this thread if you hit a splitting error in your own use case
@DoodleBears DoodleBears added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Jul 8, 2024
@fytz282117

fytz282117 commented Oct 29, 2024

When I run the example below, I get KeyError: 'digit'

from split_lang import LangSplitter
lang_splitter = LangSplitter()
text = "2.术语和定义 2.Terms and Definitions"
substr = lang_splitter.split_by_lang(
    text=text,
)
for index, item in enumerate(substr):
    print(f"{index}|{item.lang}:{item.text}")

Does substring_text_len_by_lang need an extra attribute, digit=0?
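
For readers following along: a KeyError like this is what you would see if a per-language length counter has no entry for a newly introduced label such as "digit". The sketch below is not split-lang's code. The function name `count_text_len_by_lang` and the (lang, text) tuple layout are made up for illustration. It only shows that a `defaultdict(int)` makes unseen labels start at 0, so no label has to be pre-seeded by hand.

```python
from collections import defaultdict


def count_text_len_by_lang(sections):
    """Sum the text length per label.

    `sections` is a list of (lang, text) tuples. This layout is an
    assumption for the sketch, not the library's internal structure.
    """
    totals = defaultdict(int)  # unseen labels such as "digit" start at 0
    for lang, text in sections:
        totals[lang] += len(text)
    return dict(totals)


# Hypothetical split of the reported input, for illustration only.
sections = [
    ("digit", "2."),
    ("zh", "术语和定义 "),
    ("digit", "2."),
    ("en", "Terms and Definitions"),
]
totals = count_text_len_by_lang(sections)
```

With a plain dict seeded only with language codes, `totals["digit"] += ...` would raise KeyError: 'digit'. Seeding digit=0 explicitly, as asked above, would avoid that too.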

@DoodleBears
Owner Author

DoodleBears commented Oct 30, 2024

This is a bug; I'll fix it. I've opened a new issue for it: #27

@fytz282117

fytz282117 commented Oct 31, 2024

It is not fully fixed.
The following two examples, which start with a mix of digits and punctuation, still have problems:

text = "2.2术语和定义 2.2 Terms and Definitions"
text = "(2.2)术语和定义 (2.2)Terms and Definitions"

Also, the debug test code was not commented out, so it prints:

测试:LangSectionType.ZH_JA
测试:LangSectionType.DIGIT
测试:LangSectionType.OTHERS

@DoodleBears
Owner Author

Thanks for catching these bugs; I'll take another look.

@DoodleBears
Owner Author

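@fytz282117 If you have a lot of text for testing the split, feel free to open a PR or paste it under this comment. The robustness right now really is poor: the splitter only separates out digit and punctuation, and you have just shown that both are buggy (🤣)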

@DoodleBears
Owner Author

DoodleBears commented Oct 31, 2024

I ran it. With merge_across_punctuation = True, the result is:

0|punctuation:(2.
1|zh:2)术语和定义 (2.2)
2|en:Terms and Definitions

With merge_across_punctuation = False:

0|punctuation:(2.2)
1|zh:术语和定义 
2|punctuation:(2.2)
3|en:Terms and Definitions

Ideally, once merge_across_punctuation has merged the sections, the result should be labeled with the language of the non-punctuation part.
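
One way to picture that labeling rule is the rough sketch below. It is not the library's implementation. `merge_group`, `NEUTRAL_LABELS` and the (lang, text) tuples are hypothetical names, and treating "digit" as neutral alongside "punctuation" is an extra assumption.

```python
# Labels that should never name a merged section. "punctuation" comes from the
# output above; treating "digit" the same way is an assumption of this sketch.
NEUTRAL_LABELS = {"punctuation", "digit"}


def merge_group(group):
    """Concatenate (lang, text) sections and choose one label for the result.

    The label is taken from the first member that is a real language. If the
    group holds only neutral sections, the first label is kept.
    """
    if not group:
        raise ValueError("group must contain at least one section")
    text = "".join(part for _, part in group)
    for lang, _ in group:
        if lang not in NEUTRAL_LABELS:
            return (lang, text)
    return (group[0][0], text)


merged_zh = merge_group([("punctuation", "(2.2)"), ("zh", "术语和定义 ")])
merged_neutral = merge_group([("punctuation", "("), ("punctuation", ")")])
```

Here `merged_zh` is labeled zh even though the group starts with punctuation, which is the behavior described above.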

@DoodleBears
Owner Author

Since punctuation varies so much, the current behavior is:
when merge_across_punctuation = True, punctuation is merged into the preceding section.

lang_splitter.merge_across_digit = False
lang_splitter.merge_across_punctuation = True

0|digit:(2.2)
1|zh:术语和定义 (
2|digit:2.2)
3|en:Terms and Definitions

When merge_across_digit = True, digits are merged with the adjacent punctuation.

lang_splitter.merge_across_digit = True
lang_splitter.merge_across_punctuation = False

0|zh:(2.2)术语和定义
1|en:(2.2)Terms and Definitions

When merge_across_digit = True and merge_across_punctuation = True, merge_across_digit is applied first and merge_across_punctuation second.

lang_splitter.merge_across_digit = True
lang_splitter.merge_across_punctuation = True

0|zh:(2.2)术语和定义
1|en:(2.2)Terms and Definitions
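
The two-pass order can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not split-lang's actual algorithm. `absorb`, `merge_sections` and the pre-split `raw` list are hypothetical, and the attachment rule (merge into the next section, or the previous one when trailing) is chosen for the sketch and may differ from the library's. It does not try to reproduce the library's exact output for every flag combination.

```python
def absorb(sections, label):
    """Attach each section tagged `label` to the next section with another label.

    Trailing sections of that label are appended to the previous section.
    """
    merged, pending = [], ""
    for lang, text in sections:
        if lang == label:
            pending += text  # hold until a different label shows up
        else:
            merged.append((lang, pending + text))
            pending = ""
    if pending:
        if merged:
            last_lang, last_text = merged[-1]
            merged[-1] = (last_lang, last_text + pending)
        else:
            merged.append((label, pending))  # input held nothing but `label`
    return merged


def merge_sections(sections, merge_across_digit=True, merge_across_punctuation=True):
    """Run the digit pass first, then the punctuation pass."""
    if merge_across_digit:
        sections = absorb(sections, "digit")
    if merge_across_punctuation:
        sections = absorb(sections, "punctuation")
    return sections


# Hypothetical fine-grained split of the second reported example.
raw = [
    ("punctuation", "("), ("digit", "2.2"), ("punctuation", ")"),
    ("zh", "术语和定义 "),
    ("punctuation", "("), ("digit", "2.2"), ("punctuation", ")"),
    ("en", "Terms and Definitions"),
]
both = merge_sections(raw)
digit_only = merge_sections(raw, merge_across_punctuation=False)
```

With both flags on, `both` has one zh section and one en section, each starting with "(2.2)". With only the digit pass, the digits are folded into the neighboring punctuation and no section is labeled digit.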
