Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outdate/broken Sentence tokenizer? #1118

Closed
parhammmm opened this issue Aug 12, 2024 · 3 comments · Fixed by #1126
Closed

Outdate/broken Sentence tokenizer? #1118

parhammmm opened this issue Aug 12, 2024 · 3 comments · Fixed by #1126
Labels
bug Something isn't working

Comments

@parhammmm
Copy link
Contributor

Came across an issue earlier, that highlighted the way the tokenizer works. I'm referring to the try/catch here

export const splitBySentenceTokenizer = (): TextSplitterFn => {
if (!sentenceTokenizer) {
sentenceTokenizer = new SentenceTokenizerNew();
}
const tokenizer = sentenceTokenizer;
return (text: string) => {
try {
return tokenizer.tokenize(text);
} catch {
return [text];
}
};
};

I noticed natural is being shimmed in but it's the older version which is buggy and making the try/catch necessary. The new Sentence Splitter works quite well and doesn't have the same issues that would cause the tokenizer to fail.

Using a deprecated tokenizer and hiding the errors from it is quite messy which isn't ideal.

natural is CommonJS only and so that makes sense as why it's being shimmed in but could we update to the latest natural and do something like the following inside splitBySentenceTokenizer while we wait for NaturalNode/natural#744? would that break something in the packaging in LLamaindexTS @himself65 ?

  const {
    default: { SentenceTokenizer },
  } = await import("natural");

We can still try catch on the function but at least make it not silent... console.log at the minimum?

@petergoldstein
Copy link
Contributor

@himself65 Any thoughts on this? Would there be support for a PR that addresses updating the natural version?

@himself65
Copy link
Member

I used tsup to bundle the natural package, yeah let me do a upgrade later

@himself65 himself65 added the bug Something isn't working label Aug 14, 2024
@parhammmm
Copy link
Contributor Author

@himself65 thanks!  🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants