Description
Hi!
Thanks for this game-changing package.
I am starting to work on a text classification problem.
Looking at the source code, I see that some text cleaning routines have been included and run before any tokenization.
First, they rely on NLTK corpora, which can be hard to obtain in offline environments such as Insee's on-premise infrastructure. However, my concern goes beyond that technical point.
In my opinion, this package should not be a Swiss Army knife: that would make it harder to maintain (which is already quite challenging, even for small packages). Unless I am mistaken, fasttext does not perform that kind of cleaning before estimation or inference. I think that is the right design: cleaning is a methodological choice, so users need to be aware of it and do it themselves.
My proposal: remove that code. Users would then provide ready-to-use text data for estimation or inference.
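To illustrate what user-side cleaning could look like, here is a minimal sketch that uses only the Python standard library, so it needs no NLTK download. The function name `clean_text` and the tiny stopword list are purely hypothetical and are not part of this package or of fasttext; each user would adapt the steps to their own methodological choices.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword list; a real project would pick its own.
STOPWORDS = frozenset({"le", "la", "les", "de", "des", "du", "et"})


def clean_text(text: str, stopwords: frozenset = STOPWORDS) -> str:
    """Lowercase, strip accents and punctuation, and drop stopwords."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not an ASCII letter, digit or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Split on whitespace and remove stopwords.
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


# The user cleans the data up front and hands ready-to-use text to the model.
raw_texts = ["Vente de pièces détachées, et réparation !", "Boulangerie-pâtisserie"]
cleaned = [clean_text(t) for t in raw_texts]
```

With this approach the cleaning stays visible in the user's own pipeline, and the same function can be applied at both estimation and inference time.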