Text cleaning: scope of the package

Hi!

Thanks for this game-changing package.

I am starting to work on a text classification problem.

Looking at the source code, I see that some text cleaning routines have been included and run before any tokenization.

First, they rely on NLTK corpora, which can be challenging to use in offline environments such as Insee's on-premise infrastructure. However, my concern goes beyond that technical point.

In my opinion, this package should not be a Swiss Army knife: that would make it harder to maintain (which is challenging enough even for small packages). Unless I am mistaken, fastText itself does not perform that kind of cleaning before estimation or inference. I think that is a good design: users need to be aware of that methodological choice and make it themselves.

My proposal: remove that kind of code from the package. Users would then provide ready-to-use text data for estimation or inference.
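To illustrate what "ready-to-use text" could mean on the user side, here is a minimal sketch of a cleaning step that runs before anything is handed to the package. It is only an illustration: `clean_text` and `STOPWORDS` are hypothetical names that are not part of torch-fastText, the stopword list is a placeholder the user would supply, and it relies only on the standard library so that it also works offline, without downloading NLTK corpora.

```python
import re
import unicodedata

# Hypothetical, user-supplied stopword list (placeholder values, no download needed).
STOPWORDS = frozenset({"the", "a", "of", "and"})


def clean_text(text: str, stopwords: frozenset = STOPWORDS) -> str:
    """Lowercase, strip accents and punctuation, and drop stopwords."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is neither a word character nor whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Split on whitespace, remove stopwords, re-join with single spaces.
    return " ".join(tok for tok in text.split() if tok not in stopwords)


# The user cleans the raw texts once, then passes `cleaned` on for
# estimation or inference.
raw_texts = ["The Café, of Paris!", "The repair of bicycles"]
cleaned = [clean_text(t) for t in raw_texts]
```

With something like this, the cleaning choices (which stopwords, whether to strip accents, and so on) stay explicit in the user's own code rather than hidden in the package.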

## Related issue

- InseeFrLab/torch-fastText#29 and InseeFrLab/torch-fastText#27 would both be solved by this


Text cleaning: scope of the package #5
