The dataset in this example is sorted by label: the first 500 tweets are toxic and the rest are non-toxic.
The example does not explicitly shuffle it, but it still works because cross_validate uses a stratified k-fold by default for classification. However, that is not immediately obvious, and I think there has been some recent discussion in scikit-learn that it might be better not to stratify by default. So I think it would be good to actually shuffle the dataset.
I'm not sure where the best place to do it would be: in the hosted zip file, in the fetcher, in the example, or by passing a splitter with shuffle=True to cross_validate.
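To make the concern concrete, here is a minimal sketch on synthetic labels rather than the actual tweets. It assumes 500 positive labels followed by 500 negative ones (the size of the second class is an assumption) and compares the classes seen in each test fold for an unshuffled KFold versus StratifiedKFold. The helper `classes_per_test_fold` is hypothetical and exists only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic stand-in for a label-sorted dataset: 500 positives, then
# 500 negatives. The size of the second class is an assumption.
y = np.array([1] * 500 + [0] * 500)
X = np.zeros((len(y), 1))  # features do not matter for the split itself


def classes_per_test_fold(splitter):
    """Return the sorted list of labels present in each test fold."""
    return [sorted(set(y[test].tolist())) for _, test in splitter.split(X, y)]


# Unshuffled KFold cuts contiguous blocks, so most folds hold one class.
plain = classes_per_test_fold(KFold(n_splits=5))

# StratifiedKFold preserves the class ratio in every fold.
stratified = classes_per_test_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With an unshuffled, non-stratified split, most test folds contain a single class, which is the situation the example would run into if stratification stopped being the default.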
In this case, I think I would rather shuffle in the examples: this problem is frequent in real applications, and I would like 1) people to see it, and 2) our code to remain elegant when it has to shuffle.
Real-world applications and datasets have many issues and aren't necessarily shuffled by default. If we want skrub to go beyond the toy examples that scikit-learn uses, then I believe it's best to stratify explicitly instead of relying on a perfect dataset.
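The two options discussed in this thread could look roughly like the sketch below. It uses synthetic data and LogisticRegression as stand-ins (both are assumptions, not the pipeline from the example): option 1 shuffles the rows with sklearn.utils.shuffle, and option 2 keeps the data as-is and passes an explicit shuffling StratifiedKFold through the cv argument of cross_validate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.utils import shuffle

# Synthetic, label-sorted data standing in for the real dataset (assumption).
rng = np.random.RandomState(0)
y = np.array([1] * 500 + [0] * 500)
X = rng.normal(size=(len(y), 2)) + y[:, None]  # weakly informative features

# Option 1: shuffle rows and labels together before any cross-validation.
X_shuffled, y_shuffled = shuffle(X, y, random_state=0)

# Option 2: leave the data sorted and make the splitting strategy explicit.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(LogisticRegression(), X, y, cv=cv)
scores = results["test_score"]

print("mean accuracy:", scores.mean())
```

Either way, the example no longer depends on the default splitter that cross_validate picks for classifiers.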
WDYT @Vincent-Maladiere