shuffling the toxicity dataset? #1234

Open
jeromedockes opened this issue Feb 7, 2025 · 2 comments
Labels
documentation, no changelog needed

Comments

@jeromedockes
Member

Describe the issue linked to the documentation

The dataset in this example is sorted by label: the first 500 tweets are toxic and the rest are non-toxic.

The example does not explicitly shuffle it, but it still works because cross_validate uses a stratified k-fold by default for classification. However, that is not immediately obvious, and I think there has been some recent discussion in scikit-learn that it might be better not to stratify by default. So I think it would be good to actually shuffle the dataset.

Not sure where the best place to do it would be: in the hosted zip file, in the fetcher, in the example, or by passing a splitter created with shuffle=True as the cv argument of cross_validate.
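To make the concern concrete, here is a rough sketch of what happens to the test folds when the labels are sorted. It uses a toy label vector rather than the real dataset: the 500 non-toxic rows, the dummy feature matrix, and the helper name `classes_per_test_fold` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy stand-in for a label-sorted dataset: 500 positives, then 500 negatives.
# The size of the negative block is an assumption, not the real dataset's count.
y = np.array([1] * 500 + [0] * 500)
X = np.zeros((len(y), 1))  # features do not affect how the splitters cut the data


def classes_per_test_fold(splitter):
    """Return the set of labels that appears in each test fold."""
    return [set(y[test_idx].tolist()) for _, test_idx in splitter.split(X, y)]


# Unshuffled KFold takes contiguous blocks, so most test folds hold one class.
plain = classes_per_test_fold(KFold(n_splits=5))

# StratifiedKFold keeps the class ratio in every fold, even without shuffling.
stratified = classes_per_test_fold(StratifiedKFold(n_splits=5))

# Shuffling inside the splitter also mixes the classes.
shuffled = classes_per_test_fold(KFold(n_splits=5, shuffle=True, random_state=0))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
print("KFold + shuffle:", shuffled)
```

With the unshuffled KFold, only the middle fold straddles the class boundary, which is why the example currently depends on the stratified default.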

WDYT @Vincent-Maladiere

Suggest a potential alternative/fix

No response

@jeromedockes added the documentation and no changelog needed labels Feb 7, 2025
@GaelVaroquaux
Member

GaelVaroquaux commented Feb 7, 2025 via email

@Vincent-Maladiere
Member

Real-world applications and datasets have many issues and aren't necessarily shuffled by default. If we want skrub's examples to go beyond toy cases like those in scikit-learn, then I believe it's better to stratify explicitly than to rely on a perfectly prepared dataset.
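One way this could look in the example is to build the splitter by hand and pass it as `cv`, so the stratification is visible to the reader. This is only a sketch under assumptions: the toy labels, the all-zero features, and the `DummyClassifier` are placeholders, not the estimator or data used in the actual skrub example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Placeholder data: labels sorted by class, as in the dataset under discussion.
y = np.array([1] * 500 + [0] * 500)
X = np.zeros((len(y), 1))

# State the splitting strategy explicitly instead of relying on the default.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# DummyClassifier stands in for the real pipeline; it always predicts one class.
results = cross_validate(DummyClassifier(strategy="most_frequent"), X, y, cv=cv)

print(results["test_score"])
```

Because every fold keeps the 50/50 class ratio, a constant predictor scores 0.5 on each fold here, which gives a sane baseline that an unstratified, unshuffled split would not.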

3 participants