
SLEP022: Fixing randomness ambiguities #88

Open · wants to merge 26 commits into main
Conversation

@NicolasHug (Member) commented May 12, 2023

This SLEP is NOT the same as its predecessor #24.

Disclaimer: I won't be able to champion this SLEP.
I'm opening it here now because I hope it can help better frame the discussions happening in scikit-learn/scikit-learn#26148.

@betatim (Member) commented May 15, 2023

I'd like to take this SLEP on if that is OK with you.

@NicolasHug (Member, Author) replied:

Sure, happy for you to take over, @betatim, thanks!



Abstract
========

Mention that we make these behaviour changes together with the switch to the new NumPy random infrastructure, combined with the planned transition of the argument name from random_state= to rng=.

    if 'random_state' not in est.get_params():
        raise ValueError("This estimator isn't random and can only have exact clones")

    def cross_val_score(est, X, y, cv, use_exact_clones=True):

Rename use_exact_clones so that statistical clones are allowed, while deterministic estimators remain fine as well.

Maybe allow_statistical_clone=
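To make the exact/statistical clone distinction discussed here concrete, below is a minimal sketch. `clone` is scikit-learn's existing exact-copy utility; `statistical_clone` is a hypothetical helper (not part of scikit-learn) that re-draws the seed so the copy is only equivalent in distribution:

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def statistical_clone(est, rng):
    """Hypothetical helper: copy `est` but draw a fresh integer seed
    from `rng`, so the copy is statistically equivalent to `est`
    rather than bitwise identical on refit."""
    new_est = clone(est)  # exact clone: same params, incl. random_state
    if "random_state" in new_est.get_params():
        new_est.set_params(random_state=int(rng.integers(2**31)))
    return new_est

rng = np.random.default_rng(0)
est = DecisionTreeClassifier(random_state=42)
exact = clone(est)                   # refits identically to est
stat = statistical_clone(est, rng)   # same distribution, different stream
```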

@ogrisel (Member) left a comment:

Quick notes based on our drafting meeting discussion:

This means that ideally, calling `est.fit(X, y)` should yield the same model
twice. We have a check for that in the `check_estimator()` suite:
`check_fit_idempotent()`. Clearly, this fit-idempotency property is violated
when None is used.
Suggested change:
- when None is used.
+ when None or a `np.random.RandomState` instance is used.
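A small self-contained sketch of why an RNG instance, unlike an integer seed, violates fit idempotency; `fake_fit` is a hypothetical stand-in for an estimator's fit method:

```python
import numpy as np

def fake_fit(random_state):
    """Stand-in for fit(): draws three numbers from the given source."""
    rng = (np.random.RandomState(random_state)
           if isinstance(random_state, int) else random_state)
    return rng.randint(100, size=3)

# Integer seed: a fresh RandomState is built on each call, so "fit"
# is strictly idempotent.
assert np.array_equal(fake_fit(0), fake_fit(0))

# Shared RandomState instance: the first call consumes (mutates) its
# state, so the second call sees different numbers.
rs = np.random.RandomState(0)
out1 = fake_fit(rs)
out2 = fake_fit(rs)
assert not np.array_equal(out1, out2)
```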

twice. We have a check for that in the `check_estimator()` suite:
`check_fit_idempotent()`. Clearly, this fit-idempotency property is violated
when None is used.


Note: we could improve our docs and state that the fit results of scikit-learn compatible estimators should be at least "statistically idempotent" (or "idempotent in expectation"), unless random_state is passed as an integer seed, in which case they should be strictly idempotent.

I think this is a sufficient requirement.
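As an illustration of the two regimes (standard scikit-learn API, nothing hypothetical here): with an integer seed, two fits are strictly idempotent, whereas with random_state=None they can only be expected to agree statistically:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.RandomState(0).rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)

# Integer seed: fitting twice yields bitwise-identical models, so
# predictions match exactly (strict idempotency).
a = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
b = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
assert np.array_equal(a.predict(X), b.predict(X))

# random_state=None: each fit draws fresh randomness, so two fits
# agree only "in expectation" (statistical idempotency), not bitwise.
```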

are using.

This also introduces a private attribute, so we would need more intrusive
changes to `set_params`, `get_params`, and `clone`.
@ogrisel (Member) commented Jun 30, 2023:

I think we could alternatively explore the possibility of doing stateless consumption in fit instead of seeding in __init__.

    def fit(self):  # or split()
        # copy= is a proposed parameter, not in check_random_state today
        rng = check_random_state(self.random_state, copy=True)
        ...

This way we would still enforce the stateless fit semantics for the est.random_state attribute.

I will try to find the time to explore this option in an alternative notebook in a gist.
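A runnable sketch of this idea under the stated assumptions: since the copy= parameter does not exist in check_random_state today, copy.deepcopy emulates it here, on a hypothetical toy estimator:

```python
import copy
import numpy as np

class TinyEstimator:
    """Toy estimator that consumes its RNG statelessly in fit."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X):
        rs = self.random_state
        if isinstance(rs, np.random.Generator):
            # Copy first, so the caller's generator is never mutated.
            rng = copy.deepcopy(rs)
        else:
            rng = np.random.default_rng(rs)
        self.coef_ = rng.standard_normal(X.shape[1])
        return self

shared = np.random.default_rng(0)
est = TinyEstimator(random_state=shared)
c1 = est.fit(np.zeros((3, 2))).coef_
c2 = est.fit(np.zeros((3, 2))).coef_
# Repeated fits are idempotent even though a Generator instance
# was passed, because fit never advances the shared generator.
assert np.array_equal(c1, c2)
```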

@joelostblom

Thanks for working on this! I think these clarifications in the docs and the updated behavior would be really helpful for reducing ambiguity. I noticed that in scikit-learn/scikit-learn#26148 (comment), point 3, there was a mention that the new NumPy Generators would support different behavior than the old RandomState objects (if I understood correctly). I didn't see a mention of that distinction in this proposal; is this difference still planned to be implemented?
