This is required for stacking and for leave-one-out target encoders if we want a nice design. We already do something like this in a few places, but we don't have a coherent contract. I think a SLEP would be nice.
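For context, the leave-one-out target encoder is the canonical case where the training-time output must differ from the test-time output: encoding a training row with a category mean that includes its own target leaks the label. A minimal sketch (hypothetical class, not a scikit-learn API) of a transformer where fit_transform legitimately disagrees with transform:

```python
from collections import defaultdict

class LOOTargetEncoder:
    """Toy leave-one-out target encoder for a single categorical column."""

    def fit(self, X, y):
        # Per-category target sums and counts over the training data.
        self.sum_ = defaultdict(float)
        self.count_ = defaultdict(int)
        for x, t in zip(X, y):
            self.sum_[x] += t
            self.count_[x] += 1
        return self

    def transform(self, X):
        # Test time: plain per-category target mean.
        return [self.sum_[x] / self.count_[x] for x in X]

    def fit_transform(self, X, y):
        # Training time: leave each row's own target out of its mean so the
        # encoding does not leak y -- hence fit_transform != transform.
        self.fit(X, y)
        return [
            (self.sum_[x] - t) / (self.count_[x] - 1) if self.count_[x] > 1
            else 0.0  # singleton category: no other rows to average over
            for x, t in zip(X, y)
        ]

X = ["a", "a", "a", "b", "b"]
y = [1.0, 0.0, 1.0, 1.0, 1.0]
enc = LOOTargetEncoder()
train_codes = enc.fit_transform(X, y)  # leave-one-out means
test_codes = enc.transform(X)          # full-data means: different values
```

Under the current contract a Pipeline may assume both paths agree, which is exactly the friction this issue is about.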
In terms of "we kind of do this in some places", the handling of training points in kneighbors/radius_neighbors is a fair example of something similar.
The places we have encountered such transforms include StackingTransformer (scikit-learn/scikit-learn#8960) and NearestNeighborsTransformer (scikit-learn/scikit-learn#10482). Both could be redesigned as meta-estimators, but would lose some conceptual simplicity. We could also handle this case more explicitly by implementing fit_resample, where the estimator can use transform at test time.
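The kneighbors precedent can be made concrete: at training time each sample's neighbors are found among the *other* training points (a point's zero-distance match with itself is excluded), while at test time every training point is a candidate. A pure-NumPy sketch in the spirit of a NearestNeighborsTransformer (hypothetical class, not the scikit-learn implementation):

```python
import numpy as np

class KNNGraphTransformer:
    """Toy 1-nearest-neighbor index transformer over a 1-D feature."""

    def fit(self, X):
        self.X_ = np.asarray(X, dtype=float)
        return self

    def transform(self, X):
        # Test time: nearest training point; self-matches are allowed.
        d = np.abs(np.asarray(X, dtype=float)[:, None] - self.X_[None, :])
        return d.argmin(axis=1)

    def fit_transform(self, X):
        # Training time: mask each point's zero-distance match with itself,
        # mirroring kneighbors on the training set -- so the training-time
        # output differs from transform on the same inputs.
        self.fit(X)
        d = np.abs(self.X_[:, None] - self.X_[None, :])
        np.fill_diagonal(d, np.inf)
        return d.argmin(axis=1)

X = [0.0, 0.1, 5.0]
t = KNNGraphTransformer()
train_nn = t.fit_transform(X)  # neighbors among the other training points
test_nn = t.transform(X)       # same points as test data: each finds itself
```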
We could achieve stacking, target transformers, and more straightforward handling of the neighbor graphs by:

1. "just allowing" fit_transform != transform. I haven't thought deeply about the consequences yet; what are the issues that will come up?
2. piggy-backing on resampling, as @jnothman suggests above, and re-using fit_resample to do the training-set transformation. In that case I still think it's weird to call the method resample; we'd also have the potential to modify y during training.
3. adding another verb and enforcing sample alignment (and maybe, or maybe not, also returning y).

cc @GaelVaroquaux who would like to move this forward :)
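The fit_resample route above has one distinctive consequence: the training-time call returns an (Xt, yt) pair, so it can change the number of rows as long as X and y stay in sync. A sketch under that assumption (hypothetical class; fit_resample on a transformer is not an existing scikit-learn method):

```python
class OutlierDropper:
    """Toy fit_resample transformer: drops training rows, passes test rows through."""

    def fit(self, X, y=None):
        # Learn a robust center of the single feature (median).
        xs = sorted(row[0] for row in X)
        self.center_ = xs[len(xs) // 2]
        return self

    def transform(self, X):
        # Test time: never drop rows; output stays sample-aligned with input.
        return X

    def fit_resample(self, X, y):
        # Training time: drop rows far from the center, returning a *shorter*
        # (Xt, yt) pair. X and y must be modified together, which is exactly
        # what a sample-aligned verb could not express.
        self.fit(X)
        kept = [i for i, row in enumerate(X) if abs(row[0] - self.center_) < 10]
        return [X[i] for i in kept], [y[i] for i in kept]

X = [[1.0], [2.0], [100.0]]
y = [0, 0, 1]
od = OutlierDropper()
Xt, yt = od.fit_resample(X, y)  # the outlier row is removed from both X and y
```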
Activity
raamana commentedon Feb 7, 2019

A FitTransformMixin common to all estimators would solve it, perhaps?

amueller commentedon Feb 7, 2019
How so? The point is whether we can adjust the contract / user expectations. Either we need to redefine the contract or we need to provide an interface.
The implementation is the easy part (that's probably true for all the SLEPs)
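For reference, the contract in question: scikit-learn's TransformerMixin makes fit_transform a shorthand for fit followed by transform, and downstream code (Pipeline in particular) relies on the two paths agreeing. A stripped-down stand-in for that default (toy class, not the actual mixin code):

```python
class TransformLike:
    """Minimal stand-in for the TransformerMixin default behavior."""

    def fit(self, X, y=None):
        # Learn a shift for a 1-D feature (here: its mean).
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]

    def fit_transform(self, X, y=None):
        # The default contract: fit_transform(X) == fit(X).transform(X).
        # A mixin alone re-states this contract; it does not let an
        # estimator diverge from it without surprising Pipeline.
        return self.fit(X, y).transform(X)

X = [1.0, 2.0, 3.0]
t = TransformLike()
assert t.fit_transform(X) == t.transform(X)  # the equality users rely on
```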
jnothman commentedon Feb 7, 2019
amueller commentedon Feb 17, 2020
I just realized that this is the place where I should have put scikit-learn/scikit-learn#15553.

Basically what MLR does is that fit returns the training-time version of transform, which basically solves the problem without introducing a new verb/method. However, method chaining is so idiomatic in sklearn that I don't really see how to get around this. We could also "just" introduce a different verb/method, say forward, that does the training-time transformation and returns X and y, and that is used in meta-estimators (mostly Pipeline). That would also solve #15.
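The forward idea above in sketch form (forward is the hypothetical verb from the comment, not an existing scikit-learn method): a pipeline calls forward during fit to obtain the training-time (X, y), and calls transform at prediction time.

```python
class ShiftStep:
    """Toy step exposing both a training-time and a test-time verb."""

    def fit(self, X, y):
        self.mean_ = sum(X) / len(X)
        return self

    def forward(self, X, y):
        # Training-time transformation: returns both X and y,
        # so meta-estimators can keep them sample-aligned.
        return [x - self.mean_ for x in X], y

    def transform(self, X):
        # Test-time transformation: X only.
        return [x - self.mean_ for x in X]

class MeanPredictor:
    """Toy final estimator: predicts the mean of the training targets."""

    def fit(self, X, y):
        self.c_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.c_ for _ in X]

class ForwardPipeline:
    """Toy pipeline wired to the hypothetical forward verb."""

    def __init__(self, steps, final):
        self.steps = steps
        self.final = final

    def fit(self, X, y):
        for step in self.steps:
            X, y = step.fit(X, y).forward(X, y)  # training-time path
        self.final.fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps:
            X = step.transform(X)  # test-time path
        return self.final.predict(X)

pipe = ForwardPipeline([ShiftStep()], MeanPredictor()).fit([1.0, 3.0], [0.0, 1.0])
preds = pipe.predict([2.0])
```

Method chaining survives because forward is only called by meta-estimators, never by end users.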
amueller commentedon Feb 17, 2020
Actually, SLEP 001 (#2) basically implements something like this, only SLEP 001 also has a test-time version of it, which I think is a bit weird; or rather, it conflates two issues: distinguishing training time vs. test time, and returning a complex object / tuple.
amueller commentedon Dec 7, 2020
There's a relevant discussion happening here: #15 (comment)
Maybe it's fair to say that there are three extensions to transform we're considering:
We're doing all three in SLEP 5, but with a focus on doing non-sample-aligned transformations, and only allowing somewhat trivial versions of 2 and 3.
We need some more elaborate version of 2) for the target encoder (that I've brought up many times now lol): scikit-learn/scikit-learn#17323
SLEP 1 tries to do all of these together, which I think might not be necessary (we couldn't come up with good examples).
amueller commentedon Dec 8, 2020
Good point that @GaelVaroquaux brought up that was probably also in the other thread: we don't want to allow fit_resample in a ColumnTransformer, but we do want to allow the target encoder.

jnothman commentedon Dec 10, 2020