Setting sub-estimator parameters (e.g. in GridSearchCV) #796
-
Proposal by @GaelVaroquaux during the last meeting: set the default values as class constants and only clone if the transformer corresponds to this constant.
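For concreteness, a minimal sketch of that proposal (the class name and the `DummyRegressor` default are illustrative, not from the thread):

```python
from sklearn.base import clone
from sklearn.dummy import DummyRegressor

class WithConstantDefault:
    # the default value, stored as a class constant
    _DEFAULT_REGRESSOR = DummyRegressor()

    def __init__(self, regressor=_DEFAULT_REGRESSOR):
        # clone only when the shared default itself was passed;
        # user-provided estimators are stored untouched
        if regressor is self._DEFAULT_REGRESSOR:
            regressor = clone(regressor)
        self.regressor = regressor
```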
-
Olivier Grisel gave another option: redefine `get_params` and `set_params`.
-
> Proposal by @GaelVaroquaux during the last meeting: set the default values as class constants

I would actually go for module-level constants (we don't want to clutter the class and the corresponding instances with them). And also, write them in ALL_CAPS (they are constants).
-
> Olivier Grisel gave a probably better option: redefine `get_params` and `set_params`

I don't like it. I would rather touch `get_params`/`set_params` as little as possible.
-
What is the problem with the first solution? For the second solution, if a user changes an attribute on the passed transformer after `__init__`, the changes won't affect the …
-
Of the options we discussed, only overriding …

```python
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.dummy import DummyRegressor
from sklearn.utils.estimator_checks import check_estimator
from sklearn.utils.validation import check_is_fitted
from sklearn.datasets import make_regression


class DefaultNone(RegressorMixin, BaseEstimator):
    """Default is None, create in __init__"""

    def __init__(self, regressor=None):
        self.regressor = regressor if regressor is not None else DummyRegressor()

    def fit(self, X, y):
        self.regressor_ = clone(self.regressor).fit(X, y)
        return self

    def predict(self, X):
        check_is_fitted(self, attributes=["regressor_"])
        return self.regressor_.predict(X)

    def _more_tags(self):
        return self.regressor._more_tags()


class AlwaysCloneDefault(RegressorMixin, BaseEstimator):
    """Default is an estimator, clone in __init__"""

    def __init__(self, regressor=DummyRegressor()):
        self.regressor = clone(regressor)

    def fit(self, X, y):
        self.regressor_ = clone(self.regressor).fit(X, y)
        return self

    def predict(self, X):
        check_is_fitted(self, attributes=["regressor_"])
        return self.regressor_.predict(X)

    def _more_tags(self):
        return self.regressor._more_tags()


DEFAULT_REGRESSOR = DummyRegressor()


class DefaultGlobalConstant(RegressorMixin, BaseEstimator):
    """Default is an estimator, clone in __init__ if default is passed"""

    def __init__(self, regressor=DEFAULT_REGRESSOR):
        self.regressor = (
            clone(regressor) if regressor is DEFAULT_REGRESSOR else regressor
        )

    def fit(self, X, y):
        self.regressor_ = clone(self.regressor).fit(X, y)
        return self

    def predict(self, X):
        check_is_fitted(self, attributes=["regressor_"])
        return self.regressor_.predict(X)

    def _more_tags(self):
        return self.regressor._more_tags()


class RedefineGetSetParams(RegressorMixin, BaseEstimator):
    """Default is None, None stored in __init__, override set_params"""

    def __init__(self, regressor=None):
        self.regressor = regressor
        self._regressor_params = {}

    def fit(self, X, y):
        self.regressor_ = (
            DummyRegressor() if self.regressor is None else clone(self.regressor)
        )
        params = {
            k.removeprefix("regressor__"): v
            for k, v in self._regressor_params.items()
        }
        self.regressor_.set_params(**params)
        self.regressor_.fit(X, y)
        return self

    def predict(self, X):
        check_is_fitted(self, attributes=["regressor_"])
        return self.regressor_.predict(X)

    def _more_tags(self):
        return DummyRegressor()._more_tags()

    def set_params(self, **parameters):
        if self.regressor is not None:
            super().set_params(**parameters)
        for param, value in parameters.items():
            if param.startswith("regressor__"):
                self._regressor_params[param] = value
            else:
                setattr(self, param, value)
        return self

    def get_params(self, deep=True):
        return super().get_params(deep) | self._regressor_params


if __name__ == "__main__":
    OK = "\033[92mOK\033[39m"
    FAIL = "\033[91mFAIL\033[39m"

    def d(est):
        return f"\033[94m{est.__name__: >25}\033[39m"

    estimator_types = [
        DefaultNone,
        AlwaysCloneDefault,
        DefaultGlobalConstant,
        RedefineGetSetParams,
    ]

    print("""
check_estimator
---------------
""")
    for est_type in estimator_types:
        try:
            check_estimator(est_type())
            print(f"{d(est_type)}: {OK}")
        except Exception as e:
            print(f"{d(est_type)}: {FAIL} {e}")

    print("""
set param
---------
""")
    X, y = make_regression()
    for est_type in estimator_types:
        try:
            est = est_type()
            est.set_params(regressor__strategy="median")
            est.fit(X, y)
            assert est.regressor_.strategy == "median"
            print(f"{d(est_type)}: {OK}")
        except Exception as e:
            print(f"{d(est_type)}: {FAIL} {e}")
```
-
Thank you very much for this study @jeromedockes. I guess we can say that our two current contenders are …
For the sake of argument, I don't believe we should be 100% scikit-learn compatible, nor should that be our objective. So my +1 goes to …
-
BTW there is no real difference between using … The only difference between the two is how the default argument is displayed in the source code and in the rendered documentation. (The only approach that respects the "attribute is equal to default argument" scikit-learn convention is redefining `get_params`/`set_params`.)
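Assuming the comparison is between `AlwaysCloneDefault` and `DefaultGlobalConstant` from the study above, a quick way to see this:

```python
import inspect

# At runtime the two signatures even render identically, because the
# module constant's repr is "DummyRegressor()"; the difference only
# exists in the source text and in documentation generated from it.
print(inspect.signature(AlwaysCloneDefault.__init__))
print(inspect.signature(DefaultGlobalConstant.__init__))
# both print: (self, regressor=DummyRegressor())
```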
-
We've reached a decision on this and implemented it a while ago.
-
Summarizing the discussion from the IRL meeting.

The `TableVectorizer` contains several transformers which are themselves scikit-learn estimators with hyperparameters, such as its `high_card_cat_transformer`. Those can be provided by the user at initialization. Moreover, the `TableVectorizer` provides default estimators for the transformers (unlike, say, scikit-learn's `ColumnTransformer`). We would like to be able to set the parameters of the transformers in a grid search.
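A hedged sketch of the kind of usage we would like to support (the import, the pipeline setup, and the `n_components` grid are assumptions, not from the original post):

```python
from sklearn.model_selection import GridSearchCV
from dirty_cat import TableVectorizer  # the package name at the time (now skrub)

vectorizer = TableVectorizer()
# tune a hyperparameter of the default high-cardinality encoder:
grid = GridSearchCV(
    vectorizer,
    param_grid={"high_card_cat_transformer__n_components": [10, 30]},
)
# equivalently, set the nested parameter directly:
vectorizer.set_params(high_card_cat_transformer__n_components=10)
```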
At the moment, the default value for the transformers is `None`, and if the default is passed, a new transformer is created during `fit`. Therefore, after initialization but before `fit` the transformers don't yet exist as attributes of the `TableVectorizer`, so they cannot be accessed by `set_params` and therefore cannot be included in the grid search. For the example above to work, `vectorizer.high_card_cat_transformer` needs to be a `GapEncoder` when `TableVectorizer.__init__` finishes.

A couple of solutions have been proposed.
- Create the transformer in `__init__` rather than in `fit` (either starting from a `None` default, or cloning a shared default only when it is the one passed).
  Drawback: this breaks the scikit-learn convention that the estimator arguments must be stored as attributes (rather than storing copies, or clones, or something else) and will fail the scikit-learn estimator checks. The second one is probably our least bad option, and actually I'm not sure it is not allowed; see the doc and check. Ping @glemaitre.
- Use a transformer as the default value and clone it during `fit`.
  Big drawback: a user can modify the default value inadvertently, as in the sketch below:
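(A hedged illustration of the pitfall; the class and attribute names are made up.)

```python
from sklearn.dummy import DummyRegressor

class CloneAtFit:
    def __init__(self, regressor=DummyRegressor()):
        # the very same DummyRegressor instance is shared by every
        # CloneAtFit() constructed with the default argument
        self.regressor = regressor

a = CloneAtFit()
b = CloneAtFit()
a.regressor.strategy = "median"  # meant to configure only `a`...
print(b.regressor.strategy)      # "median": the shared default was mutated
```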