Setting sub-estimator parameters (e.g. in GridSearchCV) #796
-
Proposal by @GaelVaroquaux during the last meeting: set the default values as class constants and only clone if the transformer corresponds to this constant.
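For concreteness, a minimal sketch of that proposal (the class name and the `DummyRegressor` default are illustrative, not from the thread):

```python
from sklearn.base import clone
from sklearn.dummy import DummyRegressor

class WithConstantDefault:
    # the default value, stored as a class constant
    _DEFAULT_REGRESSOR = DummyRegressor()

    def __init__(self, regressor=_DEFAULT_REGRESSOR):
        # clone only when the shared default itself was passed;
        # user-provided estimators are stored untouched
        if regressor is self._DEFAULT_REGRESSOR:
            regressor = clone(regressor)
        self.regressor = regressor
```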
-
Olivier Grisel gave another option: redefine `get_params` and `set_params`.
-
> Proposal by @GaelVaroquaux during the last meeting: set the default values as class constants

I would actually go for module-level constants (we don't want to clutter the class and the corresponding instances with them). And also, write them in ALL_CAPS (they are constants).
-
> Olivier Grisel gave a probably better option: redefine `get_params` and `set_params`

I don't like it. I would rather touch `get_params`/`set_params` as little as possible.
-
What is the problem with the first solution? For the second solution, if a user changes an attribute on the passed transformer after `__init__`, the changes won't affect the …
-
Of the options we discussed, only overriding …

```python
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.dummy import DummyRegressor
from sklearn.utils.estimator_checks import check_estimator
from sklearn.utils.validation import check_is_fitted
from sklearn.datasets import make_regression


class DefaultNone(RegressorMixin, BaseEstimator):
    """Default is None, create in __init__"""

    def __init__(self, regressor=None):
        self.regressor = regressor if regressor is not None else DummyRegressor()

    def fit(self, X, y):
        self.regressor_ = clone(self.regressor).fit(X, y)
        return self

    def predict(self, X):
        check_is_fitted(self, attributes=["regressor_"])
        return self.regressor_.predict(X)

    def _more_tags(self):
        return self.regressor._more_tags()


class AlwaysCloneDefault(RegressorMixin, BaseEstimator):
    """Default is an estimator, clone in __init__"""

    def __init__(self, regressor=DummyRegressor()):
        self.regressor = clone(regressor)

    def fit(self, X, y):
        self.regressor_ = clone(self.regressor).fit(X, y)
        return self

    def predict(self, X):
        check_is_fitted(self, attributes=["regressor_"])
        return self.regressor_.predict(X)

    def _more_tags(self):
        return self.regressor._more_tags()


DEFAULT_REGRESSOR = DummyRegressor()


class DefaultGlobalConstant(RegressorMixin, BaseEstimator):
    """Default is an estimator, clone in __init__ if default is passed"""

    def __init__(self, regressor=DEFAULT_REGRESSOR):
        self.regressor = (
            clone(regressor) if regressor is DEFAULT_REGRESSOR else regressor
        )

    def fit(self, X, y):
        self.regressor_ = clone(self.regressor).fit(X, y)
        return self

    def predict(self, X):
        check_is_fitted(self, attributes=["regressor_"])
        return self.regressor_.predict(X)

    def _more_tags(self):
        return self.regressor._more_tags()


class RedefineGetSetParams(RegressorMixin, BaseEstimator):
    """Default is None, None stored in __init__, override set_params"""

    def __init__(self, regressor=None):
        self.regressor = regressor
        self._regressor_params = {}

    def fit(self, X, y):
        self.regressor_ = (
            DummyRegressor() if self.regressor is None else clone(self.regressor)
        )
        params = {
            k.removeprefix("regressor__"): v
            for k, v in self._regressor_params.items()
        }
        self.regressor_.set_params(**params)
        self.regressor_.fit(X, y)
        return self

    def predict(self, X):
        check_is_fitted(self, attributes=["regressor_"])
        return self.regressor_.predict(X)

    def _more_tags(self):
        return DummyRegressor()._more_tags()

    def set_params(self, **parameters):
        if self.regressor is not None:
            super().set_params(**parameters)
        for param, value in parameters.items():
            if param.startswith("regressor__"):
                self._regressor_params[param] = value
            else:
                setattr(self, param, value)
        return self

    def get_params(self, deep=True):
        return super().get_params(deep) | self._regressor_params


if __name__ == "__main__":
    OK = "\033[92mOK\033[39m"
    FAIL = "\033[91mFAIL\033[39m"

    def d(est):
        return f"\033[94m{est.__name__: >25}\033[39m"

    estimator_types = [
        DefaultNone,
        AlwaysCloneDefault,
        DefaultGlobalConstant,
        RedefineGetSetParams,
    ]

    print("""
check_estimator
---------------
""")
    for est_type in estimator_types:
        try:
            check_estimator(est_type())
            print(f"{d(est_type)}: {OK}")
        except Exception as e:
            print(f"{d(est_type)}: {FAIL} {e}")

    print("""
set param
---------
""")
    X, y = make_regression()
    for est_type in estimator_types:
        try:
            est = est_type()
            est.set_params(regressor__strategy="median")
            est.fit(X, y)
            assert est.regressor_.strategy == "median"
            print(f"{d(est_type)}: {OK}")
        except Exception as e:
            print(f"{d(est_type)}: {FAIL} {e}")
```
-
Thank you very much for this study @jeromedockes. I guess we can say that our two current contenders are …
For the sake of argument, I don't believe we should be 100% scikit-learn compatible, nor should that be our objective. So my +1 goes to …
-
BTW there is no real difference between using … The only difference between the two is how the default argument is displayed in the source code and in the rendered documentation. (The only approach that respects the "attribute is equal to default argument" scikit-learn convention is redefining `get_params`/`set_params`.)
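Assuming the comparison is between `AlwaysCloneDefault` and `DefaultGlobalConstant` from the study above, a quick way to see this:

```python
import inspect

# At runtime the two signatures even render identically, because the
# module constant's repr is "DummyRegressor()"; the difference only
# exists in the source text and in documentation generated from it.
print(inspect.signature(AlwaysCloneDefault.__init__))
print(inspect.signature(DefaultGlobalConstant.__init__))
# both print: (self, regressor=DummyRegressor())
```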
-
We've reached a decision on this and implemented it a while ago.
-
Summarizing the discussion from the IRL meeting.

The `TableVectorizer` contains several transformers which are themselves scikit-learn estimators with hyperparameters, such as its `high_card_cat_transformer`. Those can be provided by the user at initialization. Moreover, the `TableVectorizer` provides default estimators for the transformers (unlike, say, scikit-learn's `ColumnTransformer`). We would like to be able to set the parameters of the transformers in a grid search.
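A hedged sketch of the kind of usage we would like to support (the import, the pipeline setup, and the `n_components` grid are assumptions, not from the original post):

```python
from sklearn.model_selection import GridSearchCV
from dirty_cat import TableVectorizer  # the package name at the time (now skrub)

vectorizer = TableVectorizer()
# tune a hyperparameter of the default high-cardinality encoder:
grid = GridSearchCV(
    vectorizer,
    param_grid={"high_card_cat_transformer__n_components": [10, 30]},
)
# equivalently, set the nested parameter directly:
vectorizer.set_params(high_card_cat_transformer__n_components=10)
```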
At the moment, the default value for the transformers is `None`, and if the default is passed, a new transformer is created during `fit`. Therefore, after initialization but before `fit` the transformers don't yet exist as attributes of the `TableVectorizer`, so they cannot be accessed by `set_params` and therefore cannot be included in the grid search. For the example above to work, `vectorizer.high_card_cat_transformer` needs to be a `GapEncoder` when `TableVectorizer.__init__` finishes.

A couple of solutions have been proposed.
- Create the transformer in `__init__` rather than in `fit` (either starting from a `None` default, or cloning a shared default only when it is the one passed).
  Drawback: this breaks the scikit-learn convention that the estimator arguments must be stored as attributes (rather than storing copies, or clones, or something else) and will fail the scikit-learn estimator checks. The second one is probably our least bad option, and actually I'm not sure it is not allowed; see the doc and check. Ping @glemaitre.
- Use a transformer as the default value and clone it during `fit`.
  Big drawback: a user can modify the default value inadvertently, as in the sketch below:
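(A hedged illustration of the pitfall; the class and attribute names are made up.)

```python
from sklearn.dummy import DummyRegressor

class CloneAtFit:
    def __init__(self, regressor=DummyRegressor()):
        # the very same DummyRegressor instance is shared by every
        # CloneAtFit() constructed with the default argument
        self.regressor = regressor

a = CloneAtFit()
b = CloneAtFit()
a.regressor.strategy = "median"  # meant to configure only `a`...
print(b.regressor.strategy)      # "median": the shared default was mutated
```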