50 commits
1d718ff
ENH: add dependancy injection point to transform X & y together
adriangb Jan 16, 2021
1a6037e
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 16, 2021
28535b2
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 19, 2021
c170f4b
Extend data transformer notebook with examples of data_transformer usage
adriangb Jan 21, 2021
cd5f415
Merge branch 'whole-dataset-transformer' of https://github.com/adrian…
adriangb Jan 21, 2021
b7fb34c
run entire notebook
adriangb Jan 21, 2021
d3357e4
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 21, 2021
bc92cff
Update docstring
adriangb Jan 22, 2021
45887e4
Merge branch 'whole-dataset-transformer' of https://github.com/adrian…
adriangb Jan 22, 2021
5b8e133
typo
adriangb Jan 22, 2021
2f2b7a5
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 24, 2021
6ee6425
Test pipeline, move notebook to markdown
adriangb Jan 24, 2021
a3092c2
fix undef transformer
adriangb Jan 24, 2021
8f92591
remove unused dummy transformer
adriangb Jan 24, 2021
fa728c1
Remove unused import
adriangb Jan 24, 2021
6fdea0d
remove empty cell
adriangb Jan 24, 2021
8aba7cb
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 24, 2021
6675889
Fix typos
adriangb Jan 24, 2021
c317699
Merge branch 'whole-dataset-transformer' of https://github.com/adrian…
adriangb Jan 24, 2021
5acbd0f
add comment
adriangb Jan 24, 2021
5d9e02b
print all data
adriangb Jan 24, 2021
9b43e9c
Update data transformer docs
adriangb Jan 24, 2021
0fbecd0
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 24, 2021
deb4858
Finish sentence
adriangb Jan 24, 2021
981e61c
PR feedback
adriangb Jan 25, 2021
0d55306
fix error
adriangb Jan 25, 2021
3cf1ed5
use embedded links, ref links seem to be broken
adriangb Jan 25, 2021
a198eb3
spacing
adriangb Jan 25, 2021
047d430
fix code block
adriangb Jan 25, 2021
5413015
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 27, 2021
e71625e
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 27, 2021
f4c0dcc
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 27, 2021
54cfc43
PR feedback
adriangb Jan 27, 2021
1742ef4
Merge branch 'whole-dataset-transformer' of https://github.com/adrian…
adriangb Jan 27, 2021
d03248f
use code block for signature
adriangb Jan 27, 2021
87452ff
remove dummy parameter
adriangb Jan 27, 2021
d2b4402
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 28, 2021
034fc7f
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 29, 2021
491e0b1
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 31, 2021
3f8f9b4
re-add dummy
adriangb Jan 31, 2021
f569b48
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 31, 2021
8dafe1b
Merge branch 'whole-dataset-transformer' of https://github.com/adrian…
adriangb Jan 31, 2021
f918966
Merge master
adriangb Feb 16, 2021
fd62b82
Use dicts, add more examples
adriangb Feb 16, 2021
6eee3c4
fix broken test
adriangb Feb 16, 2021
5ca7da8
update docs
adriangb Feb 16, 2021
5bd222e
add clarifying comment in docs
adriangb Feb 16, 2021
f560687
update TOC
adriangb Feb 16, 2021
2353408
Merge branch 'master' into whole-dataset-transformer
adriangb Feb 16, 2021
f5df4c4
Merge branch 'master' into whole-dataset-transformer
adriangb Feb 20, 2021
154 changes: 140 additions & 14 deletions docs/source/advanced.rst
@@ -178,26 +178,114 @@ This is basically the same as calling :py:func:`~scikeras.wrappers.BaseWrapper.g
Data Transformers
^^^^^^^^^^^^^^^^^

Keras supports a much wider range of inputs/outputs than Scikit-Learn does. E.g.,
in a text classification task, you might have an array that contains
the integers representing the tokens for each sample, and another
array containing the number of tokens of each sample.

In order to reconcile Keras' expanded input/output support and Scikit-Learn's more
limited options, SciKeras introduces "data transformers". These are really just
dependency injection points where you can declare custom data transformations,
for example to split an array into a list of arrays, join ``X`` & ``y`` into a ``Dataset``, etc.
In order to keep these transformations in a familiar format, they are implemented as
sklearn-style transformers. You can think of this setup as an sklearn Pipeline:

.. code-block::

                                      ↗ feature_encoder ↘
    SciKeras.fit(features, labels)                        dataset_transformer → keras.Model.fit(data)
                                      ↘ target_encoder  ↗


Within SciKeras, this is roughly implemented as follows:

.. code:: python

    class PseudoBaseWrapper:

        def fit(self, X, y, sample_weight):
            self.feature_encoder_ = self.feature_encoder.fit(X)
            X = self.feature_encoder_.transform(X)
            self.target_encoder_ = self.target_encoder.fit(y)
            y = self.target_encoder_.transform(y)
            self.model_ = self._build_keras_model()
            fit_kwargs = dict(x=X, y=y, sample_weight=sample_weight)
            self.dataset_transformer_ = self.dataset_transformer.fit(fit_kwargs)
            fit_kwargs = self.dataset_transformer_.transform(fit_kwargs)
            self.model_.fit(**fit_kwargs)  # tf.keras.Model.fit
            return self

        def predict(self, X):
            X = self.feature_encoder_.transform(X)
            predict_kwargs = dict(x=X)
            predict_kwargs = self.dataset_transformer_.fit_transform(predict_kwargs)
            y_pred = self.model_.predict(**predict_kwargs)
            return self.target_encoder_.inverse_transform(y_pred)


``dataset_transformer`` is the last step before passing the data to Keras, and it allows for the greatest
degree of customization because SciKeras does not make any assumptions about the output data
and passes it directly to :py:func:`tensorflow.keras.Model.fit`.

It accepts a dict of valid Keras ``**kwargs`` and is expected to return a dict
of valid Keras ``**kwargs``:

.. code:: python

    from typing import Any, Dict

    from sklearn.base import BaseEstimator, TransformerMixin

    class DatasetTransformer(BaseEstimator, TransformerMixin):
        def fit(self, data: Dict[str, Any]) -> "DatasetTransformer":
            # when called from fit, data contains x, y and sample_weight;
            # when called from predict, it contains only x and predict kwargs
            assert "x" in data  # "x" is always a key
            ...
            return self

        def transform(self, data: Dict[str, Any]) -> Dict[str, Any]:
            # return a valid input for keras.Model.fit
            # data includes x, y, sample_weight
            assert "x" in data  # "x" is always a key
            if "y" in data:
                ...  # called from fit
            else:
                ...  # called from predict
            # other Model.fit or Model.predict arguments are present as well
            assert "batch_size" in data
            ...
            return data


You can modify ``data`` in-place within ``transform``, but you **must** still
return it.

When called from ``fit`` or ``initialize``, you will get and return keys that are valid
``**kwargs`` to ``tf.keras.Model.fit``. When called from ``predict`` or ``score``,
you will get and return keys that are valid ``**kwargs`` to ``tf.keras.Model.predict``.
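As a concrete illustration, below is a minimal sketch of a ``dataset_transformer`` that runs without Keras at all, since it operates purely on the kwargs dict. The class name ``FullBatchSizer`` and the idea of forcing full-batch training are illustrative assumptions, not part of SciKeras:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class FullBatchSizer(BaseEstimator, TransformerMixin):
    """Illustrative dataset_transformer: set batch_size to the dataset size."""

    def fit(self, data):
        # nothing to learn here, but a transformer must still return self
        return self

    def transform(self, data):
        # "x" is present whether we are called from fit or from predict
        data["batch_size"] = len(data["x"])  # full-batch training/inference
        return data  # modified in place, but returned as required

sizer = FullBatchSizer()
kwargs = sizer.fit_transform({"x": [[0.0], [1.0], [2.0]], "y": [0, 1, 1]})
print(kwargs["batch_size"])  # prints 3
```

Because the transformer only touches the kwargs dict, the same instance works for both ``fit`` and ``predict`` calls.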

Although you could implement *all* data transformations in a single ``dataset_transformer``,
having several distinct dependency injection points allows for more modularity,
for example keeping the default processing of string-encoded labels while converting
the data to a :py:class:`tensorflow.data.Dataset` before passing it to Keras.

For complete examples implementing custom data processing, see the examples in the
:ref:`tutorials` section.

Multi-input and output models via feature_encoder and target_encoder
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Scikit-Learn natively supports multiple outputs, although it technically
requires them to be arrays of equal length
(see docs for Scikit-Learn's :py:class:`~sklearn.multioutput.MultiOutputClassifier`).
Scikit-Learn has no support for multiple inputs.
To work around this issue, SciKeras implements a data conversion
abstraction in the form of Scikit-Learn style transformers,
one for ``X`` (features) and one for ``y`` (target). These are implemented
via :py:func:`scikeras.wrappers.BaseWrapper.feature_encoder` and
:py:func:`scikeras.wrappers.BaseWrapper.target_encoder` respectively.

To override the default transformers, simply override
:py:func:`scikeras.wrappers.BaseWrapper.target_encoder` or
:py:func:`scikeras.wrappers.BaseWrapper.feature_encoder` for ``y`` and ``X`` respectively.
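For the common case of splitting a single ``X`` into multiple Keras inputs, a plain :py:class:`sklearn.preprocessing.FunctionTransformer` is often all a ``feature_encoder`` needs. The column split below (two columns per input) and the function name ``split_features`` are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def split_features(X):
    # columns 0-1 feed the first Keras input, columns 2-3 the second
    return [X[:, :2], X[:, 2:]]

feature_encoder = FunctionTransformer(split_features)

X = np.arange(12, dtype=float).reshape(3, 4)
parts = feature_encoder.fit_transform(X)
print(parts[0].shape, parts[1].shape)  # prints (3, 2) (3, 2)
```

In a wrapper subclass, this transformer would be returned from the ``feature_encoder`` property, mirroring the ``target_encoder`` override pattern shown elsewhere in this document.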

By default, SciKeras uses :py:func:`sklearn.utils.multiclass.type_of_target` to categorize the target
type, and implements basic handling of the following cases out of the box:

+--------------------------+--------------+----------------+----------------+---------------+
@@ -208,11 +296,11 @@
+--------------------------+--------------+----------------+----------------+---------------+
| "binary" | [1, 0, 1] | 1 | 1 or 2 | Yes |
+--------------------------+--------------+----------------+----------------+---------------+
| "multilabel-indicator" | [[1, 1], | 1 or >1 | 2 per target | Single output |
| | | | | |
| | [0, 1], | | | only |
| | | | | |
| | [1, 0]] | | | |
+--------------------------+--------------+----------------+----------------+---------------+
| "multiclass-multioutput" | [[1, 1], | >1 | >=2 per target | No |
| | | | | |
@@ -229,11 +317,47 @@
| | [.2, .9]] | | | |
+--------------------------+--------------+----------------+----------------+---------------+

The supported cases are handled by the default implementation of ``target_encoder``.
The default implementations are available for use as :py:class:`scikeras.utils.transformers.ClassifierLabelEncoder`
and :py:class:`scikeras.utils.transformers.RegressorTargetEncoder` for
:py:class:`scikeras.wrappers.KerasClassifier` and :py:class:`scikeras.wrappers.KerasRegressor` respectively.

As per the table above, if you find that your target is classified as
``"multiclass-multioutput"`` or ``"unknown"``, you will have to implement your own data processing routine.

get_metadata method
+++++++++++++++++++

In addition to converting data, ``feature_encoder`` and ``target_encoder`` allow you to inject data
into your model construction method. This is useful if, for example, you use ``target_encoder`` to dynamically
determine how many outputs your model should have based on the data, and then use this information to
assign the right number of outputs in your Model. To return data from ``feature_encoder`` or ``target_encoder``,
you will need to provide a transformer with a ``get_metadata`` method, which is expected to return a dictionary
that will be injected into your model-building function via the ``meta`` parameter.

For example, if you wanted to create a calculated parameter called ``my_param_``:

.. code-block:: python

    class MultiOutputTransformer(BaseEstimator, TransformerMixin):
        def get_metadata(self):
            return {"my_param_": "foobarbaz"}

    class MultiOutputClassifier(KerasClassifier):

        @property
        def target_encoder(self):
            return MultiOutputTransformer(...)

    def get_model(meta):
        print(f"Got: {meta['my_param_']}")

    clf = MultiOutputClassifier(model=get_model)
    clf.fit(X, y)  # prints 'Got: foobarbaz'
    print(clf.my_param_)  # prints 'foobarbaz'

Note that it is best practice to end your parameter names with a single underscore,
which allows sklearn to know which parameters are stateful and which are stateless.

Routed parameters
-----------------
@@ -282,6 +406,8 @@ Custom Scorers
SciKeras uses :func:`sklearn.metrics.accuracy_score` and :func:`sklearn.metrics.r2_score`
as the scoring functions for :class:`scikeras.wrappers.KerasClassifier`
and :class:`scikeras.wrappers.KerasRegressor` respectively. To override these scoring functions,
override :func:`scikeras.wrappers.KerasClassifier.scorer`
or :func:`scikeras.wrappers.KerasRegressor.scorer`.
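A scorer is a plain function following the sklearn convention that greater is better. As an illustrative sketch, the standalone function below negates median absolute error so it can serve as a scorer override; the choice of metric here is an assumption for demonstration, not a SciKeras default:

```python
from sklearn.metrics import median_absolute_error

def scorer(y_true, y_pred, **kwargs):
    # negate so that larger values mean better fits (sklearn convention)
    return -median_absolute_error(y_true, y_pred)

print(scorer([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # prints -1.0
```

In a wrapper subclass, such a function would typically be attached as a ``staticmethod`` named ``scorer``.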


.. _Keras Callbacks docs: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks