
Commit fd62b82: Use dicts, add more examples

1 parent f918966

4 files changed: +274, -62 lines changed

docs/source/notebooks/DataTransformers.md

Lines changed: 228 additions & 38 deletions
@@ -5,8 +5,8 @@ jupyter:
     text_representation:
       extension: .md
       format_name: markdown
-      format_version: '1.2'
-      jupytext_version: 1.9.1
+      format_version: '1.3'
+      jupytext_version: 1.10.1
   kernelspec:
     display_name: Python 3
     language: python
@@ -438,7 +438,7 @@ clf = MultiDimensionalClassifier(
 Train and score the model (this takes some time)
 
 ```python
-clf.fit(x_train, y_train)
+_ = clf.fit(x_train, y_train)
 ```
 
 ```python
@@ -448,9 +448,13 @@ print(f"Test score (accuracy): {score:.2f}")
 
 ## 5. Ragged datasets with tf.data.Dataset
 
-SciKeras provides a third dependency injection point that operates on the entire dataset: X, y & sample_weight. This `dataset_transformer` is applied after `target_transformer` and `feature_transformer`. One use case for this dependency injection point is to transform data from tabular/array-like to the `tf.data.Dataset` format, which only requires iteration. We can use this to create a `tf.data.Dataset` of ragged tensors.
+SciKeras provides a third dependency injection point that operates on the entire dataset: X, y & sample_weight.
+This `dataset_transformer` is applied after `target_transformer` and `feature_transformer`.
+One use case for this dependency injection point is to transform data from tabular/array-like to the `tf.data.Dataset` format, which only requires iteration.
+We can use this to create a `tf.data.Dataset` of ragged tensors.
 
-Note that `dataset_transformer` should accept a single **3 element tuple** as its argument and return value; more details on this are in the [docs](https://www.adriangb.com/scikeras/refs/heads/master/advanced.html#data-transformers).
+Note that `dataset_transformer` should accept a single dictionary as the argument to `transform` and `fit`, and return a single dictionary as well.
+More details on this are in the [docs](https://www.adriangb.com/scikeras/refs/heads/master/advanced.html#data-transformers).
 
 Let's start by defining our data. We'll have an extra "feature" that marks the observation index, but we'll remove it when we deconstruct our data in the transformer.
 
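For orientation, a minimal sketch of the dict contract described above (the pass-through transformer is illustrative, not part of this commit):

```python
from typing import Any, Dict


def passthrough_transformer(data: Dict[str, Any]) -> Dict[str, Any]:
    """A no-op dataset_transformer illustrating the dict contract."""
    x = data["x"]  # always present
    y = data.get("y", None)  # may be None or absent (e.g. during predict)
    sample_weight = data.get("sample_weight", None)  # likewise optional
    # ... transform x / y / sample_weight here ...
    data["x"] = x
    return data


out = passthrough_transformer({"x": [[0.0], [1.0]], "y": [0, 1]})
```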
@@ -469,47 +473,64 @@ Also note that `dataset_transformer` will _always_ be called with `X` (i.e. the
 you should check if `y` and `sample_weight` are None before doing any operations on them.
 
 ```python
-from typing import Tuple, Optional
+from typing import Dict, Any
 
 import tensorflow as tf
 
 
-def ragged_transformer(data: Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]]) -> Tuple[tf.RaggedTensor, None, None]:
-    X, y, sample_weights = data
+def ragged_transformer(data: Dict[str, Any]) -> Dict[str, Any]:
+    x, y, sample_weight = data["x"], data.get("y", None), data.get("sample_weight", None)
     if y is not None:
         y = y.reshape(-1, 1 if len(y.shape) == 1 else y.shape[1])
-        y = y[tf.RaggedTensor.from_value_rowids(y, X[:, -1]).row_starts().numpy()]
-    if sample_weights is not None:
-        sample_weights = sample_weights.reshape(-1, 1 if len(sample_weights.shape) == 1 else sample_weights.shape[1])
-        sample_weights = sample_weights[tf.RaggedTensor.from_value_rowids(sample_weights, X[:, -1]).row_starts().numpy()]
-    X = tf.RaggedTensor.from_value_rowids(X[:, :-1], X[:, -1])
-    return (X, y, sample_weights)
-```
-
-In this case, we chose to keep `y` and `sample_weights` as numpy arrays, which will allow us to re-use ClassWeightDataTransformer,
+        y = y[tf.RaggedTensor.from_value_rowids(y, x[:, -1]).row_starts().numpy()]
+    if sample_weight is not None:
+        sample_weight = sample_weight.reshape(-1, 1 if len(sample_weight.shape) == 1 else sample_weight.shape[1])
+        sample_weight = sample_weight[tf.RaggedTensor.from_value_rowids(sample_weight, x[:, -1]).row_starts().numpy()]
+    x = tf.RaggedTensor.from_value_rowids(x[:, :-1], x[:, -1])
+    data["x"] = x
+    if "y" in data:
+        data["y"] = y
+    if "sample_weight" in data:
+        data["sample_weight"] = sample_weight
+    return data
+```
+
+In this case, we chose to keep `y` and `sample_weight` as numpy arrays, which will allow us to re-use ClassWeightDataTransformer,
 the default `dataset_transformer` for `KerasClassifier`.
 
 Let's quickly test our transformer:
 
 ```python
-data = ragged_transformer((X, y, None))
-data
+data = ragged_transformer(dict(x=X, y=y, sample_weight=None))
+print(type(data["x"]))
+print(data["x"].shape)
 ```
 
+And the `y=None` case:
+
 ```python
-data = ragged_transformer((X, None, None))
-data
+data = ragged_transformer(dict(x=X, y=None, sample_weight=None))
+print(type(data["x"]))
+print(data["x"].shape)
 ```
 
-Our shapes look good, and we can handle the `y=None` case.
+Everything looks good!
 
 Because Keras will not accept a RaggedTensor directly, we will need to wrap our entire dataset into a tensorflow `Dataset`. We can do this by adding one more transformation step:
 
 Next, we can add our transformers to our model. We use an sklearn `Pipeline` (generated via `make_pipeline`) to keep ClassWeightDataTransformer operational while implementing our custom transformation.
 
 ```python
-def dataset_transformer(data: Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]]) -> Tuple[tf.data.Dataset, None, None]:
-    return (tf.data.Dataset.from_tensor_slices(data), None, None)
+def dataset_transformer(data: Dict[str, Any]) -> Dict[str, Any]:
+    x_y_s = data["x"], data.get("y", None), data.get("sample_weight", None)
+    data["x"] = tf.data.Dataset.from_tensor_slices(x_y_s)
+    # don't blindly assign y & sample_weight; if called from predict they
+    # should not just be None, they should not be present at all!
+    if "y" in data:
+        data["y"] = None
+    if "sample_weight" in data:
+        data["sample_weight"] = None
+    return data
 ```
 
 ```python
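The pipeline wiring itself falls outside this hunk. A sketch of how the two functions above might be combined while keeping ClassWeightDataTransformer operational, assuming sklearn's `FunctionTransformer` and the `ClassWeightDataTransformer` from `scikeras.utils.transformers` (the classifier name and step ordering are illustrative):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

from scikeras.utils.transformers import ClassWeightDataTransformer
from scikeras.wrappers import KerasClassifier


class RaggedClassifier(KerasClassifier):

    @property
    def dataset_transformer(self):
        # Deconstruct to ragged tensors first, let the default class_weight
        # handling run, then wrap everything into a tf.data.Dataset last.
        return make_pipeline(
            FunctionTransformer(ragged_transformer),
            ClassWeightDataTransformer(class_weight=self.class_weight),
            FunctionTransformer(dataset_transformer),
        )
```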
@@ -603,7 +624,8 @@ y_pred
 
 ## 6. Multi-output class_weight
 
-In this example, we will use `dataset_transformer` to support multi-output class weights. We will re-use our `MultiOutputTransformer` from our previous example to split the output, then we will create `sample_weights` from `class_weight`
+In this example, we will use `dataset_transformer` to support multi-output class weights.
+We will re-use our `MultiOutputTransformer` from our previous example to split the output, then we will create `sample_weight` from `class_weight`.
 
 ```python
 from collections import defaultdict
@@ -614,36 +636,36 @@ from sklearn.utils.class_weight import compute_sample_weight
 
 class DatasetTransformer(BaseEstimator, TransformerMixin):
 
-    def __init__(self, output_names, class_weight=None):
-        self.class_weight = class_weight
+    def __init__(self, output_names):
         self.output_names = output_names
 
-    def fit(self, data: Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]]) -> "DatasetTransformer":
+    def fit(self, data: Dict[str, Any]) -> "DatasetTransformer":
         return self
 
-    def transform(self, data: Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]]) -> Tuple[np.ndarray, Union[np.ndarray, None], Union[np.ndarray, None]]:
-        if self.class_weight is None:
+    def transform(self, data: Dict[str, Any]) -> Dict[str, Any]:
+        class_weight = data.get("class_weight", None)
+        if class_weight is None:
             return data
-        class_weight = self.class_weight
         if isinstance(class_weight, str):  # handle "balanced"
             class_weight_ = class_weight
             class_weight = defaultdict(lambda: class_weight_)
-        X, y, sample_weights = data
-        assert sample_weights is None, "Cannot use class_weight & sample_weights together"
+        y, sample_weight = data.get("y", None), data.get("sample_weight", None)
+        assert sample_weight is None, "Cannot use class_weight & sample_weight together"
         if y is not None:
             # y should be a list of arrays, as split up by MultiOutputTransformer
-            sample_weights = {
+            sample_weight = {
                 output_name: compute_sample_weight(class_weight[output_num], output_data)
                 for output_num, (output_name, output_data) in enumerate(zip(self.output_names, y))
             }
             # Note: class_weight is expected to be indexable by output_number in sklearn
             # see https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_sample_weight.html
             # It is trivial to change the expected format to match Keras' ({output_name: weights, ...})
             # see https://github.com/keras-team/keras/issues/4735#issuecomment-267473722
-        return X, y, sample_weights
-```
+            data["sample_weight"] = sample_weight
+        data["class_weight"] = None
+        return data
+
 
-```python
 def get_model(meta, compile_kwargs):
     inp = keras.layers.Input(shape=(meta["n_features_in_"],))
     x1 = keras.layers.Dense(100, activation="relu")(inp)
@@ -667,7 +689,6 @@ class CustomClassifier(KerasClassifier):
     def dataset_transformer(self):
         return DatasetTransformer(
             output_names=self.model_.output_names,
-            class_weight=self.class_weight
         )
 ```
 
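For reference, a standalone illustration of the per-output `sample_weight` dict built by `DatasetTransformer.transform` above (the output names here are hypothetical):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Two outputs, as if already split up by MultiOutputTransformer:
y_split = [np.array([0, 0, 0, 1]), np.array([0, 1, 2, 2])]
output_names = ["out_bin", "out_cat"]  # hypothetical Keras output names

# "balanced" weights each sample by n_samples / (n_classes * n_samples_in_class)
sample_weight = {
    name: compute_sample_weight("balanced", y_out)
    for name, y_out in zip(output_names, y_split)
}
print(sample_weight)
# {'out_bin': array([0.667, 0.667, 0.667, 2.]),
#  'out_cat': array([1.333, 1.333, 0.667, 0.667])}
```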
@@ -731,3 +752,172 @@ print(counts_bin)
 (_, counts_cat) = np.unique(y_pred[:, 1], return_counts=True)
 print(counts_cat)
 ```
+
+## 7. Custom validation dataset
+
+Although `dataset_transformer` is primarily designed for data transformations, because its return value is passed to `fit` as valid `**kwargs`, it can be used for other advanced use cases.
+In this example, we use `dataset_transformer` to implement a custom train/validation split for Keras' internal validation. We'll use sklearn's
+`train_test_split`, but this could be implemented via an arbitrary user function, e.g. to ensure a balanced class distribution.
+
+```python
+from sklearn.model_selection import train_test_split
+
+
+def get_clf(meta: Dict[str, Any]):
+    inp = keras.layers.Input(shape=(meta["n_features_in_"],))
+    x1 = keras.layers.Dense(100, activation="relu")(inp)
+    out = keras.layers.Dense(1, activation="sigmoid")(x1)
+    return keras.Model(inputs=inp, outputs=out)
+
+
+class CustomSplit(BaseEstimator, TransformerMixin):
+
+    def __init__(self, test_size: float):
+        self.test_size = test_size
+
+    def fit(self, data: Dict[str, Any]) -> "CustomSplit":
+        return self
+
+    def transform(self, data: Dict[str, Any]) -> Dict[str, Any]:
+        if self.test_size == 0:
+            return data
+        x, y, sw = data["x"], data.get("y", None), data.get("sample_weight", None)
+        if y is None:
+            return data
+        if sw is None:
+            x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=self.test_size, stratify=y)
+            validation_data = (x_val, y_val)
+            sw_train = None
+        else:
+            x_train, x_val, y_train, y_val, sw_train, sw_val = train_test_split(x, y, sw, test_size=self.test_size, stratify=y)
+            validation_data = (x_val, y_val, sw_val)
+        data["validation_data"] = validation_data
+        data["x"], data["y"], data["sample_weight"] = x_train, y_train, sw_train
+        return data
+
+
+class CustomClassifier(KerasClassifier):
+
+    @property
+    def dataset_transformer(self):
+        return CustomSplit(test_size=self.validation_split)
+```
+
+And now let's test with a toy dataset. We specifically choose string targets to show
+that with this approach, we can preserve all of the nice data pre-processing that SciKeras does
+for us, while still being able to split the final data before passing it to Keras.
+
+```python
+y = np.array(["a"] * 900 + ["b"] * 100)
+X = np.array([0] * 900 + [1] * 100).reshape(-1, 1)
+```
+
+To get a baseline measurement to compare against, we'll first run the plain KerasClassifier as a benchmark.
+
+```python
+clf = KerasClassifier(
+    get_clf,
+    loss="bce",
+    metrics=["binary_accuracy"],
+    verbose=False,
+    validation_split=0.1,
+    shuffle=False,
+    random_state=0,
+    epochs=10
+)
+
+clf.fit(X, y)
+print(f"binary_accuracy = {clf.history_['binary_accuracy'][-1]}")
+print(f"val_binary_accuracy = {clf.history_['val_binary_accuracy'][-1]}")
+```
+
+We see that we get near-zero validation accuracy. Because one of our classes was only found in the tail end of our dataset and we specified `validation_split=0.1`, we validated with a class we had never seen before.
+
+We could specify `shuffle=True` (this is actually the default), but for highly imbalanced classes, this may not be as good as stratified splitting.
+
+So let's test our new `CustomClassifier`.
+
+```python
+clf = CustomClassifier(
+    get_clf,
+    loss="bce",
+    metrics=["binary_accuracy"],
+    verbose=False,
+    validation_split=0.1,
+    shuffle=False,
+    random_state=0,
+    epochs=10
+)
+
+clf.fit(X, y)
+print(f"binary_accuracy = {clf.history_['binary_accuracy'][-1]}")
+print(f"val_binary_accuracy = {clf.history_['val_binary_accuracy'][-1]}")
+```
+
+Much better!
+
+
+## 8. Dynamically setting batch_size
+
+
+In this example, we use the `dataset_transformer` interface to implement a dynamic batch_size, similar to sklearn's [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html). We will implement `batch_size` as `batch_size=min(200, n_samples)`.
+
+```python
+def check_batch_size(x):
+    """Check the batch_size used in training."""
+    bs = x.shape[0]
+    if bs is not None:
+        print(f"batch_size={bs}")
+    return x
+
+
+def get_clf(meta: Dict[str, Any]):
+    inp = keras.layers.Input(shape=(meta["n_features_in_"],))
+    x1 = keras.layers.Dense(100, activation="relu")(inp)
+    x2 = keras.layers.Lambda(check_batch_size)(x1)
+    out = keras.layers.Dense(1, activation="sigmoid")(x2)
+    return keras.Model(inputs=inp, outputs=out)
+
+
+class DynamicBatch(BaseEstimator, TransformerMixin):
+
+    def fit(self, data: Dict[str, Any]) -> "DynamicBatch":
+        return self
+
+    def transform(self, data: Dict[str, Any]) -> Dict[str, Any]:
+        n_samples = data["x"].shape[0]
+        data["batch_size"] = min(200, n_samples)
+        return data
+
+
+class DynamicBatchClassifier(KerasClassifier):
+
+    @property
+    def dataset_transformer(self):
+        return DynamicBatch()
+```
+
+Since this is happening inside SciKeras, this will work even if we are doing cross validation (which adjusts the split according to `cv`).
+
+```python
+from sklearn.model_selection import cross_val_score
+
+clf = DynamicBatchClassifier(
+    get_clf,
+    loss="bce",
+    verbose=False,
+    random_state=0
+)
+
+_ = cross_val_score(clf, X, y, cv=6)  # note: 1000 / 6 ≈ 167
+```
+
+But if we train with larger inputs, we can hit the cap of 200 we set:
+
+```python
+_ = cross_val_score(clf, X, y, cv=5)
+```
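A plain `fit` on the full toy dataset shows the cap directly (a sketch, assuming the `X`, `y` and `DynamicBatchClassifier` defined in the hunk above are in scope):

```python
# n_samples is 1000 here, so DynamicBatch sets batch_size = min(200, 1000) = 200.
clf = DynamicBatchClassifier(get_clf, loss="bce", verbose=False, random_state=0)
_ = clf.fit(X, y)
```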

scikeras/utils/transformers.py

Lines changed: 6 additions & 9 deletions
@@ -408,18 +408,15 @@ def __init__(self, class_weight: Optional[Union[str, Dict[int, float]]] = None):
         self.class_weight = class_weight
 
     def fit(
-        self,
-        data: Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]],
-        dummy: None = None,
+        self, data: Dict[str, Any], dummy: None = None
     ) -> "ClassWeightDataTransformer":
         return self
 
-    def transform(
-        self, data: Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]]
-    ) -> Tuple[np.ndarray, Union[np.ndarray, None], Union[np.ndarray, None]]:
-        X, y, sample_weight = data
+    def transform(self, data: Dict[str, Any]) -> Dict[str, Any]:
+        y, sample_weight = data.get("y", None), data.get("sample_weight", None)
         if self.class_weight is None or y is None:
-            return (X, y, sample_weight)
+            return data
         sample_weight = 1 if sample_weight is None else sample_weight
         sample_weight *= compute_sample_weight(class_weight=self.class_weight, y=y)
-        return (X, y, sample_weight)
+        data["sample_weight"] = sample_weight
+        return data
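A minimal sketch of the refactored dict-in/dict-out behaviour (illustrative usage, not part of the commit):

```python
import numpy as np

from scikeras.utils.transformers import ClassWeightDataTransformer

cwdt = ClassWeightDataTransformer(class_weight="balanced")
data = {"x": np.zeros((4, 1)), "y": np.array([0, 0, 0, 1])}
out = cwdt.fit(data).transform(data)
print(out["sample_weight"])
# [0.667 0.667 0.667 2.] -- "x" and "y" pass through untouched
```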
