SciKeras provides a third dependency injection point that operates on the entire dataset: X, y & sample_weight.
This `dataset_transformer` is applied after `target_transformer` and `feature_transformer`.
One use case for this dependency injection point is to transform data from tabular/array-like to the `tf.data.Dataset` format, which only requires iteration.
We can use this to create a `tf.data.Dataset` of ragged tensors.

Note that `dataset_transformer` should accept a single dictionary as its argument to `transform` and `fit`, and return a single dictionary as well.
More details on this are in the [docs](https://www.adriangb.com/scikeras/refs/heads/master/advanced.html#data-transformers).
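For orientation, a do-nothing `dataset_transformer` (the class name here is illustrative) would accept and return that dictionary unchanged:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class NoOpDatasetTransformer(BaseEstimator, TransformerMixin):
    def fit(self, data):
        # `data` is a dictionary with "x", "y" & "sample_weight" keys.
        return self

    def transform(self, data):
        # Return the whole dataset untouched.
        return data
```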
Let's start by defining our data. We'll have an extra "feature" that marks the observation index, but we'll remove it when we deconstruct our data in the transformer.

Also note that `dataset_transformer` will _always_ be called with `X` (i.e. the `"x"` key will always be present in the dictionary), but `y` and `sample_weight` may be `None`;
you should check if `y` and `sample_weight` are None before doing any operations on them.

```python
import tensorflow as tf


def ragged_transformer(data):
    x, y, sample_weight = data["x"], data.get("y"), data.get("sample_weight")
    # Split off the last column (the observation index) and use it as the
    # ragged row ids, giving one ragged row per observation.
    x = tf.RaggedTensor.from_value_rowids(x[:, :-1], x[:, -1])
    data["x"] = x
    # Only assign keys that are already present: when called from `predict`,
    # "y" and "sample_weight" are not in the dictionary at all.
    if "y" in data:
        data["y"] = y
    if "sample_weight" in data:
        data["sample_weight"] = sample_weight
    return data
```
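If `from_value_rowids` is unfamiliar, here is a tiny standalone illustration with made-up values:

```python
import tensorflow as tf

# Rows of `values` are grouped by their row id: one ragged row per id.
rt = tf.RaggedTensor.from_value_rowids(
    values=[[1.0], [2.0], [3.0]], value_rowids=[0, 0, 1]
)
print(rt)  # <tf.RaggedTensor [[[1.0], [2.0]], [[3.0]]]>
```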

In this case, we chose to keep `y` and `sample_weight` as numpy arrays, which will allow us to re-use ClassWeightDataTransformer, the default `dataset_transformer` for `KerasClassifier`.

Let's quickly test our transformer:
```python
data = ragged_transformer(dict(x=X, y=y, sample_weight=None))
print(type(data["x"]))
print(data["x"].shape)
```
And the `y=None` case:
```python
data = ragged_transformer(dict(x=X, y=None, sample_weight=None))
print(type(data["x"]))
print(data["x"].shape)
```

Our shapes look good, and we can handle the `y=None` case.

Because Keras will not accept a `RaggedTensor` directly, we will need to wrap our entire dataset into a `tf.data.Dataset`. We can do this by adding one more transformation step:

```python
def dataset_transformer(data):
    x, y, sample_weight = data["x"], data.get("y"), data.get("sample_weight")
    # Bundle the whole dataset into a single tf.data.Dataset, skipping any
    # element that is absent so that `from_tensor_slices` never receives None.
    tensors = tuple(t for t in (x, y, sample_weight) if t is not None)
    data["x"] = tf.data.Dataset.from_tensor_slices(
        tensors if len(tensors) > 1 else tensors[0]
    )
    # don't blindly assign y & sw; if being called from
    # predict they should not just be None, they should not be present at all!
    if "y" in data:
        data["y"] = None
    if "sample_weight" in data:
        data["sample_weight"] = None
    return data
```

Next, we can add our transformers to our model. We use an sklearn `Pipeline` (generated via `make_pipeline`) to keep ClassWeightDataTransformer operational while implementing our custom transformation.
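Concretely, the wiring could look something like the sketch below. This assumes `dataset_transformer` is exposed as an overridable property on the wrapper, and the class name `RaggedClassifier` is illustrative; `FunctionTransformer` simply turns our plain functions into pipeline steps.

```python
from scikeras.wrappers import KerasClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


class RaggedClassifier(KerasClassifier):

    @property
    def dataset_transformer(self):
        # Chain our ragged conversion, the parent's default transformer
        # (ClassWeightDataTransformer) and the tf.data.Dataset wrapping.
        return make_pipeline(
            FunctionTransformer(ragged_transformer),
            super().dataset_transformer,
            FunctionTransformer(dataset_transformer),
        )
```

Once defined, such a classifier fits and predicts like any other sklearn estimator: `clf.fit(X, y)` followed by `y_pred = clf.predict(X)`.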
## 6. Multi-output class_weight
In this example, we will use `dataset_transformer` to support multi-output class weights.
We will re-use our `MultiOutputTransformer` from our previous example to split the output, then we will create `sample_weight` from `class_weight`.
```python
from collections import defaultdict
from sklearn.utils.class_weight import compute_sample_weight
```
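The workhorse here is `compute_sample_weight`, which maps a `class_weight` specification to per-sample weights; a quick standalone illustration with made-up labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y = np.array([0, 0, 1])
# "balanced" weighs each sample by n_samples / (n_classes * class_count):
print(compute_sample_weight("balanced", y))  # [0.75 0.75 1.5 ]
```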
Although `dataset_transformer` is primarily designed for data transformations, because it returns valid `**kwargs` for `fit`, it can be used for other advanced use cases.

In this example, we use `dataset_transformer` to implement a custom train/test split for Keras' internal validation. We'll use sklearn's `train_test_split`, but this could be implemented via an arbitrary user function, e.g. to ensure balanced class distribution.
```python
from sklearn.model_selection import train_test_split
```

We see that we get near zero validation accuracy. Because one of our classes was only found in the tail end of our dataset and we specified `validation_split=0.1`, we validated on a class that was never seen during training.
We could specify `shuffle=True` (this is actually the default), but for highly imbalanced classes, this may not be as good as stratified splitting.
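For instance, a sketch of a transformer performing a stratified split (the function name is illustrative; it relies on returned dictionary keys being forwarded to `Model.fit`, here `validation_data`):

```python
from sklearn.model_selection import train_test_split


def stratified_split_transformer(data):
    x, y = data["x"], data.get("y")
    if y is None:
        return data  # nothing to split at predict time
    # Hold out 10% of the data, stratified so every class appears in both splits.
    x_train, x_val, y_train, y_val = train_test_split(
        x, y, test_size=0.1, stratify=y, random_state=0
    )
    data["x"], data["y"] = x_train, y_train
    data["validation_data"] = (x_val, y_val)  # consumed by Model.fit
    return data
```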
In this tutorial, we use the `dataset_transformer` interface to implement a dynamic batch_size, similar to sklearn's [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html). We will implement `batch_size` as `batch_size=min(200, n_samples)`.
```python
from sklearn.model_selection import train_test_split
```
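A sketch of the idea (the function name is illustrative; it relies on the returned `batch_size` key being forwarded to `Model.fit`):

```python
def dynamic_batch_size(data):
    # Mirror sklearn's MLPClassifier default: batch_size = min(200, n_samples).
    n_samples = data["x"].shape[0]
    data["batch_size"] = min(200, n_samples)  # forwarded to Model.fit
    return data
```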