Description
Let's assume we are working with variable-length inputs. One of the strongest parts of using tf.data.Dataset is the ability to pad batches as they come.
But since scikit-learn's API is mainly built around dataframes and arrays, incorporating this is hard. Obviously, you can pad everything up front, but that can be a huge waste of memory. I'm trying to work with the sklearn.pipeline.Pipeline object, and I thought to myself: "alright, I'll just create a custom transformer at the end of my pipeline, just before the model, and make it return a tf.data.Dataset object to later feed into my model." But this is not possible, since the .transform signature only accepts X and not y, while you need both to build a tf.data.Dataset.
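For context, this is the per-batch padding behavior I mean — a minimal sketch with made-up sequence values, where each batch is padded only to the longest sequence in that batch rather than in the whole dataset:

```python
import tensorflow as tf

# Ragged sequences of different lengths (illustrative values).
sequences = [[1, 2, 3], [1, 2], [1, 2, 3, 4, 5]]

ds = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32),
)

# padded_batch pads each batch to its own longest element:
# batch 1 -> shape (2, 3), batch 2 -> shape (1, 5).
batched = ds.padded_batch(2)
```

Padding everything to the global maximum length up front would instead allocate a (3, 5) array, which is exactly the memory waste I'd like to avoid at scale.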
So assume we have 4 features for each data point, and each has its own sequence length. For example, a data point might look like this:
```python
sample_features = {'a': [1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': 1, 'd': [1, 2]}
sample_label = 0
```

How will I be able to manage this kind of dataset under scikit-learn + SciKeras?
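For reference, this is how I would express that dataset in plain tf.data, without scikit-learn in the way (the second sample and its label are made up for illustration):

```python
import tensorflow as tf

# Two ragged samples shaped like the example above (second one is made up).
features = [
    {'a': [1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': 1, 'd': [1, 2]},
    {'a': [4, 5], 'b': [6], 'c': 2, 'd': [3, 4, 5]},
]
labels = [0, 1]

def gen():
    # Yield (features, label) pairs, as Keras expects.
    for f, y in zip(features, labels):
        yield f, y

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        {
            'a': tf.TensorSpec(shape=(None,), dtype=tf.int32),
            'b': tf.TensorSpec(shape=(None,), dtype=tf.int32),
            'c': tf.TensorSpec(shape=(), dtype=tf.int32),
            'd': tf.TensorSpec(shape=(None,), dtype=tf.int32),
        },
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)

# Each variable-length feature is padded independently, per batch.
batched = ds.padded_batch(2)
```

The point is that padded_batch handles each ragged feature separately, but building this requires both X and y together, which is what the .transform(X) signature doesn't allow.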