forked from scikit-learn/scikit-learn
Open
Description
User Request
FeatureUnion not working when aggregating data and pandas transform output selected
Describe the bug
I would like to use pandas transform output together with a custom transformer in a feature union that aggregates data. With this combination I get an error. With the default numpy output it works fine.
Steps/Code to Reproduce
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import set_config
from sklearn.pipeline import make_union

index = pd.date_range(start="2020-01-01", end="2020-01-05", inclusive="left", freq="H")
data = pd.DataFrame(index=index, data=[10] * len(index), columns=["value"])
data["date"] = index.date


class MyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X: pd.DataFrame, y: pd.Series | None = None, **kwargs):
        return self

    def transform(self, X: pd.DataFrame, y: pd.Series | None = None) -> pd.DataFrame:
        return X["value"].groupby(X["date"]).sum()


# This works.
set_config(transform_output="default")
print(make_union(MyTransformer()).fit_transform(data))

# This does not work.
set_config(transform_output="pandas")
print(make_union(MyTransformer()).fit_transform(data))
Expected Results
No error is thrown when using pandas transform output.
Actual Results
ValueError: Length mismatch: Expected axis has 4 elements, new values have 96 elements
(Full stack trace shows failure inside sklearn/utils/_set_output.py when assigning the original input index to the pandas output.)
Versions
Python 3.10.6; sklearn 1.2.1; pandas 1.4.4; numpy 1.23.5; macOS 11.3
Researcher Specification
Root Cause
- In sklearn/utils/_set_output.py, when transform_output="pandas", _wrap_in_pandas_container unconditionally sets the returned pandas object's index to the original input's index: if index is not None: data_to_wrap.index = index
- If a transformer returns a pandas DataFrame/Series with a different number of rows (e.g., aggregated data), assigning the original input index raises a pandas ValueError due to length mismatch.
- FeatureUnion and ColumnTransformer use pd.concat for pandas outputs, which aligns by index and can handle differing row counts, but the unconditional index assignment fails earlier.
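The length mismatch described above can be reproduced with pandas alone. The following is a minimal sketch, not the sklearn code path itself: the six-row frame, the group column, and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical input: 6 rows that aggregate down to 2 groups.
X = pd.DataFrame({"value": [10] * 6, "group": ["a", "a", "a", "b", "b", "b"]})
aggregated = X["value"].groupby(X["group"]).sum().to_frame()

error_message = None
try:
    # Same kind of assignment the wrapper performs: 6 labels onto 2 rows.
    aggregated.index = X.index
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The failed assignment leaves the aggregated frame with its own two-row index, which is the index the proposed change would preserve.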
Proposed Change (Minimal, Backward-Compatible)
- Modify _wrap_in_pandas_container to assign the original input index only when lengths match. Otherwise, preserve the transformer’s output index.
- Keep column naming behavior unchanged. For ndarray outputs, only apply the provided index when lengths match; otherwise, use default index to avoid ValueError.
Illustrative adjustment:
if isinstance(data_to_wrap, pd.DataFrame):
    if columns is not None:
        data_to_wrap.columns = columns
    if index is not None:
        try:
            if len(index) == len(data_to_wrap):
                data_to_wrap.index = index
        except TypeError:
            pass
    return data_to_wrap

# For ndarray outputs
index_to_use = None
if index is not None:
    try:
        if len(index) == len(data_to_wrap):
            index_to_use = index
    except TypeError:
        pass
return pd.DataFrame(data_to_wrap, index=index_to_use, columns=columns)
Tests to Add/Update
- Unit tests in sklearn/utils/tests/test_set_output.py:
  - Preserve index when lengths differ for pandas DataFrame/Series outputs.
  - Ignore provided index for ndarray outputs when lengths differ.
  - Confirm alignment still happens when lengths match.
- Integration tests in sklearn/tests/test_pipeline.py:
  - FeatureUnion with an aggregate transformer under transform_output="pandas" should not raise and should preserve the aggregated index.
  - Confirm that equal-length outputs still align to the original index.
- Optional: sklearn/compose/tests/test_column_transformer.py similar aggregate case.
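The three unit-test bullets above could be sketched roughly as follows. This is an illustration under stated assumptions: wrap_with_length_guard is a hypothetical standalone stand-in that mirrors the illustrative adjustment, not sklearn's private _wrap_in_pandas_container, and the sample data is invented.

```python
import numpy as np
import pandas as pd


def wrap_with_length_guard(data_to_wrap, *, columns=None, index=None):
    """Hypothetical stand-in for the proposed logic (not sklearn's API)."""

    def lengths_match():
        # Only reuse the input index when it has one label per output row.
        try:
            return index is not None and len(index) == len(data_to_wrap)
        except TypeError:
            return False

    if isinstance(data_to_wrap, pd.DataFrame):
        if columns is not None:
            data_to_wrap.columns = columns
        if lengths_match():
            data_to_wrap.index = index
        return data_to_wrap

    # ndarray-like outputs: fall back to the default index on mismatch.
    index_to_use = index if lengths_match() else None
    return pd.DataFrame(data_to_wrap, index=index_to_use, columns=columns)


original_index = pd.Index([10, 11, 12, 13])

# 1) pandas output with fewer rows keeps its own index.
aggregated = pd.DataFrame({"value": [20, 20]}, index=["a", "b"])
kept = wrap_with_length_guard(aggregated, index=original_index)

# 2) ndarray output with fewer rows ignores the provided index.
from_array = wrap_with_length_guard(
    np.ones((2, 1)), columns=["value"], index=original_index
)

# 3) equal-length output is still aligned to the original index.
same_length = wrap_with_length_guard(
    pd.DataFrame({"value": [1, 2, 3, 4]}), index=original_index
)
```

In the real test module these checks would presumably call the patched helper directly rather than a copy of it.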
Reproduction and Observed Failure
- See code above; reproduces ValueError originating from _wrap_in_pandas_container assigning a mismatched index to a pandas output.
Risk and Compatibility Notes
- Behavior change is scoped to transform_output="pandas" and mismatched lengths.
- Equal-length behavior remains unchanged (still aligns to original index).
- Enables aggregate-style transformers with pandas outputs to pass through and be concatenated by index, consistent with pd.concat semantics already used by FeatureUnion/ColumnTransformer.
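To make the "concatenated by index" note concrete, here is a small sketch of how pd.concat with axis=1 treats frames whose row counts differ. The frame names and labels are made up, and this shows plain pandas behavior rather than anything specific to FeatureUnion.

```python
import pandas as pd

# Hypothetical outputs of two transformers with different row counts.
daily = pd.DataFrame({"daily_sum": [20, 20]}, index=["d1", "d2"])
other = pd.DataFrame({"other": [1, 2, 3]}, index=["d1", "d2", "d3"])

# axis=1 aligns rows by index label; the default join is an outer join,
# so labels missing from one frame are filled with NaN.
combined = pd.concat([daily, other], axis=1)
print(combined)
```

One consequence worth keeping in mind when reviewing the change: rows line up only where index labels coincide, so mixing aggregated and per-row outputs in one union can introduce NaN values.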