FeatureUnion: error with transform_output="pandas" when transformer aggregates (index length mismatch) #72

@rowan-stein

Description

User Request

FeatureUnion not working when aggregating data and pandas transform output selected

Describe the bug

I would like to use the pandas transform output together with a custom transformer, inside a feature union, that aggregates data. This combination raises an error. With the default numpy output it works fine.

Steps/Code to Reproduce

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import set_config
from sklearn.pipeline import make_union

index = pd.date_range(start="2020-01-01", end="2020-01-05", inclusive="left", freq="H")
data = pd.DataFrame(index=index, data=[10] * len(index), columns=["value"])
data["date"] = index.date


class MyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X: pd.DataFrame, y: pd.Series | None = None, **kwargs):
        return self

    def transform(self, X: pd.DataFrame, y: pd.Series | None = None) -> pd.DataFrame:
        return X["value"].groupby(X["date"]).sum()


# This works.
set_config(transform_output="default")
print(make_union(MyTransformer()).fit_transform(data))

# This does not work.
set_config(transform_output="pandas")
print(make_union(MyTransformer()).fit_transform(data))

Expected Results

No error is thrown when using pandas transform output.

Actual Results

ValueError: Length mismatch: Expected axis has 4 elements, new values have 96 elements

(Full stack trace shows failure inside sklearn/utils/_set_output.py when assigning the original input index to the pandas output.)

Versions

Python 3.10.6; sklearn 1.2.1; pandas 1.4.4; numpy 1.23.5; macOS 11.3

Researcher Specification

Root Cause

  • In sklearn/utils/_set_output.py, when transform_output="pandas", _wrap_in_pandas_container unconditionally sets the returned pandas object's index to the original input's index:
    if index is not None:
        data_to_wrap.index = index
  • If a transformer returns a pandas DataFrame/Series with a different number of rows (e.g., aggregated data), assigning the original input index raises pandas ValueError due to length mismatch.
  • FeatureUnion and ColumnTransformer use pd.concat for pandas outputs, which aligns by index and can handle differing row counts, but the unconditional index assignment fails earlier.
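The failing assignment can be reproduced with pandas alone, without scikit-learn. The sketch below is only an illustration of the mechanism described above: it builds 96 hourly rows, aggregates them to 4 daily rows, and then attempts the same kind of index assignment. The variable names are chosen for this example and do not come from the scikit-learn source.

```python
import pandas as pd

# 96 hourly rows, mirroring the shape of the input in the report.
index = pd.date_range("2020-01-01", periods=96, freq=pd.Timedelta(hours=1))
data = pd.DataFrame({"value": [10] * len(index)}, index=index)

# Aggregating by calendar day leaves 4 rows.
daily = data["value"].groupby(index.date).sum().to_frame()

message = None
try:
    # Same kind of assignment the wrapper performs with the input index.
    daily.index = index
except ValueError as exc:
    message = str(exc)

print(len(daily), message)
```

The exception comes from pandas itself, which is why the guard has to sit in front of the assignment rather than around the later concatenation.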

Proposed Change (Minimal, Backward-Compatible)

  • Modify _wrap_in_pandas_container to assign the original input index only when lengths match. Otherwise, preserve the transformer’s output index.
  • Keep column naming behavior unchanged. For ndarray outputs, apply the provided index only when lengths match; otherwise, let pandas create its default index to avoid the ValueError.

Illustrative adjustment:

# For pandas DataFrame outputs
if isinstance(data_to_wrap, pd.DataFrame):
    if columns is not None:
        data_to_wrap.columns = columns
    if index is not None:
        try:
            # Reuse the input index only if it has one label per output row.
            if len(index) == len(data_to_wrap):
                data_to_wrap.index = index
        except TypeError:
            # The index has no len(); keep the transformer's own index.
            pass
    return data_to_wrap

# For ndarray outputs
index_to_use = None
if index is not None:
    try:
        if len(index) == len(data_to_wrap):
            index_to_use = index
    except TypeError:
        pass
return pd.DataFrame(data_to_wrap, index=index_to_use, columns=columns)
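To try the guard in isolation, the fragment above can be wrapped in a standalone function. This is a minimal sketch under stated assumptions: `wrap_with_length_guard` and `index_fits` are hypothetical names, the function is not scikit-learn's `_wrap_in_pandas_container`, and it handles only DataFrame and ndarray inputs with list-like `columns`.

```python
import numpy as np
import pandas as pd


def wrap_with_length_guard(data_to_wrap, *, columns=None, index=None):
    """Hypothetical stand-in for the wrapper; not scikit-learn's code."""

    def index_fits(candidate):
        # Reuse the input index only when it has one label per output row.
        try:
            return candidate is not None and len(candidate) == len(data_to_wrap)
        except TypeError:
            return False

    if isinstance(data_to_wrap, pd.DataFrame):
        if columns is not None:
            data_to_wrap.columns = columns
        if index_fits(index):
            data_to_wrap.index = index
        return data_to_wrap

    # ndarray branch: fall back to the default index on a length mismatch.
    return pd.DataFrame(
        data_to_wrap,
        index=index if index_fits(index) else None,
        columns=columns,
    )


original_index = pd.Index(list("uvwxyz"))  # six input rows

# Aggregated DataFrame (2 rows): its own index is preserved.
aggregated = pd.DataFrame({"total": [30, 30]}, index=["a", "b"])
kept = wrap_with_length_guard(aggregated, index=original_index)

# ndarray with 2 rows: the mismatched index is ignored.
from_array = wrap_with_length_guard(np.ones((2, 1)), index=original_index)

# ndarray with 6 rows: the input index is applied as before.
aligned = wrap_with_length_guard(np.ones((6, 1)), index=original_index)

print(list(kept.index), list(from_array.index), list(aligned.index))
```

The three calls correspond to the three unit-test cases listed under Tests to Add/Update.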

Tests to Add/Update

  • Unit tests in sklearn/utils/tests/test_set_output.py:
    • Preserve index when lengths differ for pandas DataFrame/Series outputs.
    • Ignore provided index for ndarray outputs when lengths differ.
    • Confirm alignment still happens when lengths match.
  • Integration tests in sklearn/tests/test_pipeline.py:
    • FeatureUnion with an aggregate transformer under transform_output="pandas" should not raise and should preserve the aggregated index.
    • Confirm that equal-length outputs still align to the original index.
  • Optional: sklearn/compose/tests/test_column_transformer.py similar aggregate case.

Reproduction and Observed Failure

  • See code above; reproduces ValueError originating from _wrap_in_pandas_container assigning a mismatched index to a pandas output.

Risk and Compatibility Notes

  • Behavior change is scoped to transform_output="pandas" and mismatched lengths.
  • Equal-length behavior remains unchanged (still aligns to original index).
  • Enables aggregate-style transformers with pandas outputs to pass through and be concatenated by index, consistent with pd.concat semantics already used by FeatureUnion/ColumnTransformer.
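The pd.concat behavior this relies on can be checked directly. The sketch below uses made-up frames (`sums`, `means`, `partial`) rather than real transformer outputs. It shows that column-wise concatenation aligns rows by index label, and that with pandas' default outer join, labels missing from one frame are filled with NaN instead of raising.

```python
import pandas as pd

days = pd.Index(["2020-01-01", "2020-01-02"], name="date")

# Two aggregated outputs that share the same daily index.
sums = pd.DataFrame({"sum": [240, 240]}, index=days)
means = pd.DataFrame({"mean": [10.0, 10.0]}, index=days)
combined = pd.concat([sums, means], axis=1)

# An output covering only one of the two days.
partial = pd.DataFrame(
    {"max": [10]}, index=pd.Index(["2020-01-02"], name="date")
)
outer = pd.concat([sums, partial], axis=1)

print(combined)
print(outer)
```

Outputs with different row counts are therefore combined on the union of their index labels, so callers that mix aggregated and non-aggregated transformers should expect NaN-padded rows.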
