Skip to content

Comments

fix: relax pandas index wrapping (swev-id: scikit-learn__scikit-learn-25747)#73

Open
casey-brooks wants to merge 2 commits intoscikit-learn__scikit-learn-25747from
fix/pandas-index-wrap-25747
Open

fix: relax pandas index wrapping (swev-id: scikit-learn__scikit-learn-25747)#73
casey-brooks wants to merge 2 commits intoscikit-learn__scikit-learn-25747from
fix/pandas-index-wrap-25747

Conversation

@casey-brooks
Copy link

Summary

  • relax _wrap_in_pandas_container so pandas outputs keep their own index when row counts change and only align when lengths match
  • adjust ndarray wrapping to ignore mismatched indices while preserving column naming
  • add targeted FeatureUnion and utility tests covering mismatch and aligned cases

Testing

  • pytest -q sklearn/utils/tests/test_set_output.py::test__wrap_in_pandas_container_preserve_index_on_length_mismatch_dataframe
  • pytest -q sklearn/utils/tests/test_set_output.py::test__wrap_in_pandas_container_ndarray_ignore_index_on_length_mismatch
  • pytest -q sklearn/utils/tests/test_set_output.py::test__wrap_in_pandas_container_align_when_lengths_match
  • pytest -q sklearn/tests/test_pipeline.py::test_feature_union_pandas_preserves_aggregated_index
  • pytest -q sklearn/tests/test_pipeline.py::test_feature_union_pandas_aligns_index_when_lengths_match
  • flake8 sklearn/utils/_set_output.py sklearn/utils/tests/test_set_output.py sklearn/tests/test_pipeline.py

Reproduction

Steps from #72:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import set_config
from sklearn.pipeline import make_union

index = pd.date_range(start="2020-01-01", end="2020-01-05", inclusive="left", freq="H")
data = pd.DataFrame(index=index, data=[10] * len(index), columns=["value"])
data["date"] = index.date

class MyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X["value"].groupby(X["date"]).sum()

set_config(transform_output="default")
print(make_union(MyTransformer()).fit_transform(data))

set_config(transform_output="pandas")
print(make_union(MyTransformer()).fit_transform(data))

Observed failure:

ValueError: Length mismatch: Expected axis has 4 elements, new values have 96 elements

Stack trace summary: the error is raised in sklearn/utils/_set_output.py when _wrap_in_pandas_container assigns the original input index to a pandas output with fewer rows.

Fixes #72.

@casey-brooks casey-brooks requested a review from a team January 13, 2026 21:29
@casey-brooks
Copy link
Author

Local Testing

  • pytest -q sklearn/utils/tests/test_set_output.py::test__wrap_in_pandas_container_preserve_index_on_length_mismatch_dataframe ➜ 1 passed
  • pytest -q sklearn/utils/tests/test_set_output.py::test__wrap_in_pandas_container_ndarray_ignore_index_on_length_mismatch ➜ 1 passed
  • pytest -q sklearn/utils/tests/test_set_output.py::test__wrap_in_pandas_container_align_when_lengths_match ➜ 1 passed
  • pytest -q sklearn/tests/test_pipeline.py::test_feature_union_pandas_preserves_aggregated_index ➜ 1 passed
  • pytest -q sklearn/tests/test_pipeline.py::test_feature_union_pandas_aligns_index_when_lengths_match ➜ 1 passed
  • flake8 sklearn/utils/_set_output.py sklearn/utils/tests/test_set_output.py sklearn/tests/test_pipeline.py ➜ no issues

Copy link

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes and the added coverage. I left one blocking comment: we still need an integration test that exercises a FeatureUnion transformer returning a pandas Series, which is the exact shape that regressed in #72. That path bypasses the new DataFrame branch, so we should lock it down explicitly.

@casey-brooks
Copy link
Author

Local Testing Update

  • pytest -q sklearn/tests/test_pipeline.py::test_feature_union_pandas_preserves_aggregated_index_series ➜ 1 passed
  • pytest -q sklearn/tests/test_pipeline.py::test_feature_union_pandas_preserves_aggregated_index ➜ 1 passed
  • pytest -q sklearn/tests/test_pipeline.py::test_feature_union_pandas_aligns_index_when_lengths_match ➜ 1 passed
  • pytest -q sklearn/utils/tests/test_set_output.py::test__wrap_in_pandas_container_preserve_index_on_length_mismatch_dataframe ➜ 1 passed
  • pytest -q sklearn/utils/tests/test_set_output.py::test__wrap_in_pandas_container_ndarray_ignore_index_on_length_mismatch ➜ 1 passed
  • pytest -q sklearn/utils/tests/test_set_output.py::test__wrap_in_pandas_container_align_when_lengths_match ➜ 1 passed

Copy link

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the Series coverage and extending the wrapper logic. The new _wrap_in_pandas_container branch for Series plus the FeatureUnion integration test exercise the exact regression from #72, and the unit tests assert the DataFrame/ndarray behavior stays consistent. Looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants