Fix ColumnTransformer.set_output for remainder estimator (swev-id: scikit-learn__scikit-learn-26323)#62

Open
rowan-stein wants to merge 1 commit into scikit-learn__scikit-learn-26323 from noa/issue-60

@rowan-stein
Collaborator

Summary

This PR fixes a bug where ColumnTransformer.set_output ignores the remainder when it is an estimator. As a result, with set_output(transform="pandas"), the remainder path emits NumPy arrays, which forces a fallback to np.hstack and causes dtype coercion and an incorrectly assembled final output.

Fixes: #60

Root Cause

ColumnTransformer.set_output iterates over self.transformers and self.transformers_, but in the common case where set_output is called before fit, it never applies the configuration to self.remainder when that is an estimator. Consequently, the remainder estimator continues to output NumPy arrays, which mix with the DataFrame outputs of the other transformers and break the intended pandas concatenation.
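The dtype coercion behind the first failing assertion can be illustrated in isolation. The following is a minimal sketch using only NumPy and pandas, not scikit-learn internals; the variable names are illustrative, and the integer array stands in for what a remainder estimator would emit without the pandas output configuration.

```python
import numpy as np
import pandas as pd

# Output of the named transformer: already a DataFrame with a bool column.
bool_part = pd.DataFrame({"a": [True, False, True]})

# Stand-in for the remainder output when set_output was not propagated:
# a plain ndarray rather than a DataFrame.
int_part = np.array([[1], [2], [3]], dtype=np.int64)

# Mixed inputs can only be combined as arrays, so a single common dtype is chosen.
stacked = np.hstack([bool_part, int_part])
print(stacked.dtype)  # int64: the bool column has been upcast

# When both pieces are DataFrames, column-wise concat keeps per-column dtypes.
int_frame = pd.DataFrame({"b": np.array([1, 2, 3], dtype=np.int64)})
concatenated = pd.concat([bool_part, int_frame], axis=1)
print(concatenated.dtypes["a"])  # bool
```

An ndarray holds one dtype for all columns, so the bool values become integers as soon as any piece is not a DataFrame.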

Fix

  • Update sklearn/compose/_column_transformer.py:
    • In ColumnTransformer.set_output, after applying _safe_set_output to named transformers, also apply it to self.remainder when it is an estimator (i.e., not 'passthrough' or 'drop').
  • This ensures the remainder estimator respects the set_output configuration (including transform='pandas') and enables proper DataFrame concatenation downstream.
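The propagation pattern described above can be sketched with a toy model. This is not the scikit-learn source: ToyEstimator, ToyColumnTransformer, and is_estimator are hypothetical names introduced for illustration, and the real implementation delegates to _safe_set_output rather than calling set_output directly.

```python
from itertools import chain


class ToyEstimator:
    """Hypothetical stand-in for a transformer; it only records its output config."""

    def __init__(self):
        self.output_config = None

    def set_output(self, *, transform=None):
        self.output_config = transform
        return self


def is_estimator(obj):
    # The string sentinels 'passthrough' and 'drop' are not estimators.
    return not isinstance(obj, str)


class ToyColumnTransformer:
    """Simplified model of set_output propagation, not scikit-learn's code."""

    def __init__(self, transformers, remainder="drop"):
        self.transformers = transformers  # list of (name, estimator or str, columns)
        self.remainder = remainder
        self.output_config = None

    def set_output(self, *, transform=None):
        self.output_config = transform
        # Named transformers, plus fitted ones if a fit step has populated them.
        fitted = getattr(self, "transformers_", [])
        for _, trans, _ in chain(self.transformers, fitted):
            if is_estimator(trans):
                trans.set_output(transform=transform)
        # The step this PR describes: also configure an estimator remainder.
        if is_estimator(self.remainder):
            self.remainder.set_output(transform=transform)
        return self


named, remainder = ToyEstimator(), ToyEstimator()
ct = ToyColumnTransformer([("named", named, ["a"])], remainder=remainder)
ct.set_output(transform="pandas")
print(named.output_config, remainder.output_config)  # pandas pandas
```

Without the final `if` block, `remainder.output_config` would stay `None` in this model, which corresponds to the pre-fix behaviour reported in the issue.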

Reproduction Steps

Using the user-provided example:

import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({"a": [True, False, True], "b": [1, 2, 3]})
# Case 1: the non-bool column is handled by an estimator passed as remainder.
out1 = make_column_transformer(
    (VarianceThreshold(), make_column_selector(dtype_include=bool)),
    remainder=VarianceThreshold(),
    verbose_feature_names_out=False,
).set_output(transform="pandas").fit_transform(df)
print(out1)

# Case 2: the same estimator is listed explicitly instead of via remainder.
out2 = make_column_transformer(
    (VarianceThreshold(), make_column_selector(dtype_include=bool)),
    (VarianceThreshold(), make_column_selector(dtype_exclude=bool)),
    verbose_feature_names_out=False,
).set_output(transform="pandas").fit_transform(df)
print(out2)

Observed Failure (Pre-fix)

Assertion-based test failures before the fix:

============================= test session starts ==============================
platform linux -- Python 3.11.14, pytest-9.0.2, pluggy-1.6.0
rootdir: /workspace/scikit-learn
configfile: setup.cfg
collected 190 items / 188 deselected / 2 selected

sklearn/compose/tests/test_column_transformer.py FF                      [100%]

=================================== FAILURES ===================================
________ test_column_transformer_remainder_estimator_set_output_pandas _________
...
>       assert transformed.dtypes["a"] == df.dtypes["a"]
E       AssertionError: assert dtype('int64') == dtype('bool')
...
_ test_column_transformer_remainder_estimator_matches_explicit_transformers_pandas _
...
>       pd.testing.assert_frame_equal(remainder_result, explicit_result)
E       AssertionError: DataFrame.columns are different
================= 2 failed, 188 deselected, 1 warning in 0.46s =================

After Fix

  • Both new tests pass. The remainder-estimator case now yields a pandas DataFrame with correct dtypes and matches the explicit two-transformer configuration.

Tests Added

  • test_column_transformer_remainder_estimator_set_output_pandas
  • test_column_transformer_remainder_estimator_matches_explicit_transformers_pandas

Implementation Details

  • Code changes are limited to ColumnTransformer.set_output in sklearn/compose/_column_transformer.py.
  • _safe_set_output is applied to self.remainder when it’s an estimator.
  • Edge cases:
    • remainder='drop': unchanged.
    • remainder='passthrough': identity transformer still handled via existing logic.
    • remainder as a Pipeline: Pipeline.set_output propagates as expected.

Environment Used (Local Validation)

  • Python 3.11.14 (conda)
  • numpy 1.26.4, scipy 1.11.4, cython 0.29.36, pandas, pytest
  • Editable scikit-learn build; standard test runner

Notes

  • We did not trigger a CI run, per task rules; the new tests pass locally.

@rowan-stein rowan-stein requested a review from a team December 26, 2025 14:20

@noa-lucent noa-lucent left a comment

Looks good to me.
