Fix ColumnTransformer.set_output for remainder estimator (swev-id: scikit-learn__scikit-learn-26323) by rowan-stein · Pull Request #62 · agyn-sandbox/scikit-learn

rowan-stein · 2025-12-26T14:20:01Z

Summary

This PR fixes a bug where ColumnTransformer.set_output ignores the remainder when it is an estimator. As a result, with set_output(transform="pandas"), the remainder path emits numpy arrays, which forces a fallback to np.hstack and causes dtype coercion and incorrect final assembly.

Fixes: #60

Root Cause

ColumnTransformer.set_output iterates over self.transformers and self.transformers_, but does not apply set_output to self.remainder when it’s an estimator. This matters in the common case where set_output is called before fit. Consequently, the remainder estimator keeps emitting numpy arrays, which mix with the DataFrame outputs of the other transformers and break the intended pandas concatenation.

Fix

Update sklearn/compose/_column_transformer.py:
- In ColumnTransformer.set_output, after applying _safe_set_output to named transformers, also apply it to self.remainder when it is an estimator (i.e., not 'passthrough' or 'drop').
This ensures the remainder estimator respects the set_output configuration (including transform='pandas') and enables proper DataFrame concatenation downstream.
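
The propagation described above can be illustrated with a small toy model. This is a sketch only, not the code in sklearn/compose/_column_transformer.py: ToyTransformer, ToyColumnTransformer, and safe_set_output are hypothetical stand-ins invented for illustration, and they model nothing beyond how the output setting reaches the remainder.

```python
from itertools import chain


class ToyTransformer:
    """Hypothetical stand-in for an estimator that supports set_output."""

    def __init__(self):
        self.output = "default"

    def set_output(self, *, transform=None):
        if transform is not None:
            self.output = transform
        return self


def safe_set_output(estimator, *, transform=None):
    """Simplified, assumed analogue of the _safe_set_output helper."""
    if hasattr(estimator, "set_output"):
        estimator.set_output(transform=transform)


class ToyColumnTransformer:
    """Hypothetical model of the propagation logic only (no fit/transform)."""

    def __init__(self, transformers, remainder="drop"):
        self.transformers = transformers  # list of (name, transformer, columns)
        self.remainder = remainder
        self.output = "default"

    def set_output(self, *, transform=None):
        if transform is not None:
            self.output = transform
        # Named transformers, plus fitted ones if they exist.
        fitted = getattr(self, "transformers_", [])
        for _, trans, _ in chain(self.transformers, fitted):
            if trans not in ("passthrough", "drop"):
                safe_set_output(trans, transform=transform)
        # The step this PR describes: also configure an estimator remainder.
        if self.remainder not in ("passthrough", "drop"):
            safe_set_output(self.remainder, transform=transform)
        return self


first = ToyTransformer()
rest = ToyTransformer()
ct = ToyColumnTransformer([("bools", first, ["a"])], remainder=rest)
ct.set_output(transform="pandas")
print(first.output, rest.output)  # prints: pandas pandas
```

Without the final `if` block, `rest.output` would stay "default" in this model, which mirrors the mixed numpy/DataFrame outputs described under Root Cause. The string remainders 'drop' and 'passthrough' are skipped by that check.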

Reproduction Steps

Using the user-provided example:

import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({"a": [True, False, True], "b": [1, 2, 3]})

# Case 1: the non-bool columns are handled by a remainder estimator.
out1 = make_column_transformer(
    (VarianceThreshold(), make_column_selector(dtype_include=bool)),
    remainder=VarianceThreshold(),
    verbose_feature_names_out=False,
).set_output(transform="pandas").fit_transform(df)
print(out1)

# Case 2: the same columns are handled by an explicit second transformer.
# The result should match case 1.
out2 = make_column_transformer(
    (VarianceThreshold(), make_column_selector(dtype_include=bool)),
    (VarianceThreshold(), make_column_selector(dtype_exclude=bool)),
    verbose_feature_names_out=False,
).set_output(transform="pandas").fit_transform(df)
print(out2)

Observed Failure (Pre-fix)

Assertion-based test failures before the fix:

============================= test session starts ==============================
platform linux -- Python 3.11.14, pytest-9.0.2, pluggy-1.6.0
rootdir: /workspace/scikit-learn
configfile: setup.cfg
collected 190 items / 188 deselected / 2 selected

sklearn/compose/tests/test_column_transformer.py FF                      [100%]

=================================== FAILURES ===================================
________ test_column_transformer_remainder_estimator_set_output_pandas _________
...
>       assert transformed.dtypes["a"] == df.dtypes["a"]
E       AssertionError: assert dtype('int64') == dtype('bool')
...
_ test_column_transformer_remainder_estimator_matches_explicit_transformers_pandas _
...
>       pd.testing.assert_frame_equal(remainder_result, explicit_result)
E       AssertionError: DataFrame.columns are different
================= 2 failed, 188 deselected, 1 warning in 0.46s =================

After Fix

Both new tests pass. The remainder-estimator case now yields a pandas DataFrame with correct dtypes and matches the explicit two-transformer configuration.

Tests Added

test_column_transformer_remainder_estimator_set_output_pandas
test_column_transformer_remainder_estimator_matches_explicit_transformers_pandas

Implementation Details

Code changes are limited to ColumnTransformer.set_output in sklearn/compose/_column_transformer.py.
_safe_set_output is applied to self.remainder when it’s an estimator.
Edge cases:
- remainder='drop': unchanged.
- remainder='passthrough': identity transformer still handled via existing logic.
- remainder as a Pipeline: Pipeline.set_output propagates as expected.

Environment Used (Local Validation)

Python 3.11.14 (conda)
numpy 1.26.4, scipy 1.11.4, cython 0.29.36, pandas, pytest
Editable scikit-learn build; standard test runner

Notes

Per task rules, we did not trigger a CI run; local tests pass.

noa-lucent

Looks good to me.

fix(compose): apply set_output to remainder estimator

031e8c4

rowan-stein requested a review from a team December 26, 2025 14:20

noa-lucent approved these changes Dec 26, 2025

rowan-stein wants to merge 1 commit into scikit-learn__scikit-learn-26323 from noa/issue-60

3 participants
