forked from scikit-learn/scikit-learn
Open
Description
Unable to pass splits to SequentialFeatureSelector
Describe the bug
Passing an iterable of splits (e.g., a generator from a CV splitter) to SequentialFeatureSelector fails. Using cv=5 works, but cv=splits generated via LeaveOneGroupOut.split(X, y, groups) triggers an IndexError.
Steps/Code to Reproduce
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneGroupOut
import numpy as np
X, y = make_classification()
groups = np.zeros_like(y, dtype=int)
groups[y.size//2:] = 1
cv = LeaveOneGroupOut()
splits = cv.split(X, y, groups=groups)
clf = KNeighborsClassifier(n_neighbors=5)
seq = SequentialFeatureSelector(clf, n_features_to_select=5, scoring='accuracy', cv=splits)
seq.fit(X, y)
Expected Results
Runs without errors.
Actual Results
IndexError: list index out of range
File "sklearn/model_selection/_validation.py", line 1930, in _aggregate_score_dicts
for key in scores[0]
Observed Failure and Explanation
- When cv is a generator, it is consumed during the first scoring pass. Subsequent passes see an exhausted iterable, leading to empty results and IndexError in _aggregate_score_dicts.
- Converting to a list (cv=list(splits)) works around the issue, but other estimators that take cv, such as the *SearchCV classes, accept a generator directly, so SFS is inconsistent with them.
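The exhaustion described above can be reproduced without scikit-learn. The following is a minimal sketch in plain Python; `make_splits` and `score_all_folds` are hypothetical stand-ins for a splitter's `split` method and a scoring pass, not library functions.

```python
def make_splits():
    """Hypothetical stand-in for a splitter's .split(): yields (train, test) pairs."""
    yield [2, 3], [0, 1]
    yield [0, 1], [2, 3]


def score_all_folds(cv):
    """Toy scoring pass: iterates over cv once and returns one value per fold."""
    return [len(test) for _train, test in cv]


splits = make_splits()
first_pass = score_all_folds(splits)   # consumes the generator
second_pass = score_all_folds(splits)  # generator is exhausted, so no folds remain

# Materializing the splits once makes them reusable across passes.
reusable = list(make_splits())
list_first = score_all_folds(reusable)
list_second = score_all_folds(reusable)
```

The empty `second_pass` corresponds to the empty score list that `_aggregate_score_dicts` later indexes with `scores[0]`.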
Specification (research by Emerson Gray)
- Align SequentialFeatureSelector cv handling with *SearchCV classes.
- In fit, call check_cv(self.cv, y, classifier=is_classifier(self.estimator)) once to normalize cv.
- If cv is an iterable/generator, check_cv will materialize it once to a reusable list via its wrapper.
- Thread the normalized cv into all cross_val_score calls within SFS to prevent re-listing/re-consuming.
- Update SFS cv parameter docs to mirror *SearchCV: accept int, splitter objects, or iterables of (train, test) indices. Note that iterables are materialized once (memory cost).
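As an illustration of the normalization step, the sketch below passes a LeaveOneGroupOut generator through `check_cv` and iterates the result twice. The small arrays are made up for the example; only `check_cv` and `LeaveOneGroupOut` come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, check_cv

# Toy data, assumed only for this illustration.
X = np.arange(16, dtype=float).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])

splits = LeaveOneGroupOut().split(X, y, groups=groups)  # a one-shot generator

# check_cv wraps the iterable in an object with a reusable .split() method.
cv = check_cv(splits, y, classifier=True)

first = [(train.tolist(), test.tolist()) for train, test in cv.split(X, y)]
second = [(train.tolist(), test.tolist()) for train, test in cv.split(X, y)]
n_splits = cv.get_n_splits()
```

Both passes see the same two folds, which is the property SFS needs when it scores many candidate features with the same cv.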
Implementation Outline
- File: sklearn/feature_selection/_sequential.py
  - Imports: add from ..base import is_classifier and from ..model_selection import check_cv, cross_val_score.
  - In SequentialFeatureSelector.fit: cv = check_cv(self.cv, y, classifier=is_classifier(self.estimator)), then pass cv into scoring helper.
  - Update _get_best_new_feature_score to accept cv and use it in cross_val_score.
  - Update docstring for cv parameter.
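The outline can be sketched outside the library as a simplified forward-selection loop. This is not the actual `_sequential.py` code: `forward_select` is a hypothetical helper written for illustration, and it only demonstrates the pattern of normalizing cv once and reusing it in every `cross_val_score` call.

```python
import numpy as np
from sklearn.base import clone, is_classifier
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut, check_cv, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def forward_select(estimator, X, y, n_features_to_select, cv=5, scoring=None):
    """Hypothetical, simplified forward selection (not the library implementation)."""
    # Normalize cv once; a generator is materialized here and never re-consumed.
    cv = check_cv(cv, y, classifier=is_classifier(estimator))
    selected = np.zeros(X.shape[1], dtype=bool)
    for _ in range(n_features_to_select):
        best_score, best_idx = -np.inf, None
        for idx in np.flatnonzero(~selected):
            candidate = selected.copy()
            candidate[idx] = True
            # The same normalized cv object is threaded into every scoring call.
            score = cross_val_score(
                clone(estimator), X[:, candidate], y, cv=cv, scoring=scoring
            ).mean()
            if score > best_score:
                best_score, best_idx = score, idx
        selected[best_idx] = True
    return selected


X, y = make_classification(n_samples=40, n_features=6, random_state=0)
groups = np.zeros_like(y, dtype=int)
groups[y.size // 2:] = 1
splits = LeaveOneGroupOut().split(X, y, groups=groups)  # generator, as in the report

mask = forward_select(
    KNeighborsClassifier(n_neighbors=5), X, y, n_features_to_select=2,
    cv=splits, scoring="accuracy",
)
```

Because `check_cv` returns an object with a `split` method, the inner `cross_val_score` calls accept it unchanged and each call iterates the same stored folds.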
Non-regression Tests
- File: sklearn/feature_selection/tests/test_sequential.py
  - test_sfs_supports_iterable_cv_generator: build splits via LeaveOneGroupOut.split (generator), use as cv in SFS, ensure fit completes and selects expected number of features.
  - test_sfs_baseline_cv_int_runs: verify cv=5 baseline still works.
Versions
Originally observed in 1.2.2; branch scikit-learn__scikit-learn-25973 targets this fix.