SFS: support iterable/generator cv without exhaustion (swev-id: scikit-learn__scikit-learn-25973) #55

@rowan-stein

Description

Unable to pass splits to SequentialFeatureSelector

Describe the bug
Passing an iterable of splits (e.g., a generator from a CV splitter) to SequentialFeatureSelector fails. Using cv=5 works, but cv=splits, where splits is the generator returned by LeaveOneGroupOut().split(X, y, groups=groups), raises an IndexError.

Steps/Code to Reproduce

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneGroupOut
import numpy as np

X, y = make_classification()

groups = np.zeros_like(y, dtype=int)
groups[y.size//2:] = 1

cv = LeaveOneGroupOut()
splits = cv.split(X, y, groups=groups)

clf = KNeighborsClassifier(n_neighbors=5)

seq = SequentialFeatureSelector(clf, n_features_to_select=5, scoring='accuracy', cv=splits)
seq.fit(X, y)

Expected Results
Runs without errors.

Actual Results

IndexError: list index out of range
  File "sklearn/model_selection/_validation.py", line 1930, in _aggregate_score_dicts
    for key in scores[0]

Observed Failure and Explanation

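  • When cv is a generator, the first cross_val_score call inside SFS consumes it. Every later call (one per candidate feature) iterates over an exhausted generator, so no splits are evaluated, the list of per-split scores is empty, and scores[0] in _aggregate_score_dicts raises IndexError.
  • Converting to a list (cv=list(splits)) works around the issue, but callers should not have to do this: other cv consumers in scikit-learn, such as the *SearchCV classes, accept an iterable of splits directly.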
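
The exhaustion mechanism can be reproduced without scikit-learn. The following is a minimal sketch in plain Python; make_splits and score_candidate are hypothetical stand-ins for a splitter's split() method and for one cross-validation pass, not scikit-learn functions.

```python
def make_splits():
    """Yield two (train, test) index pairs, like a CV splitter's split()."""
    yield [0, 1], [2, 3]
    yield [2, 3], [0, 1]


def score_candidate(cv):
    """Stand-in for one cross-validation pass: one value per split."""
    return [len(test) for _train, test in cv]


splits = make_splits()
first_pass = score_candidate(splits)   # consumes the generator
second_pass = score_candidate(splits)  # generator is exhausted: no splits left

# Materializing the splits once makes them reusable across passes.
reusable = list(make_splits())
list_first = score_candidate(reusable)
list_second = score_candidate(reusable)
```

Here second_pass is an empty list, so indexing its first element fails in the same way scores[0] does in the traceback above, while the list-backed passes return identical results.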

Specification (research by Emerson Gray)

  • Align SequentialFeatureSelector cv handling with *SearchCV classes.
  • In fit, call check_cv(self.cv, y, classifier=is_classifier(self.estimator)) once to normalize cv.
  • If cv is an iterable/generator, check_cv wraps it in an object (the private _CVIterableWrapper) that materializes the splits into a list once, so they can be iterated repeatedly.
  • Pass the normalized cv to every cross_val_score call within SFS so the original iterable is never consumed a second time.
  • Update SFS cv parameter docs to mirror *SearchCV: accept int, splitter objects, or iterables of (train, test) indices. Note that iterables are materialized once (memory cost).
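
As an illustration of the normalization step, the sketch below passes a generator of splits through check_cv and iterates the result twice. The toy arrays are assumptions made for the example; check_cv and LeaveOneGroupOut are the public scikit-learn APIs named in this issue.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, check_cv

# Toy data (assumed for illustration): 10 samples, two groups of 5.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
groups = np.array([0] * 5 + [1] * 5)

splits = LeaveOneGroupOut().split(X, y, groups=groups)  # one-shot generator
cv = check_cv(splits, y, classifier=True)  # materializes the iterable once

# The normalized object can be split repeatedly without being exhausted.
n_first = sum(1 for _ in cv.split(X, y))
n_second = sum(1 for _ in cv.split(X, y))
```

Both counts equal the number of groups (two here), which is what allows a single normalized cv to serve every candidate feature evaluated by SFS.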

Implementation Outline

  • File: sklearn/feature_selection/_sequential.py
    • Imports: add from ..base import is_classifier and from ..model_selection import check_cv, cross_val_score.
    • In SequentialFeatureSelector.fit: cv = check_cv(self.cv, y, classifier=is_classifier(self.estimator)), then pass cv into scoring helper.
    • Update _get_best_new_feature_score to accept cv and use it in cross_val_score.
    • Update docstring for cv parameter.
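
The outline above could look roughly like the following standalone sketch. It is not the actual _sequential.py code: forward_select is a hypothetical helper that mimics greedy forward selection only to show where check_cv is called and how the normalized cv is reused; the dataset sizes are likewise assumptions.

```python
import numpy as np
from sklearn.base import clone, is_classifier
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut, check_cv, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def forward_select(estimator, X, y, n_features_to_select, cv, scoring=None):
    """Greedy forward selection that normalizes cv once (hypothetical helper)."""
    # Normalize once: a generator of splits becomes a reusable splitter object.
    cv = check_cv(cv, y, classifier=is_classifier(estimator))
    selected = []
    for _ in range(n_features_to_select):
        best_score, best_feature = -np.inf, None
        for feature in range(X.shape[1]):
            if feature in selected:
                continue
            candidate = selected + [feature]
            # The same normalized cv is reused for every candidate feature set.
            score = cross_val_score(
                clone(estimator), X[:, candidate], y, cv=cv, scoring=scoring
            ).mean()
            if score > best_score:
                best_score, best_feature = score, feature
        selected.append(best_feature)
    return selected


X, y = make_classification(n_samples=60, n_features=6, random_state=0)
groups = np.zeros_like(y, dtype=int)
groups[y.size // 2:] = 1
splits = LeaveOneGroupOut().split(X, y, groups=groups)  # generator, used as-is
selected = forward_select(
    KNeighborsClassifier(n_neighbors=5), X, y, 3, cv=splits, scoring="accuracy"
)
```

Because check_cv runs before the selection loop, the generator is consumed exactly once, and each cross_val_score call sees the full set of splits.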

Non-regression Tests

  • File: sklearn/feature_selection/tests/test_sequential.py
    • test_sfs_supports_iterable_cv_generator: build splits via LeaveOneGroupOut.split (generator), use as cv in SFS, ensure fit completes and selects expected number of features.
    • test_sfs_baseline_cv_int_runs: verify cv=5 baseline still works.

Versions
Originally observed in 1.2.2; branch scikit-learn__scikit-learn-25973 targets this fix.
