IsolationForest: feature names warning during fit when contamination != 'auto' #51

@rowan-stein

Description

User request

X does not have valid feature names, but IsolationForest was fitted with feature names

Describe the bug

Fitting an IsolationForest on a pd.DataFrame emits the following warning:

X does not have valid feature names, but IsolationForest was fitted with feature names

This occurs only when a non-default contamination value (i.e., anything other than "auto") is supplied. The warning is unexpected because (a) X does have valid feature names and (b) it is raised by fit(), whereas this warning normally indicates that predict or a score method was called with an ndarray after fitting on a DataFrame.

Root cause: When contamination != "auto", IsolationForest computes offset_ by internally calling score_samples on the training data. At that point X has already been validated and converted to an array, triggering a feature-name mismatch when score_samples re-validates with reset=False.
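
To make the mechanism concrete, here is a minimal toy model of the flow described above. It is not scikit-learn code: NamedTable, ToyForest, _validate and the placeholder scoring are hypothetical stand-ins, and the quantile used for offset_ is a simplification. It only shows how re-validating already-converted data with reset=False produces the warning inside fit().

```python
import warnings

MSG = "X does not have valid feature names, but ToyForest was fitted with feature names"


class NamedTable:
    """Hypothetical stand-in for a DataFrame: rows plus column names."""

    def __init__(self, rows, columns):
        self.rows = [list(r) for r in rows]
        self.columns = list(columns)


class ToyForest:
    """Toy model of the validation flow; not the real estimator."""

    def __init__(self, contamination="auto"):
        self.contamination = contamination

    def _validate(self, X, reset):
        names = getattr(X, "columns", None)
        if reset:
            # fit(): remember the feature names seen at fit time
            self.feature_names_in_ = names
        elif names is None and self.feature_names_in_ is not None:
            # later call: nameless input, but names were recorded at fit time
            warnings.warn(MSG, UserWarning)
        # Return plain rows; the column names are dropped here.
        return X.rows if names is not None else [list(r) for r in X]

    def score_samples(self, X):
        X = self._validate(X, reset=False)
        return [-abs(row[0]) for row in X]  # placeholder score, not a real forest

    def fit(self, X):
        X = self._validate(X, reset=True)  # X is now a nameless list of rows
        if self.contamination == "auto":
            self.offset_ = -0.5
        else:
            # Public method re-validates the nameless rows, which warns.
            scores = sorted(self.score_samples(X))
            self.offset_ = scores[int(self.contamination * len(scores))]
        return self


def fit_warnings(contamination):
    """Fit the toy estimator on named data and return the warning messages."""
    X = NamedTable([[-1.1], [0.3], [0.5], [100]], ["a"])
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        ToyForest(contamination=contamination).fit(X)
    return [str(w.message) for w in caught]


print(fit_warnings(0.05))    # warning raised from inside fit()
print(fit_warnings("auto"))  # score_samples is never called, so no warning
```

In the toy, only the non-"auto" branch goes through the public score_samples, which matches the observation that the warning depends on the contamination value.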

Steps/Code to Reproduce

from sklearn.ensemble import IsolationForest
import pandas as pd

X = pd.DataFrame({"a": [-1.1, 0.3, 0.5, 100]})
clf = IsolationForest(random_state=0, contamination=0.05).fit(X)

Expected Results

No "X does not have valid feature names" warning during fit.

Actual Results

UserWarning during fit:

X does not have valid feature names, but IsolationForest was fitted with feature names

Versions

Reproducible with scikit-learn 1.2.x and main; see upstream reports.

Implementation specification (from research)

Rationale: Avoid false-positive feature-name mismatch warnings during internal calls to score_samples in fit() while preserving feature-name validation for all user-facing methods.

Proposed changes (sklearn/ensemble/_iforest.py):

  1. Introduce a private method _score_samples_no_validation(self, X) that computes the anomaly scores without calling _validate_data.
  2. Update public score_samples(self, X) to validate inputs and then delegate to _score_samples_no_validation(X).
  3. In fit(), when contamination != 'auto', call the private method instead of the public one to compute offset_.
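
The three changes above can be sketched on a toy estimator. This is an illustration of the public-validates / private-computes split only, not a patch for sklearn/ensemble/_iforest.py: NamedTable, ToyForestFixed, _validate, collect_warnings and the placeholder scoring are all hypothetical, and only the method name _score_samples_no_validation is taken from the proposal.

```python
import warnings

MSG = "X does not have valid feature names, but ToyForestFixed was fitted with feature names"


class NamedTable:
    """Hypothetical stand-in for a DataFrame: rows plus column names."""

    def __init__(self, rows, columns):
        self.rows = [list(r) for r in rows]
        self.columns = list(columns)


class ToyForestFixed:
    """Toy estimator using the proposed split; not the real estimator."""

    def __init__(self, contamination="auto"):
        self.contamination = contamination

    def _validate(self, X, reset):
        names = getattr(X, "columns", None)
        if reset:
            self.feature_names_in_ = names
        elif names is None and self.feature_names_in_ is not None:
            warnings.warn(MSG, UserWarning)
        return X.rows if names is not None else [list(r) for r in X]

    def _score_samples_no_validation(self, X):
        # Change 1: pure computation, assumes X was already validated.
        return [-abs(row[0]) for row in X]  # placeholder score

    def score_samples(self, X):
        # Change 2: the public method validates, then delegates.
        X = self._validate(X, reset=False)
        return self._score_samples_no_validation(X)

    def fit(self, X):
        X = self._validate(X, reset=True)
        if self.contamination == "auto":
            self.offset_ = -0.5
        else:
            # Change 3: internal call skips re-validation, so no warning.
            scores = sorted(self._score_samples_no_validation(X))
            self.offset_ = scores[int(self.contamination * len(scores))]
        return self


def collect_warnings(func, *args):
    """Call func(*args) and return the messages of all warnings it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func(*args)
    return [str(w.message) for w in caught]


X = NamedTable([[-1.1], [0.3], [0.5], [100]], ["a"])
est = ToyForestFixed(contamination=0.05)
print(collect_warnings(est.fit, X))                  # fit no longer warns
print(collect_warnings(est.score_samples, [[0.3]]))  # user-facing check still fires
```

The point of the split is that validation stays on every user-facing entry point, so a genuine mismatch (fit on named data, score on nameless data) is still reported.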

Testing plan (sklearn/ensemble/tests/test_iforest.py):

  • test_iforest_fit_dataframe_contamination_no_warning: Fit with a pandas DataFrame and contamination set; assert no feature-name warning is emitted by fit.
  • test_iforest_dataframe_then_ndarray_warns_on_score_and_predict: Fit with DataFrame; calling score_samples/predict with ndarray should still warn.
  • test_iforest_fit_dataframe_auto_no_warning_and_offset: With contamination='auto', fit does not warn and offset_ remains the documented value of -0.5.

These changes follow the pattern used in scikit-learn PR scikit-learn#24873 (public method validates, private method performs computation), ensuring internal calls during fit do not re-trigger feature-name checks.
