Description
User request
X does not have valid feature names, but IsolationForest was fitted with feature names
Describe the bug
If you fit an IsolationForest using a pd.DataFrame it generates a warning:
X does not have valid feature names, but IsolationForest was fitted with feature names
This only occurs when a non-default contamination value (i.e., not "auto") is supplied. The warning is unexpected because (a) X does have valid feature names and (b) it is raised by fit(), whereas this warning usually indicates that predict/score methods were called with an ndarray after fitting on a DataFrame.
Root cause: When contamination != "auto", IsolationForest computes offset_ by internally calling score_samples on the training data. At that point X has already been validated and converted to an array, triggering a feature-name mismatch when score_samples re-validates with reset=False.
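The flow described above can be reproduced in miniature without scikit-learn. The sketch below is a toy model, not the library's code: `ToyForest`, its `_validate` helper, the dict-of-columns stand-in for a DataFrame, and the placeholder scoring and quantile rules are all hypothetical. It only illustrates why re-validating already-converted data with `reset=False` produces the warning during `fit`.

```python
import warnings

MSG = "X does not have valid feature names, but {name} was fitted with feature names"


class ToyForest:
    """Hypothetical estimator mimicking only the validation flow (not sklearn code)."""

    def __init__(self, contamination="auto"):
        self.contamination = contamination

    def _validate(self, X, reset):
        # A dict of columns stands in for a DataFrame; a list of rows for an ndarray.
        names = list(X.keys()) if isinstance(X, dict) else None
        if reset:
            self.feature_names_in_ = names
        elif self.feature_names_in_ is not None and names is None:
            warnings.warn(MSG.format(name=type(self).__name__), UserWarning)
        if names is not None:
            return [list(row) for row in zip(*X.values())]
        return [list(row) for row in X]

    def score_samples(self, X):
        rows = self._validate(X, reset=False)
        return [-abs(row[0]) for row in rows]  # placeholder score, not the real one

    def fit(self, X):
        rows = self._validate(X, reset=True)  # rows is a plain list: names are gone
        if self.contamination == "auto":
            self.offset_ = -0.5  # value documented for contamination="auto"
        else:
            # Public method re-validates the name-less rows, hence the warning.
            scores = sorted(self.score_samples(rows))
            self.offset_ = scores[int(self.contamination * len(scores))]  # toy quantile
        return self


def warnings_from_fit(contamination):
    """Fit the toy model on named data and return the warning messages raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        ToyForest(contamination=contamination).fit({"a": [-1.1, 0.3, 0.5, 100]})
    return [str(w.message) for w in caught]


print(warnings_from_fit(0.05))    # one feature-name warning
print(warnings_from_fit("auto"))  # no warning
```

In this toy model, only the non-"auto" branch reaches the public `score_samples`, which matches the observation that the warning depends on the contamination setting.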
Steps/Code to Reproduce
from sklearn.ensemble import IsolationForest
import pandas as pd
X = pd.DataFrame({"a": [-1.1, 0.3, 0.5, 100]})
clf = IsolationForest(random_state=0, contamination=0.05).fit(X)
Expected Results
No "X does not have valid feature names" warning during fit.
Actual Results
UserWarning during fit:
X does not have valid feature names, but IsolationForest was fitted with feature names
Versions
Reproducible with scikit-learn 1.2.x and main; see upstream reports.
Implementation specification (from research)
Rationale: Avoid false-positive feature-name mismatch warnings during internal calls to score_samples in fit() while preserving feature-name validation for all user-facing methods.
Proposed changes (sklearn/ensemble/_iforest.py):
- Introduce a private method _score_samples_no_validation(self, X) that computes score samples without calling _validate_data.
- Update public score_samples(self, X) to validate inputs and then delegate to _score_samples_no_validation(X).
- In fit(), when contamination != 'auto', call the private method instead of the public one to compute offset_.
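The three proposed changes can be sketched on a toy class to show the shape of the split. This is an illustration under stated assumptions, not the actual patch: `ToyForestFixed`, `_validate`, the dict-of-columns stand-in for a DataFrame, and the placeholder scoring are hypothetical; only the method name `_score_samples_no_validation` comes from the proposal.

```python
import warnings

MSG = "X does not have valid feature names, but {name} was fitted with feature names"


class ToyForestFixed:
    """Hypothetical estimator showing the public/private split (not sklearn code)."""

    def __init__(self, contamination="auto"):
        self.contamination = contamination

    def _validate(self, X, reset):
        # A dict of columns stands in for a DataFrame; a list of rows for an ndarray.
        names = list(X.keys()) if isinstance(X, dict) else None
        if reset:
            self.feature_names_in_ = names
        elif self.feature_names_in_ is not None and names is None:
            warnings.warn(MSG.format(name=type(self).__name__), UserWarning)
        if names is not None:
            return [list(row) for row in zip(*X.values())]
        return [list(row) for row in X]

    def _score_samples_no_validation(self, rows):
        # Pure computation on already-validated rows; placeholder score.
        return [-abs(row[0]) for row in rows]

    def score_samples(self, X):
        # User-facing entry point: validate first, then delegate.
        rows = self._validate(X, reset=False)
        return self._score_samples_no_validation(rows)

    def fit(self, X):
        rows = self._validate(X, reset=True)
        if self.contamination == "auto":
            self.offset_ = -0.5  # value documented for contamination="auto"
        else:
            # Internal call skips re-validation, so no feature-name check fires.
            scores = sorted(self._score_samples_no_validation(rows))
            self.offset_ = scores[int(self.contamination * len(scores))]  # toy quantile
        return self


def captured(func, *args):
    """Run func and return the warning messages it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func(*args)
    return [str(w.message) for w in caught]


named = {"a": [-1.1, 0.3, 0.5, 100]}
model = ToyForestFixed(contamination=0.05)
print(captured(model.fit, named))                   # no warning during fit
print(captured(model.score_samples, [[0.1], [2]]))  # still warns for name-less input
```

The design choice is that validation lives only at the user-facing boundary, so the check keeps firing for callers who pass name-less data while internal reuse of the scoring logic stays silent.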
Testing plan (sklearn/ensemble/tests/test_iforest.py):
- test_iforest_fit_dataframe_contamination_no_warning: Fit with a pandas DataFrame and contamination set; assert no feature-name warning is emitted by fit.
- test_iforest_dataframe_then_ndarray_warns_on_score_and_predict: Fit with DataFrame; calling score_samples/predict with ndarray should still warn.
- test_iforest_fit_dataframe_auto_no_warning_and_offset: With contamination='auto', fit does not warn and offset_ keeps its documented value of -0.5.
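All three tests need to tell the feature-name warning apart from unrelated warnings. One way to do that is a small shared helper built on the standard `warnings` module. The helper name `feature_name_warnings` and the stand-in functions below are hypothetical and only demonstrate the filtering; they are not part of the scikit-learn test suite.

```python
import warnings

FEATURE_NAME_FRAGMENT = "does not have valid feature names"


def feature_name_warnings(func, *args, **kwargs):
    """Call func and return messages of feature-name UserWarnings it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is suppressed
        func(*args, **kwargs)
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, UserWarning)
        and FEATURE_NAME_FRAGMENT in str(w.message)
    ]


# Stand-ins for estimator calls, used here only to exercise the helper.
def noisy_call():
    warnings.warn(
        "X does not have valid feature names, but IsolationForest "
        "was fitted with feature names",
        UserWarning,
    )


def unrelated_call():
    warnings.warn("something else entirely", DeprecationWarning)


print(len(feature_name_warnings(noisy_call)))      # 1
print(len(feature_name_warnings(unrelated_call)))  # 0
```

In the real tests, the callable would presumably be something like `IsolationForest(random_state=0, contamination=0.05).fit` with the DataFrame as its argument, asserting an empty list for the fit cases and a non-empty list when `score_samples` or `predict` receives an ndarray after fitting on a DataFrame.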
These changes follow the pattern used in scikit-learn PR scikit-learn#24873 (public method validates, private method performs computation), ensuring internal calls during fit do not re-trigger feature-name checks.