Implement pre-filtering feature selection for p >> n datasets #29

DM-Berger · 2024-10-18T18:47:40Z

For datasets with over 200 features, df-analyze's current feature selection methods, specifically the stepwise methods, are extremely inefficient. It would be nice to simply pre-filter many-featured datasets down to a more manageable size before continuing with further tuning, selection, etc.

This should be based on associations (e.g. thresholding mutual info or correlations), univariate predictions (though these may be too costly to compute in the first place), or MultiSurf.
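As a rough illustration of the association-based option, here is a minimal sketch that ranks features by absolute Pearson correlation with the target and keeps the top `n_keep`. The function names (`pearson`, `prefilter_by_association`) and the toy data are hypothetical and are not part of df-analyze; mutual information or Cramer's V could be swapped in as the scoring function.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant column carries no association information.
        return 0.0
    return cov / sqrt(var_x * var_y)


def prefilter_by_association(
    features: dict[str, Sequence[float]],
    target: Sequence[float],
    n_keep: int,
) -> list[str]:
    """Return the names of the n_keep features with the largest |r| to the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


# Toy data (hypothetical): two informative features, one noisy, one constant.
features = {
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0],
    "inverse": [5.0, 4.0, 3.0, 2.0, 1.5],
    "noise": [2.0, -1.0, 2.0, -1.0, 2.0],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1]

kept = prefilter_by_association(features, target, n_keep=2)
print(kept)
```

Each feature is scored independently, so the cost grows linearly with the number of features, which is what makes this kind of filter attractive as a first pass before stepwise selection.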

This could be implemented with a few arguments:

--pre-filter
- activate two-step feature selection
--pre-filter-method <relief|assoc|pred>
- specify filter method, e.g. Relief or univariate associations/predictions
--pre-filter-option <option>
- e.g. MultiSurf/Turf/Surf for relief
- AUROC/F1/acc for univariate preds, if affordable
- mutual info / Pearson correlation / Cramér's V for univariate associations
--n-pre-filter <n>
- Number of features to pre-filter down to
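The proposed arguments could be wired up roughly as follows. This is only a sketch using the standard library's `argparse`; the program name, help strings, and `None` defaults are assumptions for illustration, and it does not reflect how df-analyze actually defines its CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # allow_abbrev=False stops "--pre-filter" from being treated as an
    # abbreviation of the longer "--pre-filter-*" flags (or vice versa).
    parser = argparse.ArgumentParser(prog="prefilter-sketch", allow_abbrev=False)
    parser.add_argument(
        "--pre-filter",
        action="store_true",
        help="activate two-step feature selection",
    )
    parser.add_argument(
        "--pre-filter-method",
        choices=["relief", "assoc", "pred"],
        default=None,  # assumption: no default is specified in the proposal
        help="filter method: Relief, univariate associations, or univariate predictions",
    )
    parser.add_argument(
        "--pre-filter-option",
        default=None,
        help="method-specific option, e.g. MultiSurf for relief or AUROC for pred",
    )
    parser.add_argument(
        "--n-pre-filter",
        type=int,
        default=None,  # assumption: no default is specified in the proposal
        help="number of features to pre-filter down to",
    )
    return parser


# Example invocation with hypothetical values.
args = build_parser().parse_args(
    [
        "--pre-filter",
        "--pre-filter-method", "relief",
        "--pre-filter-option", "MultiSurf",
        "--n-pre-filter", "100",
    ]
)
print(args)
```

Validating `--pre-filter-option` against the chosen method (e.g. rejecting `AUROC` when the method is `relief`) is left out here and would need a separate check after parsing.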

DM-Berger added the enhancement New feature or request label Oct 18, 2024