Implement pre-filtering feature selection for p >> n datasets #29

Open
DM-Berger opened this issue Oct 18, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@DM-Berger
Collaborator

For datasets with over 200 features, df-analyze's current feature selection methods, specifically the stepwise methods, are extremely inefficient. It would be nice to simply pre-filter many-featured datasets down to a more manageable size before continuing with further tuning and selection.

The pre-filter should be based on one of: associations (e.g. thresholding mutual info or correlations), univariate predictions (though these may be too costly to compute in the first place), or a Relief-based method such as MultiSurf.
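To make the association-based option concrete, here is a minimal sketch that ranks features by absolute Pearson correlation with the target and keeps the top `n`. It is illustrative only and not df-analyze code: the function names `pearson` and `prefilter_by_correlation`, the dict-of-columns input format, and the toy data are all assumptions, and a real implementation would more likely use vectorized NumPy/pandas operations and handle categorical features separately.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either input is constant (correlation is undefined there),
    so constant features are ranked last instead of raising an error.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def prefilter_by_correlation(features, target, n_keep):
    """Hypothetical pre-filter: keep the n_keep features with the largest |r|.

    `features` maps a feature name to its list of values; the result is a list
    of feature names ordered from strongest to weakest association.
    """
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:n_keep]


# Toy data (made up for illustration): two informative features, one noisy, one constant.
features = {
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, -1.0, 0.5, -1.0, 2.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
    "inverse": [5.0, 4.0, 3.0, 2.5, 1.0],
}
target = [1.1, 1.9, 3.2, 3.9, 5.0]

selected = prefilter_by_correlation(features, target, n_keep=2)
print(selected)
```

Because the ranking uses the absolute value, strongly negatively correlated features survive the filter as well. Each score is computed independently per feature, so the cost grows linearly with the number of features, which is the property that makes this kind of filter attractive ahead of stepwise selection.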

This could be implemented with a few arguments:

  • --pre-filter
    • activate two-step feature selection
  • --pre-filter-method <relief|assoc|pred>
    • specify filter method, e.g. Relief or univariate associations/predictions
  • --pre-filter-option <option>
    • e.g. MultiSurf/Turf/Surf for relief
    • AUROC/F1/acc for univariate preds, if affordable
    • mutual info / Pearson correlation / Cramér's V for univariate associations
  • --n-pre-filter <n>
    • Number of features to pre-filter down to
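As a rough sketch of how the proposed arguments could be declared, the following uses Python's standard `argparse`. The flag names and the `relief|assoc|pred` choices come from the list above; everything else (the `build_parser` helper, the program name, the `assoc` default, the help strings, and leaving `--pre-filter-option` as a free-form string) is an assumption for illustration and does not reflect how df-analyze actually defines its CLI.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the proposed pre-filter arguments."""
    parser = argparse.ArgumentParser(prog="prefilter-sketch")
    parser.add_argument(
        "--pre-filter",
        action="store_true",
        help="activate two-step feature selection",
    )
    parser.add_argument(
        "--pre-filter-method",
        choices=["relief", "assoc", "pred"],
        default="assoc",  # assumed default; the proposal does not specify one
        help="filter method: Relief, univariate associations, or univariate predictions",
    )
    parser.add_argument(
        "--pre-filter-option",
        default=None,
        help="method-specific option, e.g. a Relief variant, a metric, or an association measure",
    )
    parser.add_argument(
        "--n-pre-filter",
        type=int,
        default=None,
        help="number of features to pre-filter down to",
    )
    return parser


# Example invocation with made-up values.
args = build_parser().parse_args(
    [
        "--pre-filter",
        "--pre-filter-method", "relief",
        "--pre-filter-option", "multisurf",
        "--n-pre-filter", "100",
    ]
)
print(args.pre_filter, args.pre_filter_method, args.pre_filter_option, args.n_pre_filter)
```

Since the valid values of `--pre-filter-option` depend on `--pre-filter-method`, a real implementation would probably need a validation step after parsing rather than a static `choices` list.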
@DM-Berger DM-Berger added the enhancement New feature or request label Oct 18, 2024