Voting methods for feature ranking in efs #112

Merged
merged 75 commits into from
Nov 30, 2024

Conversation

bblodfon
Contributor

@bblodfon bblodfon commented Jul 31, 2024

  • Use fastVoteR, where 4 voting theory methods are now implemented in Rcpp
  • Add embedded_ensemble_fselect()
  • Refactor and simplify the code in both ensemble feature selection functions and EnsembleFSResult()
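
To illustrate what ranking features by voting means here, below is a minimal base-R sketch of approval voting, one common voting-theory rule. It does not call fastVoteR or reproduce its Rcpp implementation; the `voters` and `candidates` data and the `approval_rank()` helper are hypothetical names made up for this example. Each "voter" is the feature subset selected in one resampling iteration, and a feature's score is the number of subsets that contain it.

```r
# Hypothetical input: each voter is the feature subset selected in one iteration
voters = list(
  c("V1", "V2", "V3"),
  c("V1", "V3"),
  c("V2", "V3", "V4"),
  c("V3")
)
candidates = c("V1", "V2", "V3", "V4", "V5")

# Approval voting (illustrative helper, not the fastVoteR API):
# a feature's score is the number of voters whose subset includes it
approval_rank = function(voters, candidates) {
  score = vapply(candidates, function(f) {
    sum(vapply(voters, function(v) f %in% v, logical(1)))
  }, numeric(1))
  ranking = data.frame(feature = candidates, score = unname(score))
  # Sort by decreasing score; ties keep the original candidate order
  ranking[order(-ranking$score), ]
}

ranking = approval_rank(voters, candidates)
print(ranking)
```

Features that are never selected (here `V5`) still appear in the ranking with a score of zero, so the output always covers every candidate.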

#' can be changed with `$set_active_measure()`.
#' @param inner_measure ([mlr3::Measure])\cr
#' The inner measure used to optimize and score the learners on the train sets
#' generated during the ensemble feature selection process.
Member

Can we say that differently? Scoring on a train set sounds wrong. Is this the outer train set, which is split by the inner resampling? Do we score the inner resample result?

Contributor Author

Yes, it's the outer train set. The inner_resampling generates N train/test splits. The inner_measure is used to optimize/tune on the train set, and you get the best subset and final model + score on that train set. We use these final models to also score the corresponding test splits (the inner resampling result you ask about), with the measure. In embedded efs we only do the second (no inner_measure is needed/used).

Contributor Author

I can change the wording to specifically mention the train/test splits of the inner resampling (I also mention that earlier in the doc). What do you think?

Member

> you get the best subset and final model + score on that train set

It is the final model with the best subset and the corresponding performance estimated on the inner resampling. There is no scoring on the outer training set, only scoring on the inner resampling result. This is very similar to nested resampling. Maybe stick to the words used below Figure 4.5:

https://mlr3book.mlr-org.com/chapters/chapter4/hyperparameter_optimization.html#sec-nested-resampling

Contributor Author

Yes, sorry Marc, it's as you say. When I was writing the above comment, I meant the outer resampling (what we call init_resampling) as the one that generates the train/test splits. And yes, we are pretty much doing nested CV, with the outer resampling being the N-times holdout split. I will update the doc.
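
The nested structure agreed on in this thread can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: it uses `lm()` on `mtcars`, a fixed list of hypothetical candidate subsets in place of a real feature selector, RMSE standing in for `inner_measure`, and MAE standing in for `measure`. The helper names (`fit_predict()`, `inner_cv_score()`) are made up for the example.

```r
set.seed(1)
data = mtcars
target = "mpg"

# Hypothetical candidate subsets; a real fselector would search these
candidates = list(c("wt"), c("wt", "hp"), c("wt", "hp", "qsec"))

rmse = function(truth, pred) sqrt(mean((truth - pred)^2))  # stands in for inner_measure
mae = function(truth, pred) mean(abs(truth - pred))        # stands in for measure

# Fit a linear model on `train` with the given features and predict on `test`
fit_predict = function(features, train, test) {
  model = lm(reformulate(features, response = target), data = train)
  predict(model, newdata = test)
}

# Inner resampling: k-fold CV that only ever sees the outer train set
inner_cv_score = function(features, train, folds = 3) {
  fold_id = sample(rep(seq_len(folds), length.out = nrow(train)))
  mean(sapply(seq_len(folds), function(k) {
    pred = fit_predict(features, train[fold_id != k, ], train[fold_id == k, ])
    rmse(train[fold_id == k, target], pred)
  }))
}

# Outer resampling (init_resampling): N holdout splits
n_splits = 4
results = lapply(seq_len(n_splits), function(i) {
  train_idx = sample(nrow(data), size = floor(0.8 * nrow(data)))
  train = data[train_idx, ]
  test = data[-train_idx, ]

  # Pick the subset with the best inner-CV estimate (lower RMSE is better)
  inner_scores = sapply(candidates, inner_cv_score, train = train)
  best = candidates[[which.min(inner_scores)]]

  # Final model: best subset, trained on the outer train set,
  # then scored on the outer test set with the outer measure
  pred = fit_predict(best, train, test)
  list(
    features = best,
    inner_score = min(inner_scores),
    outer_score = mae(test[[target]], pred)
  )
})
```

Each element of `results` holds one selected subset together with two different estimates: `inner_score` comes from the inner resampling on the outer train set, and `outer_score` comes from the held-out outer test set. Nothing is scored directly on the outer train set.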

@be-marc be-marc merged commit 003f6e9 into main Nov 30, 2024
5 checks passed
@be-marc be-marc deleted the voting_methods branch November 30, 2024 11:03