Fix cross-validation/grid search/pipeline setup #54
Conversation
Update data to the latest available on figshare (v5). Expression and mutation have not changed since v4, so no changes are expected from this upgrade.

Update sklearn to 0.18.0. See what's new at http://scikit-learn.org/0.18/whats_new.html. Particularly exciting is:

> Fix incomplete `predict_proba` method delegation from `model_selection.GridSearchCV` to `linear_model.SGDClassifier` (scikit-learn/scikit-learn#7159) by Yichuan Liu (@yl565).

`3.TCGA-MLexample_Pathway.ipynb` had an incorrect data location. Fixed so all root notebooks execute.
scikit-learn/scikit-learn#7536 identified that we were performing a risky cross-validation pipeline. Fixes `2.TCGA-MLexample.ipynb` to use the pipeline as an estimator in GridSearchCV rather than GridSearchCV as the last step of the pipeline. In other words, grid search now fits transformations separately on each cross-validation training fold rather than on all of `X_train`.
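A minimal sketch of the corrected setup described above: the whole `Pipeline` is passed as the estimator to `GridSearchCV`, so feature selection is refit inside each training fold instead of seeing the full training set. Data, step names, and parameter values here are illustrative, not taken from the notebook.

```python
# Correct setup: GridSearchCV wraps the pipeline, so SelectKBest only ever
# sees the training portion of each cross-validation fold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data standing in for the TCGA expression matrix.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

pipeline = Pipeline(steps=[
    ('select', SelectKBest()),
    ('classify', SGDClassifier(random_state=0)),
])

param_grid = {
    'select__k': [5, 10],
    'classify__alpha': [1e-3, 1e-2],
}

# The pipeline is the estimator; each candidate is cloned and refit per fold.
cv_pipeline = GridSearchCV(estimator=pipeline, param_grid=param_grid,
                           scoring='roc_auc')
cv_pipeline.fit(X, y)
```

The risky variant would instead call `SelectKBest().fit_transform(X_train, y_train)` once up front, letting the selector peek at every fold's held-out samples.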
Used `git checkout --ours` to merge all conflicts
Fixes the pipeline issue brought up in scikit-learn/scikit-learn#7536. The correct setup, which this PR activates, is to have transformations (as well as classifiers) trained separately for each CV fold.
```python
    np.arange(len(X.columns)).reshape(1, -1)
).tolist()

coef_df = pd.DataFrame.from_items([
```
nice!!
```python
    clf_grid)

# Parameter Sweep for Hyperparameters
param_grid = {
    'select__k': [2000],
```
Why only 2000 here?
I guess I increased it to 2000 from 500, which it was previously. My intention was to leave it the same, but 2000 is still on the low end based on #22. I'd rather deal with this in later PRs though.
```python
    'classify__loss': ['log'],
    'classify__penalty': ['elasticnet'],
    'classify__alpha': [10 ** x for x in range(-3, 1)],
    'classify__l1_ratio': [0, 0.2, 0.8, 1],
```
Also, why the small search space over `l1_ratio`?
One side effect of fixing the pipeline is that things are much slower, as transformations are now run separately on every CV fold. On the sklearn issue we discussed caching, because lots of work is repeated verbatim, but caching is not available yet.
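For reference, later scikit-learn releases (0.19+) added exactly this kind of caching via the `memory` argument of `Pipeline`, which memoizes fitted transformers so identical fits are not repeated across grid-search candidates. A hedged sketch with illustrative data and step names, not runnable on the 0.18.0 pinned by this PR:

```python
# Pipeline `memory` caching (scikit-learn 0.19+): when only classifier
# hyperparameters change between grid candidates, the fitted SelectKBest
# for a given fold is loaded from the cache instead of being refit.
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

cachedir = mkdtemp()  # cache directory for fitted transformers
pipeline = Pipeline(
    steps=[('select', SelectKBest()),
           ('classify', SGDClassifier(random_state=0))],
    memory=cachedir,
)

grid = GridSearchCV(pipeline,
                    {'select__k': [5, 10],
                     'classify__alpha': [1e-3, 1e-2]},
                    scoring='roc_auc')
grid.fit(X, y)
rmtree(cachedir)  # clean up the cache when done
```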
how much slower?
My guess is 3-4 times slower.
```python
# In[29]:
cv_pipeline = GridSearchCV(estimator=pipeline, param_grid=param_grid, n_jobs=-1, scoring='roc_auc')
```
So that I'm clear: the GridSearchCV will build the pipeline according to the steps defined on line 255, so that each step is performed separately in each fold? E.g. SelectKBest is performed on each cross-validation training fold independently?
SelectKBest is fit on every cross-validation training fold independently.
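A quick way to see the per-fold behavior described here: fitting the same pipeline on two different data subsets, as GridSearchCV does internally for each training fold, can select different feature sets. Everything below (data, names, `k` value) is illustrative:

```python
# Fit the pipeline on two disjoint subsets, standing in for two CV
# training folds, and inspect which features SelectKBest retained.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=0)

pipe = Pipeline([('select', SelectKBest(k=5)),
                 ('classify', SGDClassifier(random_state=0))])

# get_support() returns a boolean mask over the 40 input features.
fold_a = pipe.fit(X[:150], y[:150]).named_steps['select'].get_support()
fold_b = pipe.fit(X[150:], y[150:]).named_steps['select'].get_support()

print(np.where(fold_a)[0], np.where(fold_b)[0])  # may differ between folds
```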
```diff
 predict_df = pd.DataFrame.from_items([
     ('sample_id', X_sub.index),
     ('testing', X_sub.index.isin(X_test.index).astype(int)),
     ('status', y_sub),
-    ('decision_function', pipeline.decision_function(X_sub)),
-    ('probability', pipeline.predict_proba(X_sub)[:, 1]),
+    ('decision_function', cv_pipeline.decision_function(X_sub)),
```
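An aside on the construction above (not part of the PR): pandas later deprecated and removed `DataFrame.from_items`. Since Python 3.7 a plain dict preserves insertion order, so the equivalent modern construction is simply the `DataFrame` constructor; values here are placeholders:

```python
# Ordered-column DataFrame construction without the removed from_items.
import pandas as pd

predict_df = pd.DataFrame({
    'sample_id': ['a', 'b', 'c'],  # placeholder values
    'testing': [0, 1, 0],
    'status': [1, 0, 1],
})
print(list(predict_df.columns))
```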
The `cv_pipeline` object is refit with the best hyperparameters found by the grid search?
Yeah, `cv_pipeline` is a GridSearchCV object, which has `refit=True` by default, specifying "Refit the best estimator with the entire dataset".
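A small sketch of that refit behavior (data and names illustrative): after `fit`, the GridSearchCV object holds a `best_estimator_` refit on all the training data and delegates `predict`/`decision_function` to it, which is why `cv_pipeline` can be used directly for prediction.

```python
# With refit=True (the default), the fitted GridSearchCV delegates
# prediction methods to best_estimator_, refit on the full training data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

pipe = Pipeline([('select', SelectKBest()),
                 ('classify', SGDClassifier(random_state=0))])
cv_pipeline = GridSearchCV(pipe, {'select__k': [5, 10]}, scoring='roc_auc')
cv_pipeline.fit(X, y)

# These calls go through best_estimator_, not a fold-specific model.
scores = cv_pipeline.decision_function(X)
print(cv_pipeline.best_params_, scores.shape)
```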
good to know! Thanks
Pull request looks good; just some simple comments/talking points.
Builds on top of #53, since sklearn 0.18.0 was desired. Will rebase on master once #53 is merged. Until then, the first four commits should be considered extrinsic to this pull request.