
Fix cross-validation/grid search/pipeline setup #54

Merged (8 commits) on Oct 6, 2016

Conversation

@dhimmel (Member) commented Oct 3, 2016

Builds on top of #53 since sklearn 18.0 was desired. Will rebase on master once #53 is merged. Until then, the first four commits should be considered extrinsic to this pull request.

Update data to latest available on figshare (v5). Expression and mutation have
not changed since v4, so do not expect changes from this upgrade.

Update sklearn to 0.18.0. See what's new at
http://scikit-learn.org/0.18/whats_new.html. Particularly exciting is:

> Fix incomplete `predict_proba` method delegation from
`model_selection.GridSearchCV` to `linear_model.SGDClassifier`
(scikit-learn/scikit-learn#7159) by Yichuan Liu (@yl565).

`3.TCGA-MLexample_Pathway.ipynb` had an incorrect data location. Fixed so all
root notebooks execute.
scikit-learn/scikit-learn#7536 identified that we were
performing a risky cross-validation pipeline. Fixes `2.TCGA-MLexample.ipynb` to
use the pipeline as an estimator in GridSearchCV rather than GridSearchCV as the
last step of the pipeline. In other words, grid search now transforms separately
on each cross-validation training fold rather than all of `X_train`.
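A minimal sketch of the corrected setup described above: the Pipeline is the estimator handed to GridSearchCV, so every transformation step is refit inside each cross-validation training fold. The data and parameter values here are illustrative, not the notebook's actual TCGA matrices or grid.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

pipeline = Pipeline(steps=[
    ('select', SelectKBest()),
    ('standardize', StandardScaler()),
    ('classify', SGDClassifier(random_state=0)),
])

# GridSearchCV wraps the whole pipeline, not the other way around, so
# SelectKBest and StandardScaler are refit on each training fold.
cv_pipeline = GridSearchCV(estimator=pipeline,
                           param_grid={'select__k': [10, 20]},
                           scoring='roc_auc')
cv_pipeline.fit(X, y)
```

The previous (risky) arrangement put GridSearchCV as the last step of the pipeline, so the upstream transformers saw all of `X_train` before any fold was held out.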
@dhimmel dhimmel changed the title [WIP] Fix cross-validation/grid search/pipeline setup Fix cross-validation/grid search/pipeline setup Oct 6, 2016
@dhimmel (Member Author) commented Oct 6, 2016

Fixes the pipeline issue brought up in scikit-learn/scikit-learn#7536. The correct setup, which this PR activates, is to have transformations (as well as classifiers) trained separately for each CV fold.
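A small demonstration (not from the PR) of why the old arrangement was risky: on pure-noise data, fitting a feature selector on all of the training data before cross-validating leaks held-out labels into the selector and inflates the CV score, while the pipeline-per-fold setup stays near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.randn(100, 2000)          # pure noise: no real signal
y = rng.randint(0, 2, 100)

# Risky: the selector sees every fold's labels before CV splits the data.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_leaky, y,
                        scoring='roc_auc').mean()

# Correct: the selector is refit inside each training fold.
pipe = Pipeline([('select', SelectKBest(f_classif, k=20)),
                 ('classify', LogisticRegression())])
honest = cross_val_score(pipe, X, y, scoring='roc_auc').mean()

# leaky AUC is inflated well above chance; honest AUC stays near 0.5
print(leaky, honest)
```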

np.arange(len(X.columns)).reshape(1, -1)
).tolist()

coef_df = pd.DataFrame.from_items([
Member:

nice!!

clf_grid)
# Parameter Sweep for Hyperparameters
param_grid = {
'select__k': [2000],
Member:

Why only 2000 here?

Member Author:

I guess I increased it from 500, which it was previously. My intention was to leave it unchanged, but 2000 is still on the low end based on #22. I'd rather deal with this in later PRs though.

'classify__loss': ['log'],
'classify__penalty': ['elasticnet'],
'classify__alpha': [10 ** x for x in range(-3, 1)],
'classify__l1_ratio': [0, 0.2, 0.8, 1],
Member:

also, why small search space over l1 ratio?

Member Author:

One side effect of fixing the pipeline is that things are much slower, as transformations are now run separately on every CV fold. On the sklearn issue we discussed caching, because lots of work is repeated verbatim, but caching is not available yet.
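The caching discussed above did land in a later release: scikit-learn 0.19 added a `memory` argument to Pipeline that memoizes fitted transformers on disk, so identical upstream fits are reused across grid-search candidates. A sketch on synthetic data, assuming a modern scikit-learn:

```python
import shutil
from tempfile import mkdtemp
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=50, random_state=0)

cachedir = mkdtemp()
pipe = Pipeline([('select', SelectKBest(k=10)),
                 ('classify', LogisticRegression())],
                memory=cachedir)  # fitted transformers cached on disk

# select__k is fixed, so the per-fold SelectKBest fit is computed once
# per fold and reused while classify__C varies.
grid = GridSearchCV(pipe, {'classify__C': [0.1, 1.0]}, cv=3)
grid.fit(X, y)

shutil.rmtree(cachedir)  # clean up the cache directory
```

Caching only pays off when upstream step parameters are not themselves being swept, as in this grid.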

Member:

how much slower?

Member Author:

My guess is 3-4 times slower.


# In[29]:
cv_pipeline = GridSearchCV(estimator=pipeline, param_grid=param_grid, n_jobs=-1, scoring='roc_auc')
Member:

so that I'm clear, the GridSearchCV will build the pipeline according to the steps defined on line 255? So that each step is performed separately in each fold. E.g. SelectKBest is performed on each cross validation training fold independently?

Member Author:

SelectKBest is fit on every cross validation training fold independently.
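A toy sketch of that point, on synthetic data rather than the TCGA matrices: because the selector is refit per training fold, the chosen feature set can differ from fold to fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X = rng.randn(90, 30)
y = rng.randint(0, 2, 90)

chosen = []
for train_idx, _ in KFold(n_splits=3).split(X):
    # Fit the selector on this fold's training samples only
    selector = SelectKBest(f_classif, k=5).fit(X[train_idx], y[train_idx])
    chosen.append(frozenset(np.flatnonzero(selector.get_support())))

print(chosen)  # typically three different 5-feature sets
```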


predict_df = pd.DataFrame.from_items([
('sample_id', X_sub.index),
('testing', X_sub.index.isin(X_test.index).astype(int)),
('status', y_sub),
('decision_function', pipeline.decision_function(X_sub)),
('probability', pipeline.predict_proba(X_sub)[:, 1]),
('decision_function', cv_pipeline.decision_function(X_sub)),
Member:

the cv_pipeline object is refit with the optimal hyperparameters found by the grid search?

Member Author:

Yeah cv_pipeline is a GridSearchCV object which has refit=True by default specifying "Refit the best estimator with the entire dataset".
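A sketch of that refit behaviour on synthetic data (these are not the notebook's actual objects): with `refit=True`, after `fit` the GridSearchCV object delegates `predict_proba` and `decision_function` to a `best_estimator_` that was refit on all the training data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=120, n_features=40, random_state=0)

pipe = Pipeline([('select', SelectKBest(k=10)),
                 ('classify', LogisticRegression())])
cv_pipeline = GridSearchCV(pipe, {'classify__C': [0.1, 1.0]}, cv=3)
cv_pipeline.fit(X, y)

# Both calls delegate to best_estimator_, available because refit=True.
proba = cv_pipeline.predict_proba(X)[:, 1]
decision = cv_pipeline.decision_function(X)
```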

Member:

good to know! Thanks

@gwaybio gwaybio (Member) left a comment:

Pull request looks good, just some simple comments/talking points

@dhimmel dhimmel merged commit beba9a5 into cognoma:master Oct 6, 2016
@dhimmel dhimmel deleted the cvfix branch October 6, 2016 19:29
@dhimmel dhimmel mentioned this pull request Dec 9, 2016