Fix cross-validation/grid search/pipeline setup #54
Conversation
Update data to the latest available on figshare (v5). Expression and mutation have not changed since v4, so no changes are expected from this upgrade.

Update sklearn to 0.18.0. See what's new at http://scikit-learn.org/0.18/whats_new.html. Particularly exciting is:

> Fix incomplete `predict_proba` method delegation from `model_selection.GridSearchCV` to `linear_model.SGDClassifier` (scikit-learn/scikit-learn#7159) by Yichuan Liu (@yl565).

`3.TCGA-MLexample_Pathway.ipynb` had an incorrect data location. Fixed so all root notebooks execute.
scikit-learn/scikit-learn#7536 identified that we were performing a risky cross-validation pipeline. Fixes `2.TCGA-MLexample.ipynb` to use the pipeline as an estimator in GridSearchCV rather than GridSearchCV as the last step of the pipeline. In other words, grid search now fits transformations separately on each cross-validation training fold rather than on all of `X_train`.
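A minimal sketch of the corrected setup described above: the whole `Pipeline` is passed as the estimator to `GridSearchCV`, so feature selection is refit inside each training fold instead of seeing the full training set. Data, step names, and parameter values here are illustrative, not taken from the notebook.

```python
# Correct setup: GridSearchCV wraps the pipeline, so SelectKBest only ever
# sees the training portion of each cross-validation fold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data standing in for the TCGA expression matrix.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

pipeline = Pipeline(steps=[
    ('select', SelectKBest()),
    ('classify', SGDClassifier(random_state=0)),
])

param_grid = {
    'select__k': [5, 10],
    'classify__alpha': [1e-3, 1e-2],
}

# The pipeline is the estimator; each candidate is cloned and refit per fold.
cv_pipeline = GridSearchCV(estimator=pipeline, param_grid=param_grid,
                           scoring='roc_auc')
cv_pipeline.fit(X, y)
```

The risky variant would instead call `SelectKBest().fit_transform(X_train, y_train)` once up front, letting the selector peek at every fold's held-out samples.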
Used `git checkout --ours` to merge all conflicts
Fixes the pipeline issue brought up in scikit-learn/scikit-learn#7536. The correct setup, which this PR activates, is to have transformations (as well as classifiers) trained separately for each CV fold.
```python
    np.arange(len(X.columns)).reshape(1, -1)
).tolist()

coef_df = pd.DataFrame.from_items([
```
nice!!
```python
    clf_grid)

# Parameter Sweep for Hyperparameters
param_grid = {
    'select__k': [2000],
```
Why only 2000 here?
I guess I increased it to 2000 from 500, which it was previously. My intention was to leave it the same, but 2000 is still on the low end based on #22. I'd rather deal with this in later PRs though.
```python
    'classify__loss': ['log'],
    'classify__penalty': ['elasticnet'],
    'classify__alpha': [10 ** x for x in range(-3, 1)],
    'classify__l1_ratio': [0, 0.2, 0.8, 1],
```
Also, why the small search space over `l1_ratio`?
One side effect of fixing the pipeline is that things are much slower, as transformations are now run separately on every CV fold. On the sklearn issue we discussed caching, because lots of work is repeated verbatim, but caching is not available yet.
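For reference, later scikit-learn releases (0.19+) added exactly this kind of caching via the `memory` argument of `Pipeline`, which memoizes fitted transformers so identical fits are not repeated across grid-search candidates. A hedged sketch with illustrative data and step names, not runnable on the 0.18.0 pinned by this PR:

```python
# Pipeline `memory` caching (scikit-learn 0.19+): when only classifier
# hyperparameters change between grid candidates, the fitted SelectKBest
# for a given fold is loaded from the cache instead of being refit.
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

cachedir = mkdtemp()  # cache directory for fitted transformers
pipeline = Pipeline(
    steps=[('select', SelectKBest()),
           ('classify', SGDClassifier(random_state=0))],
    memory=cachedir,
)

grid = GridSearchCV(pipeline,
                    {'select__k': [5, 10],
                     'classify__alpha': [1e-3, 1e-2]},
                    scoring='roc_auc')
grid.fit(X, y)
rmtree(cachedir)  # clean up the cache when done
```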
how much slower?
My guess is 3-4 times slower.
```python
# In[29]:
cv_pipeline = GridSearchCV(estimator=pipeline, param_grid=param_grid, n_jobs=-1, scoring='roc_auc')
```
So that I'm clear: the GridSearchCV will build the pipeline according to the steps defined on line 255, so that each step is performed separately in each fold? E.g. SelectKBest is performed on each cross-validation training fold independently?
SelectKBest is fit on every cross-validation training fold independently.
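A quick way to see the per-fold behavior described here: fitting the same pipeline on two different data subsets, as GridSearchCV does internally for each training fold, can select different feature sets. Everything below (data, names, `k` value) is illustrative:

```python
# Fit the pipeline on two disjoint subsets, standing in for two CV
# training folds, and inspect which features SelectKBest retained.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=0)

pipe = Pipeline([('select', SelectKBest(k=5)),
                 ('classify', SGDClassifier(random_state=0))])

# get_support() returns a boolean mask over the 40 input features.
fold_a = pipe.fit(X[:150], y[:150]).named_steps['select'].get_support()
fold_b = pipe.fit(X[150:], y[150:]).named_steps['select'].get_support()

print(np.where(fold_a)[0], np.where(fold_b)[0])  # may differ between folds
```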
```diff
 predict_df = pd.DataFrame.from_items([
     ('sample_id', X_sub.index),
     ('testing', X_sub.index.isin(X_test.index).astype(int)),
     ('status', y_sub),
-    ('decision_function', pipeline.decision_function(X_sub)),
-    ('probability', pipeline.predict_proba(X_sub)[:, 1]),
+    ('decision_function', cv_pipeline.decision_function(X_sub)),
```
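An aside on the construction above (not part of the PR): pandas later deprecated and removed `DataFrame.from_items`. Since Python 3.7 a plain dict preserves insertion order, so the equivalent modern construction is simply the `DataFrame` constructor; values here are placeholders:

```python
# Ordered-column DataFrame construction without the removed from_items.
import pandas as pd

predict_df = pd.DataFrame({
    'sample_id': ['a', 'b', 'c'],  # placeholder values
    'testing': [0, 1, 0],
    'status': [1, 0, 1],
})
print(list(predict_df.columns))
```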
The `cv_pipeline` object is refit with the best hyperparameters found by the grid search?
Yeah, `cv_pipeline` is a GridSearchCV object, which has `refit=True` by default, specifying "Refit the best estimator with the entire dataset".
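A small sketch of that refit behavior (data and names illustrative): after `fit`, the GridSearchCV object holds a `best_estimator_` refit on all the training data and delegates `predict`/`decision_function` to it, which is why `cv_pipeline` can be used directly for prediction.

```python
# With refit=True (the default), the fitted GridSearchCV delegates
# prediction methods to best_estimator_, refit on the full training data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

pipe = Pipeline([('select', SelectKBest()),
                 ('classify', SGDClassifier(random_state=0))])
cv_pipeline = GridSearchCV(pipe, {'select__k': [5, 10]}, scoring='roc_auc')
cv_pipeline.fit(X, y)

# These calls go through best_estimator_, not a fold-specific model.
scores = cv_pipeline.decision_function(X)
print(cv_pipeline.best_params_, scores.shape)
```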
good to know! Thanks
Pull request looks good; just some simple comments/talking points.
Builds on top of #53, since sklearn 0.18.0 was desired. Will rebase on master once #53 is merged. Until then, the first four commits should be considered extrinsic to this pull request.