
Use grid_search in notebook and add visualization #18

Merged: 1 commit into cognoma:master from example-notebook, Aug 2, 2016

Conversation

@dhimmel (Member) commented Jul 28, 2016

Addresses issues with example notebook brought up at July 26 meetup:

  1. Standardize training and testing separately
  2. Use AUROC on continuous rather than binary predictions
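
As a rough sketch of points 1 and 2 (assuming scikit-learn's StandardScaler, a fitted classifier clf exposing decision_function, and the X_train/X_test/y_test split from train_test_split; exact variable names in the notebook may differ):

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Fit the scaler on the training data only, then apply the same transformation to the testing data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compute AUROC from continuous decision values rather than binarized class predictions
y_score = clf.decision_function(X_test_scaled)
auroc = roc_auc_score(y_test, y_score)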

Clean up variable names. Simplify to testing/training terminology. No more "hold out".

Use sklearn.grid_search.GridSearchCV to optimize hyperparameters. Expand the ranges of l1_ratio and alpha. Specify random_state for the grid search, which should avoid having to set the seed manually with the random module. Grid search should also enable a more modular architecture, allowing different algorithms to be swapped in as long as their param_grid is defined.
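
For reference, a minimal sketch of that setup, assuming an elastic-net SGDClassifier as the estimator (the estimator choice, grid values, and scoring below are illustrative and may differ from the notebook):

from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import SGDClassifier

param_grid = {
    'alpha': [10 ** x for x in range(-3, 1)],  # regularization strength
    'l1_ratio': [0, 0.15, 0.5, 0.85, 1],       # elastic net mixing parameter
}
estimator = SGDClassifier(loss='log', penalty='elasticnet', random_state=0)
cv_search = GridSearchCV(estimator, param_grid, scoring='roc_auc', n_jobs=-1)  # n_jobs parallelizes via joblib
cv_search.fit(X_train, y_train)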

Add exploratory analysis of predictions.

Add parallel processing using joblib to speed up cross validation.

Remove median absolute deviation feature selection. This step had to be removed or modified because it used testing data for feature selection.

url = 'https://ndownloader.figshare.com/files/5514386'
if not os.path.exists('data/expression.tsv.bz2'):
    urllib.request.urlretrieve(url, os.path.join('data', 'expression.tsv.bz2'))
get_ipython().run_cell_magic('time', '', "path = os.path.join('data', 'expression.tsv.bz2')\nX = pd.read_table(path, index_col=0)")
A member commented on the lines above:

What's up with these run_cell_magic things?

@dhimmel (Member, Author) replied Jul 28, 2016:

See Cell 7. I used the time magic because I thought it would be helpful to report runtime. The downside is that it doesn't convert nicely to scripts.

FYI, here's the code of cell 7 that created the line you commented on:

%%time
path = os.path.join('data', 'expression.tsv.bz2')
X = pd.read_table(path, index_col=0)

@dhimmel (Member, Author) commented Jul 28, 2016

Suggesting @RenasonceGent as one reviewer. @RenasonceGent you should see if you understand the code as the changes address some of the modularity goals that we have in mind (#12 (comment)).

@dhimmel force-pushed the example-notebook branch from 9557e47 to 84a3271 on July 28, 2016 00:56
@cgreene (Member) commented Jul 28, 2016

"Remove median absolute deviation feature selection. This step had to be removed or modified because it used testing data for feature selection."

Was the concern here overfitting, or something else? It should not be an issue for overfitting, since the feature selection is independent of whatever the end goal of prediction is.

@dhimmel (Member, Author) commented Jul 28, 2016

@cgreene, I agree that overfitting is only an issue when y_test is included in any step other than testing. Since the MAD feature selection @gwaygenomics implemented uses only X (X_train + X_test), there shouldn't be overfitting.

However, @gwaygenomics mentioned that at the meetup someone suggested not using X_test before testing. This would more realistically estimate performance on new samples. I interpreted this to mean that all feature selection should only use X_train and feature transformation should be independently performed for X_test. I'm not sure this is the best approach, any thoughts?
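
For concreteness, a train-only version of the MAD selection could look roughly like this (the samples-by-genes DataFrame layout and the cutoff of 500 genes are my assumptions):

# Median absolute deviation per gene, computed from training samples only
mad = (X_train - X_train.median()).abs().median()
mad_genes = mad.nlargest(500).index  # keep the most variable genes
X_train = X_train[mad_genes]
X_test = X_test[mad_genes]           # apply the same gene set to the testing data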

@htcai (Member) commented Jul 29, 2016

@dhimmel maybe we can compare the test results of the two different approaches, using X_test before testing vs. not using, and see whether the former is likely to lead to over-fitting.

@gwaybio (Member) commented Jul 29, 2016

@dhimmel - it was @yl565 who suggested not z-scoring the holdout set with the training/testing set, which is the right thing to do and something I saw you amend in this PR.

Simplify to testing/training terminology. No more "hold out".

I like using hold_out terminology here (and in general), especially when cross validation is being used to select parameters. The CV folds are training and testing sets and are built into GridSearchCV.

Remove median absolute deviation feature selection. This step had to be removed or modified because it used testing data for feature selection.

I am ok with removing this step here, but it's important to keep in mind that pancan classifier performance does not scale linearly with the number of genes included. Performance generally plateaus surprisingly early in the number of MAD genes included, which also keeps the algorithms speedy. Here is a plot describing this phenomenon with RAS predictions.

[Plot: prediction accuracy versus number of genes, all RAS cross-validation results]

On another note - I really like the plots and analysis you added to the notebook! I particularly like the 'potential hidden responders' table. What's really nice about these samples is that they are really well characterized and several resources exist to visualize what's going on in them. For example TCGA-E2-A1LI-01 is a breast tumor and can be visualized really nicely in the COSMIC Browser. I was looking for other potential reasons why this sample may "look like" a TP53 mutant based on gene expression signatures (like copy number loss, structural variants, etc.) but it does not look like there is anything obvious popping out to me. Cool results!

@dhimmel (Member, Author) commented Jul 29, 2016

Terminology

I like using hold_out terminology here (and in general) especially when cross validation is being used to select parameters. The CV folds are training and testing sets and are built in to GridSearchCV.

@gwaygenomics, I agree that the terminology for these concepts is muddled, and hold_out has an intuitive meaning. However, the best online advice I could find on the matter (1 and 2) seems to support the following terminology:

  1. training: comparable to the training fold in cross-validation
  2. validation: comparable to the evaluation fold in cross-validation
  3. testing: comparable to the hold out set.

Since a good implementation of cross-validation makes it so you never actually have to touch the validation data (2 above), I thought it would be simplest to just use training/testing terminology. Let me know if you still disagree -- I'd like to find out the optimal terminology.

@gwaybio (Member) commented Jul 29, 2016

@dhimmel the terminology is often confusing and I definitely agree that we need to keep it consistent!

One key difference is that in the biological sciences, as compared to AI (as in those links you sent over), a validation set usually means something slightly different. Using this classifier as an example, we perform the following scenario:

  1. "hold out" - this is split out from the original data at the very beginning and not touched until hyperparamaters are selected to make a good estimate of how the classifier will perform with data it has never seen before.
    1. A contentious point that is also domain specific - do we combine the hold_out set with the cross_validation/ training/testing set to build the final classifier? In most cases when we use a gene mutation status as our Y matrix, we will not have the luxury of defining a hold_out set because of very low positive samples. I've adopted a different strategy here that I've been calling bootstrap holdout but it can get pretty computationally intensive and then isn't a true holdout.
  2. "cross validation" - usually k-fold cross validation where in each iteration there are k-1 folds used as training and the k-i fold is used as testing (i'm not sure what you mean by "a good implementation of cross-validation makes it so that you never actually have to touch the validation data (2 above)"?)
  3. "validation" - apply the optimal classifier on a completely different cancer dataset. In cognoma's case, a user may want to apply a cognoma derived classifier to their own data to "validate".

@dhimmel (Member, Author) commented Aug 1, 2016

So going through the discussion so far, here are the remaining unresolved issues:

  1. MAD feature selection, which now has a dedicated issue: Median absolute deviation feature selection #22
  2. feature transformation/selection on X versus X_train, which now has a dedicated issue: Should testing data be used for unsupervised feature transformation or selection #23
  3. terminology for dataset partitions.

Items 1 & 2 are not intrinsic to this pull request -- we can address them in future pull requests.

For 3, we perhaps should settle on a terminology for the example notebook. I used:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

I believe @gwaygenomics advocates for (Y/N?):

X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.1, random_state=0)

@cgreene, what's your opinion?

@yl565 (Contributor) commented Aug 1, 2016

About using X_test to improve the classifier: it has sometimes been done in the field of EEG classification, using adaptive/semi-supervised algorithms to improve the classifier when the training sample size is small, e.g. combining COV(X_test) and COV(X_train) in real time to update an LDA whenever a new instance of x_test becomes available. Here are some examples:

I think using X instead of X_train for feature transformation/selection can be considered as a special case of semi-supervised learning.
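
For concreteness, one simple flavor of that idea -- not necessarily what those papers implement -- is to pool the training covariance with the covariance of the test instances seen so far:

import numpy as np

def pooled_covariance(X_train, X_test_seen):
    # Weighted combination of the training covariance and the covariance of observed test instances
    n_train, n_test = len(X_train), len(X_test_seen)
    cov_train = np.cov(X_train, rowvar=False)
    cov_test = np.cov(X_test_seen, rowvar=False)
    return (n_train * cov_train + n_test * cov_test) / (n_train + n_test)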


Note by @dhimmel: I modified this comment to use DOI links for persistence and unique identification.

@gwaybio (Member) commented Aug 1, 2016

I believe @gwaygenomics advocates for (Y/N?):

Yep! That is how I think about Train/Test/Holdout for cross validation and hyperparameter selection.

If we decide that a simpler Train/Test structure is preferred, we should use the term "evaluation set" rather than "validation set" for the cross-validation folds. Using "validation" here is not how biologists would view "validation".

@RenasonceGent commented:
@dhimmel Everything looks good and seems to run fine. Just to check, how many plots should I see in the output? It took a few days to get all of the packages working on my system, and this is my first time using IPython.

Also, I'm still not sure I understand what you want. I've been writing a class with each section of the process broken up as functions. I thought it was mentioned before that the idea is to receive a JSON from one of the other groups that will tell us what to do. I'm leaving the parsing for later, but it is set up with that expectation.

Do we have details on what we will receive from the data group? Right now I'm only incorporating grid search. I assume they will tell us the location of the data, which classifier algorithm to use, the parameters that go with that classifier, which metrics to use, and what type of plots to do.
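
For discussion, here is the kind of skeleton I have in mind; the class name, method names, and JSON fields below are placeholders rather than an agreed-upon interface:

import json

class CognomaClassifier:
    # Hypothetical wrapper where each section of the process is a method

    def __init__(self, config_json):
        config = json.loads(config_json)        # JSON sent by the other groups
        self.data_path = config['data_path']    # location of the data
        self.algorithm = config['algorithm']    # which classifier algorithm to use
        self.param_grid = config['param_grid']  # parameters for that classifier
        self.metrics = config['metrics']        # which metrics to compute
        self.plots = config['plots']            # which plots to produce

    def load_data(self): ...
    def fit(self): ...        # e.g. grid search over self.param_grid
    def evaluate(self): ...
    def visualize(self): ...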

@dhimmel (Member, Author) commented Aug 2, 2016

@RenasonceGent glad you got things running. You should see four plots, like you see here. Regarding installation, #15 should make it considerably easier.

Right now I'm only incorporating grid search.

I think that's a safe assumption. Grid search is really versatile; even with just one hyperparameter combination, it's an easy way to perform cross validation.

Do we have details on what we will receive from the data group?

Let's assume right now that we get a list of sample_ids and gene_ids for subsetting the feature matrix, as well as a mutation status vector for the outcome. We will also get an algorithm. For each algorithm we should have a default hyperparameter grid, since users aren't going to want to deal with this. The goal for the next week should be to have individuals each pick an algorithm, determine whether it's appropriate for our application, and identify a default hyperparameter grid.
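
Concretely, I'm imagining something along these lines (every name below is a placeholder; the actual interface with the data group is still to be defined):

# Subset the feature matrix and outcome using what the data group sends
X_subset = X.loc[sample_ids, gene_ids]
y_subset = y.loc[sample_ids]  # mutation status vector

# Default hyperparameter grids keyed by algorithm, so users never have to specify them
default_param_grids = {
    'sgd-elasticnet': {
        'alpha': [10 ** x for x in range(-3, 1)],
        'l1_ratio': [0, 0.15, 0.5, 0.85, 1],
    },
    # each newly vetted algorithm adds its own entry here
}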

I've been writing a class with each section of the process broken up as functions.

This sounds useful. Make sure not to duplicate functionality with the sklearn API. For next week's task, I think people can just modify 1.TCGA-MLexample.ipynb to swap in their algorithm. What do you think?

@dhimmel (Member, Author) commented Aug 2, 2016

I'm merging despite there being some unresolved issues. Let's move discussion elsewhere and submit additional pull requests to modify 1.TCGA-MLexample.ipynb in an incremental fashion. We need to get these updates in for next week's activities.

@dhimmel merged commit ae27311 into cognoma:master on Aug 2, 2016
@dhimmel deleted the example-notebook branch on August 8, 2016