Use grid_search in notebook and add visualization #18
Conversation
```python
url = 'https://ndownloader.figshare.com/files/5514386'
if not os.path.exists('data/expression.tsv.bz2'):
    urllib.request.urlretrieve(url, os.path.join('data', 'expression.tsv.bz2'))
get_ipython().run_cell_magic('time', '', "path = os.path.join('data', 'expression.tsv.bz2')\nX = pd.read_table(path, index_col=0)")
```
What's up with these run_cell_magic things?
See Cell 7. I used the time magic because I thought it would be helpful to report runtime. The downside is that it doesn't convert nicely to scripts.
FYI, here's the code of cell 7 that created the line you commented on:

```python
%%time
path = os.path.join('data', 'expression.tsv.bz2')
X = pd.read_table(path, index_col=0)
```
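For what it's worth, a script-friendly sketch of the same cell without the magic (plain `time` calls replace `%%time`; this is illustrative, not the notebook's code):

```python
import os
import time

import pandas as pd

# Time the read explicitly so the code also runs outside IPython.
start = time.perf_counter()
path = os.path.join('data', 'expression.tsv.bz2')
X = pd.read_table(path, index_col=0)
print('Read expression matrix in {:.1f} seconds'.format(time.perf_counter() - start))
```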
Suggesting @RenasonceGent as one reviewer. @RenasonceGent you should see if you understand the code, as the changes address some of the modularity goals that we have in mind (#12 (comment)).
Addresses issues with the example notebook brought up at the July 26 meetup:

1. Standardize training and testing data separately.
2. Use AUROC on continuous rather than binary predictions.

Clean up variable names. Simplify to testing/training terminology; no more "hold out". Use `sklearn.grid_search.GridSearchCV` to optimize hyperparameters. Expand the ranges of `l1_ratio` and `alpha`. Specify `random_state` in GridSearchCV, which should prevent having to set the seed manually using the `random` module. Grid search should enable a more modular architecture, allowing different algorithms to be swapped in as long as their `param_grid` is defined. Add exploratory analysis of predictions. Add parallel processing using joblib to speed up cross-validation. Remove median absolute deviation feature selection. This step had to be removed or modified because it used testing data for feature selection.

Force-pushed from 9557e47 to 84a3271.
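As a rough sketch of what this setup could look like (the estimator, grid values, and variable names here are illustrative rather than the notebook's exact code, and `X_train`/`y_train` are assumed from earlier cells):

```python
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions
from sklearn.linear_model import SGDClassifier

# Elastic net logistic regression; setting random_state on the estimator
# avoids seeding the random module manually.
clf = SGDClassifier(loss='log', penalty='elasticnet', random_state=0)  # loss='log_loss' in newer scikit-learn

# Swapping in a different algorithm only requires a new estimator
# with a matching param_grid.
param_grid = {
    'alpha': [10 ** x for x in range(-3, 2)],
    'l1_ratio': [0.0, 0.15, 0.5, 1.0],
}

# scoring='roc_auc' evaluates continuous predictions; n_jobs=-1 uses
# joblib to parallelize across the cross-validation folds.
cv_search = GridSearchCV(clf, param_grid, scoring='roc_auc', n_jobs=-1)
cv_search.fit(X_train, y_train)
```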
"Remove median absolute deviation feature selection. This step had to be removed or modified because it used testing data for feature selection." Was the concern around this overfitting or something else? It should not be an issue for overfitting since it is independent of whatever the end goal of prediction is. |
@cgreene, I agree that overfitting is only an issue when `y_test` is used. However, @gwaygenomics mentioned that at the meetup someone suggested not using `X_test` for feature selection.
@dhimmel maybe we can compare the test results of the two different approaches, using X_test before testing vs. not using it, and see whether the former is likely to lead to overfitting.
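A sketch of that comparison, assuming the notebook's `X`, `X_train`, and `X_test` (StandardScaler stands in for whatever transformation is under discussion):

```python
from sklearn.preprocessing import StandardScaler

# Approach A: fit the scaler on training data only -- no information
# from X_test leaks into the transformation.
scaler_train = StandardScaler().fit(X_train)
X_train_a = scaler_train.transform(X_train)
X_test_a = scaler_train.transform(X_test)

# Approach B: fit the scaler on the full matrix, X_test included.
scaler_full = StandardScaler().fit(X)
X_train_b = scaler_full.transform(X_train)
X_test_b = scaler_full.transform(X_test)

# Training the same model on each version and comparing test AUROC
# would show whether approach B inflates performance.
```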
@dhimmel - it was @yl565 who suggested not z-scoring the training and testing data together.

I like using MAD to select genes. I am ok with removing this step here, but it's important to keep in mind that pancan classifier performance does not scale linearly with the number of genes included. Performance generally plateaus surprisingly early in the number of MAD genes included, which also makes the algorithms speedy. Here is a plot describing this phenomenon with RAS predictions.

On another note - I really like the plots and analysis you added to the notebook! I particularly like the 'potential hidden responders' table. What's really nice about these samples is that they are really well characterized, and several resources exist to visualize what's going on in them.
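For reference, a training-only version of the removed step might look like this sketch (`X_train` and `X_test` as pandas DataFrames of samples by genes; the cutoff of 500 genes is arbitrary):

```python
# Median absolute deviation per gene, computed on training data only,
# so feature selection never touches X_test.
mad = (X_train - X_train.median()).abs().median()

# Keep the highest-MAD genes in both partitions.
top_genes = mad.nlargest(500).index
X_train_mad = X_train[top_genes]
X_test_mad = X_test[top_genes]
```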
Terminology
@gwaygenomics, I agree that the terminology for these concepts is muddled. And hold_out has an intuitive meaning. However, the best online advice I could find on the matter (1 and 2) seems to support the following terminology:

- training set: the data used to fit the models
- validation set: the data used to select hyperparameters
- testing set: the data used to assess the final model's performance
Since a good implementation of cross-validation makes it so you never actually have to touch the validation data (2 above), I thought it would be simplest to just use training/testing terminology. Let me know if you still disagree -- I'd like to find out the optimal terminology.
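In other words, under training/testing terminology only two partitions are ever named; the validation folds live inside the grid search. A sketch, with `test_size` chosen arbitrarily:

```python
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

# Only training and testing sets are named explicitly; GridSearchCV
# carves validation folds out of X_train internally.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y)
```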
@dhimmel the terminology is often confusing and I definitely agree that we need to keep it consistent! One key difference in biological sciences as compared to AI (as in those links you sent over) is that a "validation set" typically refers to an independent dataset used to confirm a finding, not a partition used to select hyperparameters.
So going through the discussion so far, here are the unresolved issues:

1. Whether `X_test` (but never `y_test`) can be used for unsupervised steps such as standardization and feature selection.
2. Whether to restore MAD-based feature selection.
3. Which terminology to use for the data partitions.
Items 1 & 2 are not intrinsic to this pull request -- we can address them in future pull requests. For 3, we perhaps should settle on a terminology for the example notebook. I used:
I believe @gwaygenomics advocates for (Y/N?):
@cgreene, what's your opinion? |
About using X_test to improve the classifier: in the field of EEG classification, this has sometimes been done using adaptive/semi-supervised algorithms, especially when the training sample size is small -- e.g. combining COV(X_test) and COV(X_train) in real time to update an LDA whenever a new instance of x_test becomes available. Here are some examples:

I think using X instead of X_train for feature transformation/selection can be considered a special case of semi-supervised learning.

Note by @dhimmel: I modified this comment to use DOI links for persistence and unique identification.
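As a rough illustration of the covariance-combination idea (the weighting here is hypothetical; the cited papers define the actual update rules):

```python
import numpy as np

def combined_covariance(X_train, X_test_seen, weight=0.5):
    """Blend the training covariance with the covariance of the
    unlabeled test instances observed so far (hypothetical weighting)."""
    cov_train = np.cov(X_train, rowvar=False)  # columns are features
    cov_test = np.cov(X_test_seen, rowvar=False)
    return (1 - weight) * cov_train + weight * cov_test
```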
Yep! That is how I think about Train/Test/Holdout for cross-validation and hyperparameter selection. If we decide a simpler Train/Test structure is preferred, we should use it consistently.
@dhimmel Everything looks good and seems to run fine. Just to check, how many plots should I see on the output? It took a few days to get all of the packages working on my system, and this is my first time using IPython. Also, I'm still not sure I understand what you want. I've been writing a class with each section of the process broken up as functions. I thought it was mentioned before that the idea is to receive a JSON from one of the other groups that will tell us what to do. I'm leaving the parsing for later, but it is set up with that expectation. Do we have details on what we will receive from the data group? Right now I'm only incorporating grid search. I assume they will tell us the location of the data, which classifier algorithm to use, the parameters that go with that classifier, which metrics to use, and what type of plots to do. |
@RenasonceGent glad you got things running. You should see four plots, like you see here. Regarding installation, #15 should make it considerably easier.
I think that's a safe assumption. Grid search is really versatile; even with just one hyperparameter combination, it is an easy way to perform cross-validation.
Let's assume right now that we get a list of sample_ids and gene_ids for subsetting the feature matrix, as well as a mutation status vector for the outcome. We will also get an algorithm. For each algorithm we should have a default hyperparameter grid, since users aren't going to want to deal with this. The goal for the next week should be to have individuals each pick an algorithm, identify whether it's appropriate for our application, and identify a default hyperparameter grid.
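A hypothetical sketch of that interface (names like `DEFAULT_GRIDS` and `subset_data` are placeholders, not an agreed API):

```python
from sklearn.linear_model import SGDClassifier

# One default hyperparameter grid per supported algorithm, so users
# never have to specify grids themselves (entries here are illustrative).
DEFAULT_GRIDS = {
    'elastic-net-logistic': (
        SGDClassifier(loss='log', penalty='elasticnet', random_state=0),
        {'alpha': [10 ** x for x in range(-3, 2)], 'l1_ratio': [0.0, 0.5, 1.0]},
    ),
    # Contributors add one entry per algorithm they evaluate.
}

def subset_data(X, y, sample_ids, gene_ids):
    """Subset the feature matrix and mutation status vector to the
    samples and genes named in the incoming request."""
    return X.loc[sample_ids, gene_ids], y.loc[sample_ids]
```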
This sounds useful. Make sure not to duplicate functionality with the sklearn API. For the task for the next week, I think people can just modify the example notebook.
I'm merging despite there being some unresolved issues. Let's move discussion elsewhere and submit additional pull requests to modify the notebook.