Linear Discriminant Analysis #49
Conversation
Very informative analysis - I just had one question. Also, I will wait on merging until you decide whether you want to include last night's analysis of different num_features_kept values.
    ('decision_function', pipeline.decision_function(X)),
    ('probability', pipeline.predict_proba(X)[:, 1]),
])
predict_df['probability_str'] = predict_df['probability'].apply('{:.1%}'.format)
Do you get an error or warning with this assignment? If you do, you could use
predict_df = predict_df.assign(probability_str=predict_df['probability'].apply('{:.1%}'.format))
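For reference, here is a minimal, self-contained sketch of the two assignment styles being compared; the toy predict_df below is only a stand-in for the real frame built in the diff above.

import pandas as pd

# Toy stand-in for the predict_df constructed in the notebook.
predict_df = pd.DataFrame({'probability': [0.123, 0.987]})

# Direct column assignment: fine on a frame you own outright, but it can emit
# a SettingWithCopyWarning when predict_df is a slice/view of another DataFrame.
predict_df['probability_str'] = predict_df['probability'].apply('{:.1%}'.format)

# .assign() returns a new DataFrame and avoids the chained-assignment ambiguity.
predict_df = predict_df.assign(
    probability_str=predict_df['probability'].apply('{:.1%}'.format))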
Thanks for your questions! Firstly, changing n_feature_kept to 500 or 100 does not reduce overfitting, as I found last night. I have also tried searching over larger values of alpha [0.01, 0.1, 1, 10, 100, 1000], but the larger values led to much worse performance in cross-validation. Therefore, the optimal alpha remains 0.01 and there is no noticeable change in the testing score. I am still looking for solutions.
By the way, there is no warning or error from the predict_df assignment.
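As a rough illustration of how such an alpha search can be wired up: the sketch below uses the current sklearn.model_selection import (the notebook uses the older grid_search module), toy data, and default estimator settings, so it is not the notebook's exact code.

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-ins for the expression features and mutation-status labels.
X = np.random.rand(200, 50)
y = np.array([0, 1] * 100)

# The alpha grid discussed above; larger values regularize more strongly.
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100, 1000]}

clf = SGDClassifier(random_state=0)  # default hinge loss still exposes decision_function
clf_grid = GridSearchCV(estimator=clf, param_grid=param_grid,
                        n_jobs=-1, scoring='roc_auc')
clf_grid.fit(X, y)
print(clf_grid.best_params_)  # reports which alpha wins in cross-validation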
download-data.ipynb duplicates functionality with 1.download.ipynb. Also, these algorithm pull requests shouldn't really be adding notebooks to the root directory.
What is the difference between feature-transformation.ipynb in root and algorithms/algorithms/SGDClassifier-LDA-htcai.ipynb?
Sorry for the unknown files. I am not sure how they were produced, but I have reset my local repo to the earlier commit. I am trying to see whether I can reduce the over-fitting of LDA. The pull request will be updated soon.
Sorry for the delay. I have searched for solutions and experimented with different strategies. In the end, it turns out that applying PCA before LDA solves the problem of overfitting, following this suggestion. Specifically, PCA maps the initial data onto a 300-dimensional space and then LDA reduces the dimensionality to 10. Compared with the approach of using only PCA to produce 2000 features, the LDA-PCA hybrid method achieves a higher testing AUROC (93.8% vs. 93.1%). In addition, the training is also faster (total time 10min 29s vs. 17min 50s). It might be interesting to perform a grid search over the two parameters for PCA and LDA (while keeping the hyper-parameters of the SGDClassifier constant), but the training would definitely take much more time.
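A rough sketch of the PCA-then-LDA arrangement described above follows; the scaler, estimator settings, and variable names are assumptions rather than the notebook's exact code. Note that scikit-learn caps LDA's n_components at n_classes - 1, so n_components=10 presumes a label with more than ten classes.

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# PCA first compresses the expression matrix to 300 components, then LDA
# projects those onto (at most) 10 discriminant axes, and the reduced
# features feed the SGDClassifier.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=300, random_state=0),
    LinearDiscriminantAnalysis(n_components=10),
    SGDClassifier(random_state=0),
)
# pipeline.fit(X_train, y_train)
# roc_auc_score(y_test, pipeline.decision_function(X_test))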
Very nice, can you export?
# Export notebook to a .py script for diff viewing
jupyter nbconvert --to=script --FilesWriter.build_directory=scripts SGDClassifier-LDA-htcai.ipynb
@htcai here's one question. It looks like
Hi Daniel, thank you for reminding me to generate the .py file! It has been added to the scripts directory. In addition, I had the same question regarding incorporating
# Suppress joblib warning. See https://github.com/scikit-learn/scikit-learn/issues/6370
warnings.filterwarnings('ignore', message='Changing the shape of non-C contiguous array')
clf_grid = grid_search.GridSearchCV(estimator=clf, param_grid=param_grid, n_jobs=-1, scoring='roc_auc')
pipeline = make_pipeline( |
It turns out our current pipeline setup is flawed (see #54), which could make your cross-validated accuracy appear higher. Testing should still be accurate. No need to fix it in this pull request. Will merge.
Thanks for merging the pull request! I am also curious about the (minor) flaw in the pipeline. In addition, I am wondering whether we should specify the
Yes, we should stratify by status in
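For illustration, here is a minimal stratified split in scikit-learn; the array names, class ratio, and split size are made up, since the real split lives elsewhere in the repo.

import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the expression matrix and the mutation-status labels.
X = np.random.rand(100, 5)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio identical in the train and test
# splits, which is what "stratify by status" refers to above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)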
I replaced feature selection with Linear Discriminant Analysis. The feature data was compressed to 2000 dimensions. Also, I removed the plotting of coefficients, since the resulting features do not have the intuitive interpretations that the initial features do.
This is my first time implementing LDA in a pipeline. I simply added an instance of LDA to the pipeline, and I assume the trained pipeline will transform the testing data when it is used to make predictions.
I tracked and printed the max memory during training on my MacBook, which seems to compress memory. As a consequence, the printed max memory is much smaller than what I saw in Activity Monitor.
There is significant over-fitting: the training and testing scores are 99.4% vs. 90.4%.