
Linear Discriminant Analysis #49

Merged
merged 2 commits into from Oct 4, 2016

Conversation

@htcai (Member) commented Sep 19, 2016

I replaced feature selection with Linear Discriminant Analysis. The feature data was compressed to 2000 dimensions. I also removed the plotting of coefficients, since the transformed features do not have an intuitive interpretation the way the initial features do.

This is my first time implementing LDA in a pipeline; I simply added an instance of LDA to the pipeline. I assume that the trained pipeline will transform the testing data when it is used to make predictions.

I tracked and printed the max memory during training on my MacBook, which appears to use memory compression. As a consequence, the printed max memory is much smaller than what I saw in Activity Monitor.

There is significant over-fitting: the training and testing scores are 99.4% vs. 90.4%.
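
A minimal sketch of one way the peak memory could be tracked (the notebook's actual approach may differ; this uses only the standard-library resource module):

import resource

# Peak resident set size of this process so far. On macOS ru_maxrss is reported
# in bytes (on Linux it is kilobytes), and because macOS compresses memory this
# figure can be much smaller than what Activity Monitor shows.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('Max memory used: {:.2f} GB'.format(peak / 1e9))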

@htcai changed the title from "tried LDA" to "Linear Discriminant Analysis" on Sep 19, 2016
@gwaybio (Member) left a comment

Very informative analysis; I just had one question. Also, I will wait to merge until you decide whether you want to include last night's analysis of different num_features_kept values.

('decision_function', pipeline.decision_function(X)),
('probability', pipeline.predict_proba(X)[:, 1]),
])
predict_df['probability_str'] = predict_df['probability'].apply('{:.1%}'.format)
A member commented:

Do you get an error or warning with this assignment? If you do, you could use:

predict_df = predict_df.assign(probability_str=predict_df['probability'].apply('{:.1%}'.format))

htcai (Member Author) replied:

Thanks for your questions! Firstly, changing n_feature_kept to 500 or 100 does not reduce the overfitting, as I found last night. I have also tried searching over larger values of alpha [0.01, 0.1, 1, 10, 100, 1000], but the larger values led to much worse performance in cross-validation. Therefore, the optimal alpha is still 0.01 and there is no noticeable change in the testing score. I am looking for solutions.

By the way, there is no warning or error from the predict_df assignment.
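
For reference, the wider alpha search described above would look roughly like this with the grid_search API used elsewhere in this pull request (the classifier settings here are placeholders, not the notebook's exact configuration):

from sklearn import grid_search
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)  # placeholder settings

# Wider regularization grid; in practice alpha=0.01 remained optimal and the
# larger values performed much worse in cross-validation.
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100, 1000]}

clf_grid = grid_search.GridSearchCV(estimator=clf, param_grid=param_grid,
                                    n_jobs=-1, scoring='roc_auc')
# clf_grid.fit(X_train, y_train)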

@dhimmel (Member) left a comment

download-data.ipynb duplicates the functionality of 1.download.ipynb. Also, these algorithm pull requests shouldn't really be adding notebooks to the root directory.

What is the difference between feature-transformation.ipynb in the root directory and algorithms/algorithms/SGDClassifier-LDA-htcai.ipynb?

@htcai (Member Author) commented Sep 22, 2016

Sorry for the unknown files. I am not sure how they were produced, but I have reset my local repo to the earlier commit. I am trying to see whether I can reduce the over-fitting of LDA. The pull request will be updated soon.

@htcai (Member Author) commented Sep 30, 2016

Sorry for the delay. I have searched for solutions and experimented with different strategies. It turns out that applying PCA before LDA solves the overfitting problem, following this suggestion.

Specifically, PCA maps the initial data onto a 300-dimensional space and then LDA reduces the dimensionality to 10. Compared with the approach of using only PCA to produce 2000 features, the PCA-LDA hybrid method achieves a higher testing AUROC (93.8% vs. 93.1%). In addition, training is also faster (total time 10min 29s vs. 17min 50s).

It might be interesting to perform a grid search over the two dimensionality parameters for PCA and LDA (while keeping the hyper-parameters of the SGDClassifier constant), but the training would definitely take much more time.
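
A minimal sketch of the PCA-then-LDA arrangement described above (the SGDClassifier settings are placeholders, and note that scikit-learn caps LDA's output dimensionality at n_classes - 1):

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    PCA(n_components=300),          # unsupervised compression to 300 dimensions
    LinearDiscriminantAnalysis(),   # supervised projection; at most n_classes - 1 output dimensions
    SGDClassifier(random_state=0),  # placeholder settings
)
# pipeline.fit(X_train, y_train); pipeline.decision_function(X_test)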

@dhimmel (Member) commented Oct 3, 2016

Very nice! Can you export algorithms/SGDClassifier-LDA-htcai.ipynb to a script (.py) file? From algorithms, run:

# Export notebook to a .py script for diff viewing
jupyter nbconvert --to=script --FilesWriter.build_directory=scripts SGDClassifier-LDA-htcai.ipynb

@dhimmel (Member) commented Oct 3, 2016

@htcai here's one question. It looks like LinearDiscriminantAnalysis in sklearn can be used as a classifier. For example, it has a predict method. However, if I'm not mistaken, in your example it's used only as a transformer, since it's not the final step of the pipeline. Am I thinking about this correctly?
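
For illustration, LinearDiscriminantAnalysis exposes both predict (classifier role) and transform (transformer role); inside a Pipeline, only fit/transform are called on any step that is not the final estimator. A small sketch on toy data (nothing here comes from the notebook):

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data purely to illustrate the two roles LDA can play.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict(X[:5]))           # classifier role: used if LDA were the final pipeline step
print(lda.transform(X[:5]).shape)   # transformer role: used as an intermediate step
                                    # (at most n_classes - 1 = 2 output dimensions here)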

@htcai (Member Author) commented Oct 3, 2016

Hi Daniel, thank you for reminding me to generate the .py file! It has been added to the scripts directory.

In addition, I had the same question about incorporating LinearDiscriminantAnalysis into the pipeline, and I understand the Pipeline documentation the same way you do. To verify that this usage of LinearDiscriminantAnalysis is correct, I created a separate branch in my repo in which I decomposed the pipeline into separate steps. The output is the same as that of using Pipeline, although I omitted the Investigate the predictions section.
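
The linked branch is not reproduced here, but the manual decomposition would look roughly like this, assuming X_train, X_test, and y_train come from the notebook's train/test split and using placeholder classifier settings:

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier

# Fit each stage on the training data only, then apply the fitted transforms
# to the testing data, mirroring what Pipeline does internally.
pca = PCA(n_components=300).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

lda = LinearDiscriminantAnalysis().fit(X_train_pca, y_train)
X_train_lda = lda.transform(X_train_pca)
X_test_lda = lda.transform(X_test_pca)

clf = SGDClassifier(random_state=0).fit(X_train_lda, y_train)
scores = clf.decision_function(X_test_lda)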

# Suppress joblib warning. See https://github.com/scikit-learn/scikit-learn/issues/6370
warnings.filterwarnings('ignore', message='Changing the shape of non-C contiguous array')
clf_grid = grid_search.GridSearchCV(estimator=clf, param_grid=param_grid, n_jobs=-1, scoring='roc_auc')
pipeline = make_pipeline(
A member commented:

It turns out our current pipeline setup is flawed (see #54), which could make your cross-validated accuracy appear higher. Testing should still be accurate. No need to fix this in this pull request. Will merge.

@dhimmel merged commit 7f68db3 into cognoma:master on Oct 4, 2016
@htcai (Member Author) commented Oct 4, 2016

Thanks for merging the pull request! I am also curious about the (minor) flaw in the pipeline.

In addition, I am wondering whether we should specify the stratify parameter in train_test_split, since the proportions of positive and negative samples are very different. With stratified splitting, the frequency of positive samples will be approximately the same in the training and testing data.
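
For illustration, stratified splitting would look roughly like this, assuming X and y hold the features and the status labels used as the outcome (the import path is sklearn.cross_validation in older scikit-learn versions):

from sklearn.model_selection import train_test_split

# Stratifying on the outcome keeps the proportion of positive samples roughly
# equal in the training and testing partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y)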

@dhimmel (Member) commented Oct 7, 2016

> I am wondering whether we should specify the stratify parameter in train_test_split

Yes, we should stratify by status in train_test_split. cognoml will do this: https://github.com/cognoma/machine-learning/pull/51/files#diff-4607d4a968832a9c9b0c00b2624655c0R56
