Classifier #5

Open
math-a3k opened this issue Jan 18, 2021 · 1 comment

Comments

@math-a3k
Owner

This issue is for discussing the Classifier.

Currently, the Classifier is handled at the django-ai level, not in covid-ht. Any discussion here should end in a django-ai action (change, implementation, etc.).

The requirements for the implementation of the Classifier are:

  • Be able to handle categorical data
  • Be able to handle numerical data
  • Be able to deal with missing values (NaNs)

scikit-learn's Histogram-based Gradient Boosting Classification Tree (HistGradientBoostingClassifier) is currently being used.

On the simulated data included, it achieves approximately 90% accuracy (estimated by 10-fold CV) with only 5 variables to take into account (RBC, WBC, PLT, NEUT, LYMPH), while the rest are noisy / non-informative about the class.

Any discussion of improvements to this (or another) classifier should go here.

@math-a3k
Owner Author

Both Support Vector Machines and Neural Networks, in their vanilla versions, handle only numeric data (e.g. 'rbc'), not categorical data (e.g. 'sex'). Although this can be overcome by encoding, those techniques have limitations and are not ideal.
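
To make the encoding workaround concrete, here is a minimal sketch of one-hot encoding in plain Python. The helper name `one_hot` and the sample values are hypothetical, chosen only for illustration.

```python
def one_hot(values):
    """Turn a list of category labels into 0/1 indicator rows."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical sample: one numeric column and one categorical column.
rbc = [4.2, 5.1, 4.6]
sex = ["F", "M", "F"]

categories, encoded = one_hot(sex)

# Numeric-only models would receive the numeric value plus the indicators.
features = [[r] + e for r, e in zip(rbc, encoded)]
print(categories)
print(features)
```

One limitation is visible even in this small case: every category becomes its own column, so a variable with many levels (e.g. fine-grained age groups) widens the input considerably.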

Categorical data is particularly important for this problem, because results vary by sex, age group, etc. Not taking such variables into account should lead to a "bad" (inaccurate) classifier.

Classification Trees (CTs) have "built-in support" for both categorical data and missing data (though this may vary by implementation), which makes them the first choice for the problem (Logistic Regression can't handle missing data; it has to be imputed).
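
The contrast can be sketched in a few lines of plain Python. Both helpers below are hypothetical illustrations, not code from django-ai or scikit-learn: the first shows one common way a tree split can treat a missing value (sending it down a default branch chosen during training), the second shows the mean imputation that a model without NaN support would need beforehand.

```python
import math


def stump_branch(value, threshold, missing_goes_left):
    """Route one value through a single split; NaN follows the default branch."""
    if math.isnan(value):
        return "left" if missing_goes_left else "right"
    return "left" if value <= threshold else "right"


def mean_impute(values):
    """Replace NaNs with the mean of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in values]


wbc = [1.0, math.nan, 3.0]  # hypothetical values with one missing entry

# Tree-style handling: the missing value is routed, not replaced.
branches = [stump_branch(v, threshold=2.0, missing_goes_left=True) for v in wbc]

# Imputation-style handling: the missing value is replaced before fitting.
imputed = mean_impute(wbc)
print(branches, imputed)
```

With imputation the model can no longer tell that the value was missing. With a default branch, the fact that a value is missing can itself carry information.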

They (CTs) also have the advantage of easy interpretation, but this is traded for better accuracy when using Boosting. Although a CT can be 'graphed', it would take a person more time to 'follow the diagram' than to enter the values and let the machine do the classification. Further understanding of the data should be done 'outside' covid-ht, via the CSV data download; the goal of covid-ht is to do the best job at classification.

Being able to handle missing values is also very important, especially for combining data from different sources, where different blood tests may be included. The success of the classifier depends on the amount and quality of the data. Getting quality data may not be that easy: e.g. it needs a specific COVID19 test taken at the same time that the blood is sampled, and sharing it may require patient consent (although hemogram data is easily anonymized, it is still the patient's data and thus requires consideration).
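
A minimal sketch of why combined sources produce missing values, in plain Python. The function `combine_sources`, the two labs, and their values are all made up for this example. Each source reports a different subset of the hemogram fields, so aligning them to one column layout leaves gaps that have to be represented as NaN.

```python
import math


def combine_sources(sources, columns):
    """Stack records from several sources into rows with a shared column order.

    Fields a source did not measure are filled with NaN.
    """
    rows = []
    for records in sources:
        for record in records:
            rows.append([float(record.get(col, math.nan)) for col in columns])
    return rows


# Hypothetical sources reporting different blood tests.
lab_a = [{"rbc": 4.5, "wbc": 6.1, "plt": 250}]
lab_b = [{"rbc": 4.9, "neut": 3.2}]

columns = ["rbc", "wbc", "plt", "neut", "lymph"]
table = combine_sources([lab_a, lab_b], columns)
for row in table:
    print(row)
```

A classifier that accepts NaN can be trained on such a table directly. One that cannot would force either dropping the incomplete rows or imputing the gaps.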
