Classifier #5
Comments
Both Support Vector Machines and Neural Networks, in their vanilla versions, handle only numeric data (e.g. 'rbc'), not categorical data (e.g. 'sex'). Although this can be overcome by encoding, those techniques have limitations and are not ideal. Categorical data is particularly important for this problem because results vary by sex, age group, and so on. Ignoring such variables would likely lead to a "bad" (inaccurate) classifier.

Classification Trees (CTs) have "built-in support" for both categorical data and missing data (though this varies by implementation), which would make them the first choice for the problem. (Logistic regression can't handle missing data; it has to be imputed.) CTs also have the advantage of easy interpretation, but this is traded for better accuracy when using Boosting. Although a CT can be 'graphed', it would take a person more time to 'follow the diagram' than to enter the values and let the machine do the classification. Further understanding of the data should be done 'outside' covid-ht, via the CSV data download; the goal of covid-ht is to do the best job at classification.

Being able to handle missing values should also be very important, especially for combining data from different sources, where different blood tests can be considered.

The success of the classifier depends on the amount and quality of the data. Getting quality data may not be that easy: for example, having a specific COVID19 test done at the same time that the blood is sampled, and sharing the result, may require patient consent (although hemogram data is easily anonymized, it is still the patient's data and thus requires consideration).
This issue is for discussing the Classifier.
Currently, the Classifier is handled at a django-ai level, not at covid-ht. Any discussion here should end in a django-ai action (change, implementation, etc.)
The requirements for the implementation of the Classifier are:
scikit-learn's Histogram-based Gradient Boosting Classification Tree (`HistGradientBoostingClassifier`) is currently being used.
According to the simulated data included, it is able to achieve approximately 90% accuracy (estimated by 10-fold cross-validation) with only 5 informative variables (RBC, WBC, PLT, NEUT, LYMPH), while the rest are noisy / non-informative about the class.
Any discussion of improvements to this (or another) classifier should go here.