Classifier #5

math-a3k · 2021-01-18T20:50:19Z

This issue is for discussing the Classifier.

Currently, the Classifier is handled at a django-ai level, not at covid-ht. Any discussion here should end in a django-ai action (change, implementation, etc.)

The requirements for the implementation of the Classifier are:

Be able to handle categorical data
Be able to handle numerical data
Be able to deal with missing values (NaNs)

scikit-learn's Histogram-Based Gradient Boosting Tree is currently being used.

On the simulated data included, it achieves about 90% accuracy (estimated by 10-fold CV) with only 5 informative variables (RBC, WBC, PLT, NEUT, LYMPH), while the rest are noisy / non-informative about the class.

Any discussion of improvements to this (or another) classifier should go here.

math-a3k · 2021-01-27T04:19:36Z

Both Support Vector Machines and Neural Networks, in their vanilla versions, handle only numeric data (e.g. 'rbc'), not categorical data (e.g. 'sex'). Although this can be overcome by encoding, encoding techniques have limitations and are not ideal.
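To make the encoding workaround concrete, here is a small sketch with invented values (the `rbc` and `sex` arrays are illustrative, not project data). It one-hot encodes the categorical column so that an SVM can consume it:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Invented toy values, for illustration only.
rbc = np.array([[4.1], [5.2], [4.8], [3.9], [5.5], [4.4]])
sex = np.array([["F"], ["M"], ["M"], ["F"], ["M"], ["F"]])
y = np.array([0, 1, 1, 0, 1, 0])

# An SVM cannot take the strings directly, so each category becomes
# its own 0/1 column (categories are sorted, so "F" comes first).
encoder = OneHotEncoder(handle_unknown="ignore")
sex_encoded = encoder.fit_transform(sex).toarray()

# One numeric column plus two indicator columns.
X = np.hstack([rbc, sex_encoded])
svm = SVC().fit(X, y)
```

Every category turns into an extra column, so variables with many levels inflate the feature space, and the model only sees indicator columns rather than one variable.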

Categorical data is particularly important for this problem, because results vary by sex, age group, etc. Not taking such variables into account would likely lead to a "bad" (inaccurate) classifier.

Classification Trees (CTs) have "built-in support" for both categorical data and missing data (though this may vary by implementation), which would make them the first choice for the problem. Logistic regression, by contrast, can't handle missing data; it has to be imputed.

CTs also have the advantage of easy interpretation, but this is traded for better accuracy when using Boosting. Although a CT can be 'graphed', it would take more time for a person to 'follow the diagram' than to enter the values and let the machine do the classification. Further understanding of the data should be done 'outside' covid-ht, via the CSV data download; the goal of covid-ht is to do the best job at classification.

Being able to handle missing values is also very important, especially for combining data from different sources, where different blood tests may have been performed. The success of the classifier depends on the amount and quality of the data. Getting quality data may not be that easy: for example, having a specific COVID-19 test done at the same time that the blood is sampled, and sharing the result, may require patient consent (although hemogram data is easily anonymized, it is still the patient's data and thus requires consideration).

math-a3k mentioned this issue Mar 6, 2021

Upgrade to new django-ai version #15

Closed

math-a3k mentioned this issue Apr 8, 2021

Networking of covid-ht instances #16

Closed
