Problem Statement

The Prostate Cancer Classification task asks for a predictive model that, given the measurements recorded for a case, predicts whether that case is cancerous or not. An effective model can help speed up cancer detection and can provide reasonably accurate preliminary results.

Dataset

The dataset used is the Prostate Cancer Classification dataset from Kaggle. It covers 100 patients and is used here to train the machine learning models and interpret their results. The dataset consists of 100 observations and 10 variables (8 numeric variables, one categorical variable and an ID):

Id
Radius
Texture
Perimeter
Area
Smoothness
Compactness
diagnosis_result
Symmetry
Fractal dimension

The class labels are:

1. M: Malignant. May be cancerous.
2. B: Benign. May not be cancerous.
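
As a minimal sketch of how the variables and class labels above could be prepared for a model, the snippet below parses CSV text, drops the ID, and maps M/B to 1/0. The lowercase column names, the column order, the helper name `load_rows`, and the two sample rows are assumptions made for illustration; they are not taken from the actual Kaggle file.

```python
import csv
import io

# Two invented rows in the variable layout listed above. The values are
# placeholders for illustration, not records from the real dataset.
SAMPLE_CSV = """id,radius,texture,perimeter,area,smoothness,compactness,diagnosis_result,symmetry,fractal_dimension
1,20,15,120,900,0.12,0.25,M,0.20,0.07
2,10,22,80,450,0.09,0.08,B,0.17,0.06
"""

LABEL_TO_INT = {"M": 1, "B": 0}  # 1 = malignant, 0 = benign


def load_rows(text):
    """Parse CSV text into (features, label) pairs, dropping the id column."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        label = LABEL_TO_INT[record.pop("diagnosis_result")]
        record.pop("id")  # an identifier, not a predictive feature
        features = {name: float(value) for name, value in record.items()}
        rows.append((features, label))
    return rows


rows = load_rows(SAMPLE_CSV)
```

Dropping the ID keeps a model from treating an arbitrary row number as a signal, which leaves the 8 numeric variables as features.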

Models Used

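1. K Nearest Neighbours: This model makes no assumption about the shape of the decision boundary and is most useful when cases of the same class lie close together in feature space. It simply finds the "K nearest neighbours" of a new case and uses the most frequent class among them as the final class prediction.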

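2. Support Vector Classifier: This model is very useful when the data is linearly separable, because it looks for the boundary that separates the classes with the widest margin. When the data is not linearly separable, SVMs can use kernels to implicitly project the data into a higher-dimensional space where the classes are easier to separate.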

3. Decision Tree Classifier: This model identifies the most informative attribute at every level and splits on it to build a tree. The final tree can then be read as a chain of nested if-else statements that leads to the final prediction.
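
To make two of the ideas above concrete, here is a small pure-Python sketch of the neighbour vote behind K Nearest Neighbours and of information gain, one common way to score how informative an attribute is for a decision tree split. It is not the code used in this project; the function names and the toy points are invented for illustration, and a library such as scikit-learn would normally be used instead.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(train, feature_index, threshold):
    """Entropy reduction from splitting on feature <= threshold."""
    labels = [label for _, label in train]
    left = [label for x, label in train if x[feature_index] <= threshold]
    right = [label for x, label in train if x[feature_index] > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing, so nothing is gained
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Invented toy points: (radius, texture) -> label. Not real patient data.
train = [
    ((10.0, 12.0), "B"), ((11.0, 14.0), "B"), ((12.0, 13.0), "B"),
    ((20.0, 25.0), "M"), ((21.0, 27.0), "M"), ((22.0, 26.0), "M"),
]

prediction = knn_predict(train, (19.0, 24.0), k=3)
gain = information_gain(train, feature_index=0, threshold=15.0)
```

On this toy data the query point sits among the M cases, and a split on the first feature at 15.0 separates the two classes completely, so it removes all of the label uncertainty. A tree builder would compare such scores across attributes and thresholds and keep the best split at each level.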

Future Work

We can tune the hyperparameters of the models used to find out how much the accuracy can be improved.
We can try using Neural Networks.
We can predict class probabilities (for example with logistic regression) instead of hard class labels. Once we have them, we can set a decision threshold that assigns the classes more appropriately.
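
The thresholding idea in the last item can be sketched as follows. The probabilities and true labels are invented for illustration, and the helper names `classify_with_threshold` and `recall` are hypothetical; in practice the probabilities would come from a fitted model.

```python
def classify_with_threshold(probabilities, threshold=0.5):
    """Label a case "M" when its predicted probability of malignancy reaches the threshold."""
    return ["M" if p >= threshold else "B" for p in probabilities]


def recall(y_true, y_pred, positive="M"):
    """Fraction of truly positive cases that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    found = true_pos + false_neg
    return true_pos / found if found else 0.0


# Invented example: true labels and a model's predicted probability of "M".
y_true = ["M", "M", "M", "B", "B"]
probs = [0.9, 0.6, 0.35, 0.4, 0.1]

default_pred = classify_with_threshold(probs)        # threshold 0.5
lenient_pred = classify_with_threshold(probs, 0.3)   # flags more cases as "M"

default_recall = recall(y_true, default_pred)
lenient_recall = recall(y_true, lenient_pred)
```

Lowering the threshold catches the malignant case that the default threshold misses, at the cost of flagging one benign case. For a screening task, that trade-off may be preferable to missing a cancer.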
