REFERENCES:
- "Induction of Decision Trees" (J.R. Quinlan) - research paper
- Scikit-learn decision tree documentation
- Breast Cancer dataset - UCI Machine Learning Repository
- Medium article on modeling decision trees
- Hands-On Machine Learning - chapters on decision trees and random forests
- Medium article on Seaborn pair plots
Advantages
- Decision trees have logarithmic prediction cost (in the number of training samples), so they are not very computationally intensive, especially on high-feature datasets in comparison with other models.
- White-box model, meaning that we can actually understand and explain how it arrives at a prediction
- Minimal data preparation needed
Disadvantages
- Prone to overfitting; we could prune to see if that helps
- Slightly unstable; small changes in the data can produce a different tree, and there might be other trees that represent the true population better
- Works best on a balanced dataset without "inadequate" attributes, meaning we can't have several examples where the same set of attribute values is indicative of different classes
This is a dataset with 10 features derived from images of breast masses; they describe characteristics of the cell nuclei.
- Each object in the dataset has a class value of 2 or 4: 2 for benign masses and 4 for malignant ones.
- The goal is to correctly classify an object based on its attributes.
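For reference, loading the raw UCI file looks roughly like this. The column names below are labels I'm using for illustration (the raw file has no header row), and it assumes missing values appear as "?":

```python
import pandas as pd

# Illustrative column names for the Breast Cancer Wisconsin (Original) file.
columns = [
    "id", "clump_thickness", "size_uniformity", "shape_uniformity",
    "marginal_adhesion", "epithelial_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]

# The raw file has no header row; "?" marks missing values.
df = pd.read_csv("breast-cancer-wisconsin.data", header=None,
                 names=columns, na_values="?")

print(df["class"].value_counts())  # 2 = benign, 4 = malignant
```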
- We create a custom transformer class using BaseEstimator so it can sit in a scikit-learn pipeline that cleans the data, primarily converting certain values to numeric and dealing with the nulls
- The class does not need any initialization code, and because it is a stateless transformer, we don't need to add any code in the fit function
- We also add a MinMaxScaler to the pipeline, both so it can be reused for other models we might build and to deal with the diagnosis column (which currently holds 2 or 4): after scaling, malignant masses have a value of 1 and benign masses a value of 0
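A minimal sketch of what that pipeline could look like. The class name, the TransformerMixin mixin, and the median fill are my own illustrative choices, not necessarily the exact implementation:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

class CleanBreastCancerData(BaseEstimator, TransformerMixin):
    """Coerce every column to numeric and fill the nulls."""

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn, so fit just returns self.
        return self

    def transform(self, X):
        X = X.copy()
        # Force every column to numeric; unparseable values (e.g. "?") become NaN.
        X = X.apply(pd.to_numeric, errors="coerce")
        # Fill remaining nulls with each column's median (one simple choice).
        return X.fillna(X.median())

# MinMaxScaler rescales every column to [0, 1]; as a side effect the class
# column (2 = benign, 4 = malignant) becomes 0 for benign and 1 for malignant.
pipeline = Pipeline([
    ("clean", CleanBreastCancerData()),
    ("scale", MinMaxScaler()),
])
```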
A pair plot shows the relationships between the size-related attributes. I thought this was important because, from my personal knowledge, size is a common indicator of malignant tumors.
Below is the distribution of (some of!) the features, split by benign and malignant. As you can see, malignant tumors are generally associated with larger values for the given attributes.
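Something along these lines produces the pair plot, assuming the illustrative column names from the loading sketch above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical names for the size-related attributes plus the class label.
size_cols = ["clump_thickness", "size_uniformity", "shape_uniformity", "class"]

# hue colors the points by diagnosis so the benign/malignant split is visible.
sns.pairplot(df[size_cols], hue="class")
plt.show()
```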
Another thing I checked was the "adequacy" of the data. Based on the research paper "Induction of Decision Trees" by J.R. Quinlan, it is ideal for decision trees to be used on datasets where the same set of attributes for an object always results in the same classification. We check this in the code, finding that although there are some duplicates (rows where all attributes are the same), there are no instances where the attributes are the same but the classification is different.
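One way to run that adequacy check with pandas (again reusing the column names assumed in the loading sketch):

```python
# Group rows by the full set of attribute values; if any group contains more
# than one distinct class label, the attributes are "inadequate".
feature_cols = [c for c in df.columns if c not in ("id", "class")]

conflicts = (
    df.groupby(feature_cols, dropna=False)["class"]
      .nunique()
      .loc[lambda n: n > 1]
)
print("duplicate attribute rows:", df.duplicated(subset=feature_cols).sum())
print("same attributes but different class:", len(conflicts))
```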
The model we use is a classification decision tree from scikit-learn, which uses the CART algorithm.
- Meaning that these are binary trees, where each node has at most two children.
- If we built this with an ID3-based model, we might see different results
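A minimal sketch of the baseline fit, reusing df and feature_cols from the sketches above (the depth-9 and 94.16% figures above came from a specific run and split, so this won't reproduce them exactly):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Minimal inline cleaning for the sketch: fill nulls, remap 2/4 to 0/1.
X = df[feature_cols].fillna(df[feature_cols].median())
y = df["class"].map({2: 0, 4: 1})   # 0 = benign, 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

tree = DecisionTreeClassifier()      # scikit-learn's CART implementation
tree.fit(X_train, y_train)

print("depth:", tree.get_depth())
print("test accuracy:", tree.score(X_test, y_test))
```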
Running it initially, we find a depth of 9 and an accuracy of 94.16%. We improve on this by customizing some of the default hyper-parameters in scikit-learn. First we increase the values of min_samples_leaf and min_samples_split, which raises the accuracy to 95.62%. These simply set minimum requirements for when a node may be split and when a leaf may be created (see the sketch below).
Doing this is a form of regularization that reduces overfitting of the model when running it against the test set. In general, increasing the min_* hyperparameters and decreasing the max_* hyperparameters will regularize the model.
Note that these values might change between runs because there is some randomness in how trees are constructed, leading to slightly different results.
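For illustration, the regularized tree might be built like this; the specific min_samples_split and min_samples_leaf values here are placeholders, not necessarily the ones that produced 95.62%:

```python
from sklearn.tree import DecisionTreeClassifier

# Placeholder hyperparameter values for illustration.
reg_tree = DecisionTreeClassifier(min_samples_split=10, min_samples_leaf=5)
reg_tree.fit(X_train, y_train)

print("regularized test accuracy:", reg_tree.score(X_test, y_test))
```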
Finally, we see that the most important feature seems to be size uniformity.
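The importances can be read directly off the fitted tree:

```python
import pandas as pd

# feature_importances_ sums to 1; higher values mean the feature contributed
# more to the impurity reduction across the tree's splits.
importances = pd.Series(reg_tree.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))
```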
We further tune the hyperparameters by optimizing max_depth. We test depths from 4 to 9, since the tree reached a depth of 9 when no limit was set. This is another way of preventing overfitting.
Because results differ from run to run, we train the models 100 times and take the average to get a more realistic accuracy for each hyperparameter set (see the sketch below).
We find that the optimal max_depth is 6, by a small margin! By regularizing and tuning, we were able to improve the overall model accuracy to nearly 95%.
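A sketch of that sweep, assuming the run-to-run variation comes from re-splitting the data each time (the actual notebook may introduce the randomness differently):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

results = {}
for depth in range(4, 10):                 # depths 4 through 9
    scores = []
    for _ in range(100):                   # average over 100 runs per depth
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        model = DecisionTreeClassifier(max_depth=depth,
                                       min_samples_split=10,
                                       min_samples_leaf=5)
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    results[depth] = np.mean(scores)

best_depth = max(results, key=results.get)
print(results, "best max_depth:", best_depth)
```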
We built a random forest from scratch with a few hyperparameters we can use to adjust how the forest is constructed. The main ones to note are feature_split and max_samples (a sketch of this kind of forest follows the list below).
- max_samples controls how many samples are drawn in the bootstrapping step for each tree in the forest
- feature_split defines how many features each tree is randomly assigned; depending on the parameter we use the square root of the feature count, the log, or all of the features.
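The from-scratch implementation itself isn't reproduced here, so below is a minimal sketch of how a forest with these two hyperparameters could be structured. SimpleRandomForest and its attribute names are illustrative, and the majority vote assumes 0/1 labels as produced earlier:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:
    """Minimal sketch: bootstrapped rows plus a random feature subset per tree,
    with a majority vote at prediction time. Not the author's exact code."""

    def __init__(self, n_trees=128, max_samples=1.0, feature_split="sqrt"):
        self.n_trees = n_trees
        self.max_samples = max_samples      # fraction of rows per bootstrap
        self.feature_split = feature_split  # "sqrt", "log", or "all"

    def _n_features(self, total):
        if self.feature_split == "sqrt":
            return max(1, int(np.sqrt(total)))
        if self.feature_split == "log":
            return max(1, int(np.log2(total)))
        return total                        # "all"

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        n_rows, n_cols = X.shape
        n_sample = int(self.max_samples * n_rows)
        k = self._n_features(n_cols)
        self.trees_, self.features_ = [], []
        for _ in range(self.n_trees):
            rows = np.random.choice(n_rows, size=n_sample, replace=True)  # bootstrap
            cols = np.random.choice(n_cols, size=k, replace=False)        # feature subset
            tree = DecisionTreeClassifier()
            tree.fit(X[np.ix_(rows, cols)], y[rows])
            self.trees_.append(tree)
            self.features_.append(cols)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Cache every tree's predictions so they can be inspected later.
        self.tree_predictions_ = np.stack(
            [t.predict(X[:, cols]) for t, cols in zip(self.trees_, self.features_)])
        # Majority vote across trees for each sample (assumes 0/1 labels).
        return (self.tree_predictions_.mean(axis=0) >= 0.5).astype(int)
```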
By building a random forest with 128 trees, we actually see an increase in accuracy on the test data from 95% to nearly 98%!
We can also access items cached by the forest, including each individual tree's predictions for every sample.
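Using the sketch above, accessing those cached per-tree predictions might look like this (the real implementation likely organizes its cache differently):

```python
forest = SimpleRandomForest(n_trees=128, max_samples=0.8, feature_split="sqrt")
forest.fit(X_train, y_train)

preds = forest.predict(X_test)
print("forest test accuracy:", (preds == np.asarray(y_test)).mean())

# Per-tree predictions cached during predict(): shape (n_trees, n_samples).
print(forest.tree_predictions_.shape)
print(forest.tree_predictions_[:, 0])  # every tree's vote for the first test sample
```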
Thanks for reading :D