For the Scientific Programming (MSB1015) course, an adjusted version of the Breast Cancer Wisconsin (Diagnostic) Data Set was analysed. This repository contains all the scripts that were used for this analysis.
The original Breast Cancer Wisconsin (Diagnostic) Data Set can be downloaded from Kaggle. However, for the current analysis a modified version of this data set was used. Contact me to access the adjusted data set.
The data set consists of 569 samples and includes the sample ID, the sample diagnosis (malignant (M): 212; benign (B): 357), as well as 30 variables computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These 30 variables describe features of the cell nuclei in these images and encompass the mean, the standard error (SE), and the "worst" value (the mean of the three largest values) of the following 10 characteristics:
- Radius: The mean of distances from center to points on the border of the cell nucleus.
- Texture: The standard deviation of gray-scale values of the digitized image.
- Perimeter: The total length of the border of the cell nucleus.
- Area: The size of the surface of the cell nucleus.
- Smoothness: The local variation in radius lengths.
- Compactness: Perimeter² / Area - 1.0
- Concavity: The severity of concave portions of the contour of the cell nucleus.
- Concave points: The number of concave portions of the contour of the cell nucleus.
- Symmetry: Similarity of the radius length on both sides of the diameter.
- Fractal dimension: Coastline approximation - 1
More information about the variables can be found on page 8 in this paper by Westerdijk (2018).
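As an illustration of how the three summary statistics and the compactness formula above fit together, here is a minimal R sketch. The helper names (`summarise_feature`, `compactness`) and the example values are hypothetical and do not come from the repository's scripts or from the data set. The standard error is assumed to be the sample standard deviation divided by the square root of the number of nuclei.

```r
# Hypothetical helper: summarise one characteristic measured on several
# nuclei of a single image into the three variables described above.
summarise_feature <- function(x) {
  c(
    mean  = mean(x),
    # Assumption: SE = sample standard deviation / sqrt(n)
    se    = sd(x) / sqrt(length(x)),
    # "Worst" = mean of the three largest values
    worst = mean(sort(x, decreasing = TRUE)[1:3])
  )
}

# Hypothetical helper: compactness as defined in the list above.
compactness <- function(perimeter, area) {
  perimeter^2 / area - 1.0
}

# Made-up radius values for five nuclei (illustration only).
radii <- c(1, 2, 3, 4, 5)
summarise_feature(radii)

# For an ideal circle the compactness is 4 * pi - 1, whatever the radius.
r <- 10
compactness(perimeter = 2 * pi * r, area = pi * r^2)
```

A perfect circle gives the smallest possible compactness, so larger values indicate a less regular nuclear outline.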
The aim of the analysis is three-fold:
- Construct a robust classifier to distinguish malignant from benign samples (Classification).
- Identify subclasses within the malignant samples (Clustering).
- Create an app for the prediction and visualization of new samples (App).
When performing the analysis, be aware of the following:
- Put the data file (Data.xlsx) into the main folder (..PATH../ScientificProgramming/).
- Furthermore, it is important to run the scripts in the following order:
  - Pre-processing/Preprocessing.R
  - Classification/Classification.R
  - Clustering/Clustering.R
  - App
- Finally, please follow the instructions in the scripts carefully to ensure a successful analysis.
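The required order can also be expressed as a small R sketch. This is only an illustration: the helper `run_in_order` is hypothetical and not part of the repository, and it assumes that the working directory is the main folder and that each script can run non-interactively. Because the scripts contain their own instructions, opening and running them one by one in RStudio may still be the safer route.

```r
# Scripts in the order given above, relative to the main folder.
analysis_scripts <- c(
  "Pre-processing/Preprocessing.R",
  "Classification/Classification.R",
  "Clustering/Clustering.R"
)

# Hypothetical helper: check that all scripts exist, then run them in order.
# `runner` defaults to source(), which evaluates each script in the global
# environment, so objects created by earlier scripts stay available.
run_in_order <- function(scripts, runner = source) {
  not_found <- scripts[!file.exists(scripts)]
  if (length(not_found) > 0) {
    stop("Script(s) not found: ", paste(not_found, collapse = ", "))
  }
  for (script in scripts) {
    message("Running ", script)
    runner(script)
  }
  invisible(scripts)
}

# Usage (after setting the working directory to the main folder):
# run_in_order(analysis_scripts)
```

Checking for missing files before running anything avoids a half-finished analysis when the working directory is set incorrectly.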
To run the app in RStudio, open App/ui.R, App/server.R, or App/global.R and click on "Run App" in the top right corner of the editor window.
If this is not possible, run the following commands:
# Install the shiny package
install.packages("shiny")
# Load the shiny package
library(shiny)
# Run the shiny app
runApp("..PATH../ScientificProgramming/App")
Feel free to contact me via email: j.koetsier@student.maastrichtuniversity.nl