Scientific Programming Project (MSB1015)

For the Scientific Programming (MSB1015) course, an adjusted version of the Breast Cancer Wisconsin (Diagnostic) Data Set was analysed. This repository contains all the scripts that were used for this analysis.

Data
Research aim
Analysing the data
App
Contact

Data

The original Breast Cancer Wisconsin (Diagnostic) Data Set can be downloaded from Kaggle. However, for the current analysis a modified version of this data set was used. Contact me to access the adjusted data set.

The data set consists of 569 samples and includes the sample ID, the sample diagnosis (malignant (M): 212; benign (B): 357), and 30 variables computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These 30 variables describe features of the cell nuclei in the image and comprise the mean, the standard error (SE), and the "worst" value (the mean of the three largest values) of the following 10 characteristics:

Radius: The mean distance from the center to points on the border of the cell nucleus.
Texture: The standard deviation of the gray-scale values of the digitized image.
Perimeter: The total length of the border of the cell nucleus.
Area: The size of the surface of the cell nucleus.
Smoothness: The local variation in radius lengths.
Compactness: Perimeter² / Area - 1.0
Concavity: The severity of concave portions of the contour of the cell nucleus.
Concave points: The number of concave portions of the contour of the cell nucleus.
Symmetry: Similarity of the radius length on both sides of the diameter.
Fractal dimension: Coastline approximation - 1
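
As an illustration of how one of these characteristics is derived, the compactness formula above can be written as a small R function. This is a minimal sketch, not the code used to create the data set; the function name `compactness` and the circle used as input are assumptions made for the example.

```r
# Hypothetical helper implementing the formula from the list above:
# compactness = perimeter^2 / area - 1.0
compactness <- function(perimeter, area) {
  stopifnot(is.numeric(perimeter), is.numeric(area), all(area > 0))
  perimeter^2 / area - 1.0
}

# Example: a perfectly circular nucleus with radius 5 (arbitrary units).
# For any circle, perimeter^2 / area equals 4 * pi, so the result is 4 * pi - 1.
r <- 5
compactness(perimeter = 2 * pi * r, area = pi * r^2)  # about 11.57
```

A circle gives the smallest possible value, so more irregular contours yield higher compactness.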

More information about the variables can be found on page 8 in this paper by Westerdijk (2018).
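
The layout of the data set (10 characteristics times 3 summary statistics, plus the class counts) can be sketched in R as follows. The variable names built here are hypothetical and only illustrate the structure; the actual column names in the data file may differ.

```r
# The ten nuclear characteristics listed above.
characteristics <- c("radius", "texture", "perimeter", "area", "smoothness",
                     "compactness", "concavity", "concave_points",
                     "symmetry", "fractal_dimension")

# The three summary statistics computed for each characteristic.
statistics <- c("mean", "se", "worst")

# Hypothetical column names (assumption): one per characteristic/statistic pair.
variable_names <- as.vector(outer(characteristics, statistics, paste, sep = "_"))
length(variable_names)  # 30

# Class counts as stated in the description of the data set.
class_counts <- c(M = 212, B = 357)
sum(class_counts)       # 569

# Proportion of malignant and benign samples.
round(class_counts / sum(class_counts), 3)
```

The classes are moderately imbalanced, which may be worth keeping in mind when evaluating a classifier on this data.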

Research aim

The aim of the analysis is three-fold:

Construct a robust classifier to distinguish malignant from benign samples (Classification).
Identify subclasses within the malignant samples (Clustering).
Create an app for the prediction and visualization of new samples (App).

Analysing the data

When performing the analysis, be aware of the following:

Put the data file (Data.xlsx) into the main folder (..PATH../ScientificProgramming/).
Furthermore, it is important to run the scripts in the following order:
- Pre-processing/Preprocessing.R
- Classification/Classification.R
- Clustering/Clustering.R
- App
Finally, please follow the instructions in the scripts carefully to ensure a successful analysis.
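
The prerequisites above can be checked before starting with a small helper like the one below. This is only a sketch: the function `analysis_scripts` is hypothetical and not part of the repository, and it assumes the folder layout described in this section.

```r
# Hypothetical helper (not part of the repository): verify that the data file
# and the scripts are in place, and return the scripts in the required order.
analysis_scripts <- function(project_dir) {
  data_file <- file.path(project_dir, "Data.xlsx")
  if (!file.exists(data_file)) {
    stop("Data.xlsx not found in ", project_dir)
  }

  # Order matters: pre-processing, then classification, then clustering.
  scripts <- file.path(project_dir, c(
    "Pre-processing/Preprocessing.R",
    "Classification/Classification.R",
    "Clustering/Clustering.R"
  ))

  missing <- scripts[!file.exists(scripts)]
  if (length(missing) > 0) {
    stop("Missing script(s): ", paste(missing, collapse = ", "))
  }
  scripts
}

# Usage (replace the placeholder with your own path):
# analysis_scripts("..PATH../ScientificProgramming")
```

The scripts contain their own instructions, so they are best opened and run one at a time in the returned order rather than sourced unattended.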

App

To run the app in RStudio, open App/ui.R, App/server.R, or App/global.R and click "Run App" in the top right corner.

If this is not possible, run the following commands:

# Install the shiny package if it is not installed yet
if (!requireNamespace("shiny", quietly = TRUE)) {
  install.packages("shiny")
}

# Load the shiny package
library(shiny)

# Run the shiny app
runApp("..PATH../ScientificProgramming/App")

Now you can use the classification model to predict the class of new samples!

Contact

Feel free to contact me via email: j.koetsier@student.maastrichtuniversity.nl
