Breast Cancer Survival Prediction

Predicting breast cancer survival using machine learning models with clinical data and gene expression profiles.

About The Project

This is the project for the "Introduction to Machine Learning" course at Sharif University of Technology. In this project, we predict the survival of breast cancer patients using machine learning models applied to clinical data and gene expression profiles. The dataset used in this project is the Breast Cancer Gene Expression Profiles (METABRIC) dataset.

First, we performed preprocessing and EDA on the clinical data and gene expression profiles. Then, we investigated whether a dimensionality reduction approach was needed. Afterwards, we used different machine learning models (classical and neural network methods) to predict patient survival. Finally, we compared the results across models and datasets.

Results

First, we analyzed the effect of age on survival. From the charts below, we can see that patients diagnosed at a younger age have a better chance of surviving; however, the distribution of deaths from the disease largely overlaps with that of the survivors, so the advantage is not large. At the very least, we can say that a diagnosis after age 85 leaves little chance of survival.

Age
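A chart like this can be reproduced with a short pandas/seaborn snippet. This is only a sketch: the file path and the column names (`age_at_diagnosis`, `overall_survival`) are assumptions based on the public METABRIC clinical file, not necessarily what the project notebooks use.

```python
# Hypothetical sketch: box plot of age at diagnosis vs. survival status.
# File path and column names are assumptions, not the project's exact code.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

clinical = pd.read_csv("METABRIC_RNA_Mutation.csv")  # assumed file name
sns.boxplot(data=clinical, x="overall_survival", y="age_at_diagnosis")
plt.xlabel("Overall survival (0 = died, 1 = survived)")
plt.ylabel("Age at diagnosis")
plt.show()
```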

Then, we analyzed the effect of tumor stage on survival. From the chart below, we can see that a patient in stage 1 has a much higher probability of surviving than a patient in stage 2, 3, or 4, although the chance of surviving in stage 2 is still fairly high.

Stage

The next step was to analyze the effect of tumor size on survival. Tumors of every size can be fatal, but larger tumors kill more often. Larger tumors are also more likely to be in stage 3, which, given the results above, makes them more dangerous.

Size

Different therapy methods have different effects on survival. There are three therapy methods in the dataset:

  1. chemotherapy
  2. radio_therapy
  3. hormone_therapy

From the charts below we can see that:

  • Every patient received at least one type of therapy
  • Radio therapy and hormone therapy were more effective and were frequently used together

Therapy

Finally, we analyzed the correlation between different features and survival.

Correlation
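The correlation analysis can be sketched in the same spirit; again, the `clinical` DataFrame and the numeric encoding of the survival label below are assumptions carried over from the earlier snippet, not the project's exact code.

```python
# Hypothetical sketch: correlation heatmap over numeric clinical features.
# Assumes `clinical` is the DataFrame loaded above and the survival label
# is encoded numerically (e.g. `overall_survival` as 0/1).
import seaborn as sns
import matplotlib.pyplot as plt

corr = clinical.select_dtypes("number").corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation between clinical features and survival")
plt.show()
```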

Dimensionality Reduction

We used PCA and UMAP to reduce the dimensionality of the gene expression profiles.
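A minimal sketch of this step is shown below. The column selection and component counts are placeholders, not the values used in the project, and UMAP requires the separate `umap-learn` package.

```python
# Minimal sketch of the dimensionality-reduction step (configuration is
# an assumption; component counts are placeholders).
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import umap  # pip install umap-learn

# Hypothetical selection of gene expression columns from the DataFrame.
X_gene = clinical.filter(regex="^gene_").to_numpy()
X_scaled = StandardScaler().fit_transform(X_gene)

# Linear reduction with PCA, keeping enough components for 95% variance.
X_pca = PCA(n_components=0.95).fit_transform(X_scaled)

# Non-linear reduction with UMAP to a small embedding.
X_umap = umap.UMAP(n_components=10, random_state=42).fit_transform(X_scaled)
```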

Classical Models

We implemented classical classification models for the clinical, gene expression, and reduced gene expression datasets, using the scikit-learn library to build and train them. We did not stop at the Random Forest model; we examined several methods and compared them against each other. This was done for each dataset (i.e. clinical, gene expression, and reduced gene expression). The models used are as follows (a short training sketch appears after the list):

  • Random Forest
  • Logistic Regression
  • AdaBoost
  • XGBoost
  • Support Vector Classifier
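The sketch below shows how such a comparison can be set up for one dataset; `X_clinical` and `y` are hypothetical arrays standing in for the preprocessed features and survival labels, and the hyperparameters are library defaults rather than the project's tuned values.

```python
# Hedged sketch of the classical-model comparison (repeated per dataset).
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Hypothetical preprocessed features and labels.
X_train, X_test, y_train, y_test = train_test_split(
    X_clinical, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Ada Boost": AdaBoostClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "SVC": SVC(probability=True),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: train {model.score(X_train, y_train):.2f}, "
          f"test {model.score(X_test, y_test):.2f}")
```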

The classification results on the clinical data are as follows:

========== Train =========================
      Random Forest :  1.00
      Logistic Regression :  0.79
      Ada Boost :  0.87
      XGBoost :  0.90
      SVC :  0.87
========== Test ==========================
      Random Forest :  0.76
      Logistic Regression :  0.71
      Ada Boost :  0.75
      XGBoost :  0.75
      SVC :  0.76

And here are the ROC curves:

Clinical RoC
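For reference, curves like these can be drawn directly from the fitted estimators; this sketch assumes the `models` dictionary and the test split from the snippet above.

```python
# Hypothetical sketch: ROC curves for the fitted models on one test split.
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

ax = plt.gca()
for name, model in models.items():
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
plt.title("ROC curves on the clinical test set")
plt.show()
```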

The classification results on the gene expression data are as follows:

========== Train =========================
      Random Forest :  0.95
      Logistic Regression :  1.00
      Ada Boost :  0.98
      XGBoost :  1.00
      SVC :  0.93
========== Test ==========================
      Random Forest :  0.64
      Logistic Regression :  0.63
      Ada Boost :  0.58
      XGBoost :  0.66
      SVC :  0.70

And here are the ROC curves:

Gene Expression RoC

Finally, the classification results on the reduced gene expression data are as follows:

========== Train =========================
      Random Forest :  0.80
      Logistic Regression :  0.60
      Ada Boost :  0.80
      XGBoost :  0.85
      SVC :  0.60
========== Test ==========================
      Random Forest :  0.65
      Logistic Regression :  0.64
      Ada Boost :  0.60
      XGBoost :  0.62
      SVC :  0.66

And here are the ROC curves:

Reduced Gene Expression RoC

Neural Network Models

We also implemented a neural network model for the clinical, gene expression, and reduced gene expression datasets.
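The README does not pin down the framework or architecture, so the following is only a sketch of a small binary-classification MLP in PyTorch; the layer sizes, dropout, and training loop are placeholders rather than the project's actual setup.

```python
# Minimal MLP sketch (framework and architecture are assumptions).
import numpy as np
import torch
import torch.nn as nn

class SurvivalMLP(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1),  # single logit for binary survival
        )

    def forward(self, x):
        return self.net(x)

# Assumes X_train / y_train from the classical-model sketch above.
model = SurvivalMLP(n_features=X_train.shape[1])
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X_t = torch.tensor(np.asarray(X_train), dtype=torch.float32)
y_t = torch.tensor(np.asarray(y_train), dtype=torch.float32).unsqueeze(1)

for epoch in range(50):  # epoch count is a placeholder
    optimizer.zero_grad()
    loss = criterion(model(X_t), y_t)
    loss.backward()
    optimizer.step()
```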

The confusion matrices for the different datasets are as follows:

Clinical Gene Expression Reduced Gene Expression

Conclusion

According to the results, the classical models work better on the given datasets: the data is sparse, and building a complex model over it leads to overfitting. Simpler models such as Random Forest and XGBoost therefore reach better results on the clinical and reduced gene expression datasets, and perform on par with the neural network on the gene expression dataset.

Comparison

Contributors