- Introduction and Motivation
- Features
- Dataset
- Technologies Used
- Model Building and Training
- Model Validation and Testing
- Model Evaluation
- Visualizations
- Key Findings
- Report Breakdown
- Interpretation
- Impact
- How to Use
- Contact
- License
This project aims to develop a robust predictive model for assessing the risk of cardiovascular disease (CVD) using the random forest algorithm. Cardiovascular diseases are among the leading causes of mortality globally. Early risk prediction can significantly improve outcomes through timely intervention and preventative measures. This project combines data exploration, visualization, and machine learning to provide insights into CVD risk factors and predict individual risk levels.
- Data Exploration and Visualization: Using Seaborn and Matplotlib for in-depth analysis and visualization of dataset attributes.
- Data Preprocessing: Employing Scikit-learn for data cleaning and preparation tasks.
- Predictive Modeling: Building and training a random forest classifier to accurately predict CVD risk.
- Python
- NumPy
- pandas
- Scikit-learn
- Seaborn
- Matplotlib
The dataset used in this project covers a wide range of variables associated with cardiovascular disease (CVD) risk factors. Each record in the dataset represents an individual respondent with the following attributes:
- General_Health: Overall health rating of the individual (e.g., Excellent, Very Good, Good, Fair, Poor).
- Checkup: Frequency of medical checkups (e.g., Within last year, Within past two years, etc.).
- Exercise: Whether the individual engages in physical activity (Yes/No).
- Heart_Disease: Presence of heart disease (Yes/No).
- Skin_Cancer: History of skin cancer (Yes/No).
- Other_Cancer: History of any cancer other than skin cancer (Yes/No).
- Depression: Indicator of depression (Yes/No).
- Diabetes: Indicates if the individual has diabetes (Yes/No).
- Arthritis: Indicates if the individual has arthritis (Yes/No).
- Sex: Biological sex of the respondent.
- Age_Category: Age range category of the respondent.
- Height_(cm): Height of the individual in centimeters.
- Weight_(kg): Weight of the individual in kilograms.
- BMI: Body Mass Index calculated from height and weight.
- Smoking_History: Smoking habits (Yes/No).
- Alcohol_Consumption: Quantity of alcohol consumption.
- Fruit_Consumption: Quantity of fruit consumption.
- Green_Vegetables_Consumption: Quantity of green vegetable consumption.
- FriedPotato_Consumption: Quantity of fried potato consumption.
This dataset provides a holistic view of factors that could influence the risk of developing cardiovascular diseases, allowing for detailed analysis and modeling to predict CVD risk.
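Most of the attributes above are categorical, so they need a numeric encoding before a scikit-learn model can use them. The following is a minimal sketch of one possible encoding, not the notebook's actual preprocessing. The three sample rows are invented for illustration, and the ordering of the health ratings is an assumption.

```python
import pandas as pd

# Hypothetical three-row sample using column names from the attribute list;
# the values are invented for illustration only.
sample = pd.DataFrame({
    "General_Health": ["Good", "Poor", "Excellent"],
    "Exercise": ["Yes", "No", "Yes"],
    "Heart_Disease": ["No", "Yes", "No"],
    "Sex": ["Female", "Male", "Female"],
    "BMI": [24.1, 31.7, 22.3],
})

# Assumed ordering of the health ratings, from worst to best.
health_order = ["Poor", "Fair", "Good", "Very Good", "Excellent"]

encoded = sample.copy()
# Map the ordered rating onto integers 0..4.
encoded["General_Health"] = encoded["General_Health"].map(
    {label: rank for rank, label in enumerate(health_order)}
)
# Map Yes/No columns onto 1/0.
for column in ["Exercise", "Heart_Disease"]:
    encoded[column] = encoded[column].map({"Yes": 1, "No": 0})
# One-hot encode the remaining nominal column.
encoded = pd.get_dummies(encoded, columns=["Sex"], dtype=int)

print(encoded)
```

An ordinal mapping keeps the natural order of the ratings, while one-hot encoding avoids implying an order where none exists.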
The predictive model for cardiovascular disease risk was built with the RandomForestClassifier from sklearn, which handles datasets that mix categorical and numerical features.
- Algorithm: RandomForestClassifier
- Key Parameter: `n_estimators` sets the number of trees in the forest. Its value was chosen based on preliminary validation to balance generalization against computational cost.
- Training Data: The model was trained on `X_train`, the input features, which cover variables such as age category, sex, BMI, and lifestyle habits.
- Target Variable: `y_train` represents the presence or absence of cardiovascular disease and serves as the model's output.
- Methodology: The training procedure aimed to capture the underlying patterns in the data without overfitting to the training set.
The RandomForestClassifier was chosen for its efficacy in classification tasks, its intrinsic ability to manage overfitting, and its feature importance capabilities, which are instrumental for understanding the predictive power of the various risk factors involved in cardiovascular disease.
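The training step and the feature importances mentioned above can be sketched as follows. This is an illustrative example under stated assumptions, not the notebook's code: the data is a synthetic, imbalanced stand-in generated with `make_classification`, and `n_estimators=100` is scikit-learn's default rather than the value used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded CVD features (not the project's data).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.92, 0.08], random_state=42,
)
# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# n_estimators=100 is a placeholder, not the project's tuned value.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances: one value per feature, summing to 1.
for index, importance in enumerate(model.feature_importances_):
    print(f"feature_{index}: {importance:.3f}")
```

With the real dataset, the importances would be paired with the column names of `X_train` to rank the risk factors.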
- Prediction: Applied the trained model to the testing dataset to predict cardiovascular disease risk.
- Probability Assessment: Computed prediction probabilities, offering insights into model confidence levels for each prediction.
- Classification Report: Generated a detailed classification report using sklearn, providing precision, recall, and f1-score metrics for a comprehensive performance assessment.
- ROC AUC Score: Calculated the area under the receiver operating characteristic (ROC) curve, which measures the model's ability to distinguish between classes.
- The RandomForestClassifier demonstrated promising accuracy in predicting cardiovascular disease risk, as evidenced by the classification report and ROC score.
- Precision: 92% of instances predicted as class 0 are actually class 0.
- Recall: The model correctly identifies 100% of all actual class 0 instances.
- F1-Score: 96%, indicating a very high balance between precision and recall for class 0.
- Support: There are 85,134 actual instances of class 0 in the test set.
- Precision: 48% of instances predicted as class 1 are actually class 1.
- Recall: Only 2% of the actual class 1 instances were correctly identified by the model.
- F1-Score: 5%, indicating a poor balance between precision and recall for class 1.
- Support: There are 7,523 actual instances of class 1 in the test set.
- Accuracy: Overall, the model correctly predicted 92% of all cases. However, this metric can be misleading for imbalanced classes.
- Macro-Average Precision: The unweighted mean of precision across both classes, which ignores class imbalance, is 70%.
- Macro-Average Recall: The unweighted mean of recall across both classes is 51%.
- Macro-Average F1-Score: The unweighted mean of the F1 scores is 50%.
- Weighted Average: Accounts for class imbalance by weighting each class's metric by its number of instances.
- Weighted Precision: 88%.
- Weighted Recall: 92%, the same as accuracy.
- Weighted F1-Score: 88%.
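A report with the metrics listed above can be produced with scikit-learn's `classification_report` and `roc_auc_score`. The sketch below is only an approximation of that workflow: it trains on synthetic imbalanced data rather than the project's dataset, so its numbers will differ from the ones reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real features and target.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.92, 0.08], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Hard class predictions for the test set.
y_pred = model.predict(X_test)
# Predicted probability of class 1, used as the model's confidence score.
y_proba = model.predict_proba(X_test)[:, 1]

# Per-class precision, recall, F1, and support, plus macro/weighted averages.
report = classification_report(y_test, y_pred, zero_division=0)
# ROC AUC is computed from the probabilities, not the hard predictions.
auc = roc_auc_score(y_test, y_proba)

print(report)
print(f"ROC AUC: {auc:.3f}")
```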
While the model performs exceptionally well on class 0 (likely the majority class), it struggles significantly with class 1, as indicated by the low recall and F1-score for class 1. This suggests the model is biased towards the majority class and has difficulties identifying the minority class (class 1), which is a common issue in imbalanced datasets.
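The effect of the imbalance can be checked with simple arithmetic on the figures from the report breakdown. The sketch below uses only the rounded per-class values quoted in this document, so the derived numbers are approximate. It compares the model's accuracy with a trivial baseline that always predicts class 0, and shows how the macro and weighted averages are formed.

```python
# Per-class figures copied from the report breakdown (rounded values).
support = {0: 85134, 1: 7523}
precision = {0: 0.92, 1: 0.48}
recall = {0: 1.00, 1: 0.02}

total = sum(support.values())

# Accuracy of a trivial baseline that always predicts class 0.
baseline_accuracy = support[0] / total

# Macro average: plain mean over classes, ignoring class sizes.
macro_precision = sum(precision.values()) / len(precision)
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean over classes weighted by support.
weighted_precision = sum(precision[c] * support[c] for c in support) / total
weighted_recall = sum(recall[c] * support[c] for c in support) / total

# Approximate number of class 1 cases found at 2% recall.
detected_positives = recall[1] * support[1]

print(f"baseline accuracy:  {baseline_accuracy:.3f}")
print(f"macro precision:    {macro_precision:.2f}")
print(f"macro recall:       {macro_recall:.2f}")
print(f"weighted precision: {weighted_precision:.2f}")
print(f"weighted recall:    {weighted_recall:.2f}")
print(f"class 1 cases detected (approx.): {detected_positives:.0f}")
```

The always-class-0 baseline already reaches about 92% accuracy, which is why the per-class recall and the macro averages are the more informative figures here.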
- Predictive Power: The model supports the prediction of cardiovascular disease risk, potentially informing more targeted preventative measures.
- Model Confidence: Probability assessments provide valuable insights into the model's confidence in its predictions, guiding clinical decision-making processes.
This project is designed to be accessible and straightforward to run using Jupyter Notebooks, a popular tool in data science for interactive computing.
To run the cvd_risk_prediction.ipynb notebook, you'll need Python installed on your system along with Jupyter Notebook or JupyterLab. It's also recommended to use a virtual environment to manage dependencies.
- Clone the Repository: Start by cloning this repository to your local machine.
git clone https://github.com/W0474997SteveArmstrong/cardiovascular-disease-risk-prediction.git
cd cardiovascular-disease-risk-prediction
- Create a Virtual Environment (Optional but recommended):
- For conda users:
conda create --name cvd_risk_prediction python=3.8
conda activate cvd_risk_prediction
- For venv users:
python3 -m venv cvd_risk_prediction
source cvd_risk_prediction/bin/activate # On Windows use `cvd_risk_prediction\Scripts\activate`
- Install Required Packages
pip install numpy pandas jupyterlab matplotlib seaborn scikit-learn
- Running the Notebook
- Navigate to the Notebook Directory: Change directory to the notebooks folder.
cd notebooks
- Launch Jupyter Notebook
jupyter notebook
- Open cvd_risk_prediction.ipynb in the Jupyter Notebook interface and follow the instructions within the notebook to run the analyses.
The notebook includes detailed comments and visualizations to help you understand each step of the process, from data exploration to model evaluation. Here's what to look for:
- Data Exploration and Visualization: Initial sections of the notebook provide insights into the dataset's structure and distribution of variables.
- Model Training: Look for the section where the RandomForestClassifier is trained with the cvd_cleaned.csv dataset.
- Model Evaluation: The final sections will show the model's performance on the test set, including accuracy, precision, recall, and the ROC score. Interpret these metrics to gauge the model's effectiveness in predicting cardiovascular disease risk.
For any questions or discussions, feel free to contact me at steve@stevearmstrong.org.
This project is licensed under the MIT License - see the LICENSE.md file for details.