Preventing heart disease is important. Good data-driven systems for predicting heart disease can improve the entire research and prevention process, making sure that more people can live healthy lives.
In the United States, the Centers for Disease Control and Prevention is a good resource for information about heart disease. According to their website:
- About 610,000 people die of heart disease in the United States every year, which is 1 in every 4 deaths.
- Heart disease is the leading cause of death for both men and women. More than half of the deaths due to heart disease in 2009 were in men.
- Coronary heart disease (CHD) is the most common type of heart disease, killing over 370,000 people annually.
- Every year about 735,000 Americans have a heart attack. Of these, 525,000 are a first heart attack and 210,000 happen in people who have already had a heart attack.
- Heart disease is the leading cause of death for people of most ethnicities in the United States, including African Americans, Hispanics, and whites. For American Indians or Alaska Natives and Asians or Pacific Islanders, heart disease is second only to cancer.
Data is provided courtesy of the Cleveland Heart Disease Database via the UCI Machine Learning Repository.
Our goal is to predict the binary class "heart disease present", which indicates whether or not a patient has heart disease.
- 0 represents no presence of heart disease.
- 1 represents presence of heart disease.
We try several machine learning algorithms in our project, including support vector machines, LinearSVC, logistic regression, K-nearest neighbors, decision trees, random forest, and the XGBoost classifier, and choose the model that works best for our dataset.
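As a rough illustration of this comparison, the sketch below fits several scikit-learn classifiers on one train/test split and ranks them by F1 score. It is not the code from our notebook: the data is synthetic (`make_classification` with 13 features is an assumption), the hyperparameters are library defaults, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the example needs only one library.

```python
# Sketch: compare several scikit-learn classifiers on a held-out test split.
# Synthetic data stands in for the real dataset (assumption for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=180, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Distance- and margin-based models get feature scaling; tree models do not need it.
models = {
    "svc": make_pipeline(StandardScaler(), SVC()),
    "linear_svc": make_pipeline(StandardScaler(), LinearSVC(max_iter=10000)),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    # Stand-in for XGBoost (assumption): scikit-learn's own gradient boosting.
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }

# Pick the model with the highest F1 score on the test split.
best_name = max(results, key=lambda n: results[n]["f1"])
for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
print("best by F1:", best_name)
```

Ranking by F1 is only one possible choice; when missing a patient with heart disease is the costlier error, ranking by recall may be more appropriate.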
We also use a deep learning model, a TensorFlow Keras Sequential model, to assess how well it classifies patients with or without heart disease.
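A Keras Sequential binary classifier of this kind might look like the minimal sketch below. The layer sizes, epoch count, input width of 13 features, and the synthetic data are all assumptions made for illustration; they are not the architecture or data used in our notebook.

```python
# Sketch: a small Keras Sequential model for binary classification.
import numpy as np
import tensorflow as tf

# Synthetic stand-in data: 13 numeric features and a 0/1 label (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

tf.random.set_seed(0)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(13,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    # A single sigmoid unit outputs the probability of class 1.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=16, validation_split=0.2, verbose=0)

# Convert predicted probabilities to 0/1 labels with a 0.5 threshold.
probs = model.predict(X, verbose=0).ravel()
preds = (probs >= 0.5).astype(int)
print(preds[:10])
```

The 0.5 threshold is a convention rather than a requirement; lowering it trades precision for recall.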
- All the ML algorithms performed well on our dataset at classifying patients with or without heart disease. Among them, the Random Forest classifier and the Gradient Boosting classifier gave the best results.
- Both the Random Forest classifier and the Gradient Boosting classifier achieved an accuracy of 0.917, an F1 score of 0.914, a precision of 0.842, and a recall of 1.0.
- Both models also performed well under 10-fold cross-validation. The Random Forest classifier has a cross-validation score of 0.82, and the Gradient Boosting classifier has a cross-validation score of 0.80.
- The deep learning model (TensorFlow Keras Sequential model) gave us an accuracy of 0.861, a precision of 0.824, a recall of 0.875, and an F1 score of 0.848.
- We recommend either of the two models (Random Forest classifier or Gradient Boosting classifier) for our project.
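The results above can be sketched in two parts: a 10-fold cross-validation of the two recommended models, and a check that the reported F1 scores follow from the reported precision and recall. This is an illustrative sketch, not our notebook code. The data is synthetic, the models use default hyperparameters, and accuracy as the cross-validation metric is an assumption. The helper `f1_from` is a hypothetical name introduced here.

```python
# Sketch: 10-fold cross-validation plus an F1 consistency check.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real dataset (assumption for illustration).
X, y = make_classification(
    n_samples=180, n_features=13, n_informative=6, random_state=0
)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

cv_scores = {}
for name, model in {
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.2f} std={scores.std():.2f}")


def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision 0.842 and recall 1.0 reproduce the reported F1 of 0.914.
print(round(f1_from(0.842, 1.0), 3))  # → 0.914
# Precision 0.824 and recall 0.875 give about 0.849, which matches the
# reported 0.848 up to rounding of the inputs.
print(f1_from(0.824, 0.875))
```

A cross-validation score noticeably below the single-split accuracy (0.82 versus 0.917 for the random forest) is a hint that the single test split may have been a favorable one, so the cross-validated figure is the more cautious estimate.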
Feel free to send us suggestions for improving or correcting our code, or to point out any mistakes or incorrect procedures in our project notebook. We are still learning machine learning and data science, so feedback from anyone who has worked on this type of project will help us correct our mistakes and follow the right procedures.