GeorgeNich/Data-Mining-and-Machine-learning

Data-Mining-and-Machine-Learning

Overview

This repository showcases a series of projects and assignments in the field of data mining and machine learning, with a particular focus on analyzing the Pima Indian Diabetes dataset. It includes detailed explorations of data cleaning, visualization techniques, and the development of machine learning models aimed at providing insightful analyses and predictions. This collection of work is a reflection of my skills and understanding in applying data science methodologies to real-world datasets.

Projects

Data Cleaning and Visualising

  • Description: This project applies data cleaning and visualization techniques to prepare the dataset for predictive modeling.
  • Tools & Technologies: Python, Pandas, Matplotlib, Seaborn

Developing a Machine Learning model

  • Description: Building on the cleaned dataset, this project develops a machine learning model to predict the likelihood of diabetes in the Pima Indian population.
  • Tools & Technologies: Python, Scikit-learn, Jupyter Notebook
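
As a rough illustration of the data cleaning stage, the sketch below shows one common way to handle this dataset: several measurement columns use 0 where a value is missing, so zeros can be treated as missing and imputed. This is not taken from the notebooks in this repository. The miniature DataFrame, the helper name `clean_zeros`, and the choice of median imputation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the dataset. Column names follow the
# commonly distributed CSV; the values are made up for illustration.
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "BMI": [33.6, 26.6, 23.3, 28.1, 0.0],
    "Outcome": [1, 0, 1, 0, 1],
})


def clean_zeros(frame, columns):
    """Treat zeros in `columns` as missing and fill them with the column median.

    Median imputation is an assumption of this sketch, not necessarily the
    strategy used in the notebooks.
    """
    cleaned = frame.copy()
    # A zero glucose, blood pressure, or BMI reading is not physiologically
    # plausible, so mark it as missing first.
    cleaned[columns] = cleaned[columns].replace(0, np.nan)
    # The median is computed from the remaining (non-missing) values only.
    cleaned[columns] = cleaned[columns].fillna(cleaned[columns].median())
    return cleaned


measure_cols = ["Glucose", "BloodPressure", "BMI"]
cleaned = clean_zeros(df, measure_cols)
print(cleaned)
```

The helper works on a copy, so the raw data stays available for comparison plots before and after cleaning.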

Key Features

  • In-depth analysis of the Pima Indian Diabetes dataset, focusing on identifying key patterns and relationships.
  • Comprehensive data cleaning and visualization to prepare the dataset for predictive modeling.
  • Application of various data preprocessing techniques, including Principal Component Analysis (PCA) for dimensionality reduction.
  • Exploration of multiple machine learning models for classification, including KNN and decision tree models.
  • Use of advanced model selection techniques like stratified sampling in KNN and 10-fold cross-validation in decision tree models.
  • Implementation of supervised feature selection techniques and the filter method to enhance model accuracy and efficiency.
  • Extensive evaluation of model performance using metrics like accuracy, precision, recall, and F1-score to ensure robustness and reliability.
  • Dedication to optimizing model parameters and methodology to achieve the highest possible prediction accuracy.
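
The techniques listed above can be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions, not the code from the notebooks: it uses synthetic data from `make_classification` with eight features in place of the real dataset, and the hyperparameters (`n_neighbors=5`, `k=5`, `max_depth=4`, a 95% variance threshold for PCA) are placeholder values chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset (assumption): 8 features, imbalanced classes.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=2,
    weights=[0.65, 0.35], random_state=0,
)

# Stratified sampling keeps the class ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# KNN with scaling and a filter-method feature selector (ANOVA F-score).
knn = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=5),
    KNeighborsClassifier(n_neighbors=5),
)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

knn_scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}

# Decision tree evaluated with 10-fold cross-validation.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
tree_cv_accuracy = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")

# PCA on standardized training features, keeping enough components to
# explain 95% of the variance.
scaled_train = StandardScaler().fit_transform(X_train)
pca = PCA(n_components=0.95, svd_solver="full").fit(scaled_train)

print("KNN test metrics:", knn_scores)
print("Decision tree mean 10-fold accuracy:", tree_cv_accuracy.mean())
print("PCA components kept:", pca.n_components_)
```

Wrapping the scaler and feature selector in a pipeline means they are fitted only on training data, which avoids leaking information from the test set into the KNN model.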

How to Navigate this Repository

  • Explore the Data Cleaning and Visualising folder for notebooks and data files covering the initial stages of the data science pipeline: data collection, visualization, and analysis.
  • Visit the Developing a Machine Learning model folder for the machine learning techniques applied to the processed dataset, including the models used to predict diabetes and the evaluation of their accuracy.

Acknowledgments

I want to acknowledge my tutor and colleagues for the experience of working with them throughout the year, which helped me unlock my full potential in completing these projects. All the best to them.
