Author: Sai Prathyusha Kanisetti
Institution: George Washington University
Instructor: Prof. David W. Trott
Course: Machine Learning (CSCI 6364)
Date: 05/08/2024
Alcohol consumption poses significant public health challenges and is linked to a range of physical and mental health disorders. This project uses machine learning to classify individuals as drinkers or non-drinkers based on body signal data, with the aim of supporting a better understanding of drinking habits and potential health interventions.
The dataset, sourced from Kaggle, contains:
- Observations: 991,346 (after cleaning: 906,676)
- Features: 24 (22 numerical, 2 categorical)
- Key metrics include hemoglobin, glucose levels, height, weight, etc.
Preprocessing:
- Duplicate removal (26 entries)
- Outlier analysis and removal:
  - Waistline (e.g., 999.0)
  - Cholesterol levels (e.g., HDL 8110, LDL 5119)
- Feature engineering (e.g., identifying blindness from eyesight metrics)
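
The preprocessing steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual script: the column names (`waistline`, `HDL_chole`, `LDL_chole`, `sight_left`, `sight_right`), the plausibility limits, and the use of 9.9 as the eyesight value that marks blindness are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical column names; adjust them to match the actual CSV header.
WAIST, HDL, LDL = "waistline", "HDL_chole", "LDL_chole"
SIGHT_L, SIGHT_R = "sight_left", "sight_right"


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates and implausible readings, then add blindness flags."""
    df = df.drop_duplicates()

    # Assumed plausibility limits, chosen only to exclude sentinel-like
    # values such as waistline 999.0, HDL 8110, or LDL 5119.
    plausible = (df[WAIST] < 200) & (df[HDL] < 300) & (df[LDL] < 500)
    df = df.loc[plausible].copy()

    # Assumption: an eyesight reading of 9.9 encodes blindness.
    df["blind_left"] = df[SIGHT_L].eq(9.9)
    df["blind_right"] = df[SIGHT_R].eq(9.9)
    return df.reset_index(drop=True)


# Toy rows for illustration only: one duplicate pair and two outliers.
sample = pd.DataFrame({
    WAIST: [80.0, 80.0, 999.0, 75.0],
    HDL: [55.0, 55.0, 60.0, 8110.0],
    LDL: [120.0, 120.0, 110.0, 100.0],
    SIGHT_L: [1.0, 1.0, 0.8, 9.9],
    SIGHT_R: [9.9, 9.9, 1.2, 1.0],
})
clean = preprocess(sample)
print(clean)
```

Fixed limits keep the sketch short; a percentile- or IQR-based rule would be a reasonable alternative if the thresholds should adapt to the data.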
Exploratory data analysis (EDA) findings:
- Class Distribution: Balanced between drinkers and non-drinkers.
- Gender Analysis: Male participants are more likely to drink than females.
- Age Trends: Younger and middle-aged individuals consume more alcohol.
- Smoking-Alcohol Correlation: Non-smokers exhibit higher alcohol consumption.
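
Findings like these typically come from simple group-wise summaries. The sketch below shows one way to compute them with pandas; the column names (`sex`, `age`, `DRK_YN`), the age-band edges, and the toy rows are assumptions for illustration and do not reproduce the project's data or results.

```python
import pandas as pd

# Hypothetical column names and toy values, for illustration only.
df = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "age": [25, 45, 30, 65, 70, 50],
    "DRK_YN": ["Y", "Y", "N", "N", "N", "Y"],
})

# Boolean target makes group means read directly as drinking rates.
df["drinker"] = df["DRK_YN"].eq("Y")

# Overall class balance (share of each label).
class_share = df["DRK_YN"].value_counts(normalize=True)

# Drinking rate by gender.
rate_by_sex = df.groupby("sex")["drinker"].mean()

# Drinking rate by age band; the band edges are an arbitrary choice.
bands = pd.cut(df["age"], bins=[0, 39, 59, 120], labels=["<40", "40-59", "60+"])
rate_by_age = df.groupby(bands, observed=True)["drinker"].mean()

print(class_share, rate_by_sex, rate_by_age, sep="\n\n")
```

The same `groupby(...).mean()` pattern would apply to a smoking-status column when checking the smoking-alcohol relationship.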
Research questions:
- Which age group is most habituated to drinking alcohol?
- Does every individual who drinks also smoke?
- Is there a significant impact of alcohol on eyesight?
- Does regular alcohol consumption affect the liver?
Three algorithms were employed:
- Random Forest Classifier
- Gradient Boosting
- XGBoost
Two feature sets were used:
- Set 1: Gamma-GTP, HDL-cholesterol, age, smoking status, etc.
- Set 2: Set 1 plus additional variables such as left/right eyesight, triglycerides, and hemoglobin.
Results:
- XGBoost outperformed the other two models in both predictive accuracy and efficiency.
- Cross-validation yielded a mean score of ~0.735 with minimal variance.
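
A cross-validated comparison of this kind can be sketched with scikit-learn as shown below. The data here is synthetic (`make_classification`), the hyperparameters are arbitrary, and accuracy is assumed as the scoring metric, since the metric behind the ~0.735 figure is not stated; the numbers it prints are unrelated to the project's results. XGBoost is left as a commented line because it needs the separate `xgboost` package, whose `XGBClassifier` follows the same estimator interface.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned dataset; not the project's data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # With the xgboost package installed, a third entry could be added:
    # "xgboost": xgboost.XGBClassifier(n_estimators=50, random_state=0),
}

cv_summary = {}
for name, model in models.items():
    # 5-fold cross-validation; accuracy is an assumed choice of metric.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the standard deviation alongside the mean is what supports a statement such as "minimal variance" across folds.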
XGBoost proved most effective due to its:
- High accuracy
- Resource efficiency
- Versatility in handling noisy and incomplete datasets
The insights derived from this project underscore the critical health impacts of alcohol and highlight the value of machine learning in public health research.
Repository contents:
- Data: [Dataset from Kaggle] (not included here)
- Scripts: Preprocessing, EDA, and model training files
- Models: Trained models for Random Forest, Gradient Boosting, and XGBoost
Usage:
- Preprocess the dataset:
  - Remove duplicates and outliers.
  - Engineer relevant features.
- Perform EDA to gain insights into class distributions and trends.
- Train models using the scripts provided.
- Evaluate models with metrics like F1-score and ROC-AUC curves.
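
To make the two evaluation metrics concrete, the sketch below computes them by hand on four toy predictions. It is only meant to illustrate the definitions: the function names are hypothetical, the 0.5 decision threshold is an assumption, and the pairwise ROC-AUC is quadratic in the sample size, so on the full dataset `sklearn.metrics.f1_score` and `sklearn.metrics.roc_auc_score` would be the practical choice.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, y_score):
    """ROC-AUC as the chance that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in y_score]  # assumed threshold of 0.5

print(f"F1 = {f1_score(y_true, y_pred):.3f}")
print(f"ROC-AUC = {roc_auc(y_true, y_score):.3f}")
```

F1 depends on the chosen threshold, while ROC-AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.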
Future work:
- Extend analysis to other health metrics.
- Explore deep learning models for feature representation.
- Implement real-time predictions in a healthcare setting.