View the notebook on Databricks: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/173542347700804/2929076465318146/6176203754563543/latest.html
The project was carried out using Apache Spark on Databricks, utilizing Python and SQL.
The goal of this project is to analyze the Body Fat Prediction Dataset (https://www.kaggle.com/datasets/fedesoriano/body-fat-prediction-dataset) and build models that predict body fat percentage. The dataset contains the following variables:
- Density determined from underwater weighing
- Percent body fat from Siri's (1956) equation
- Age (years)
- Weight (lbs)
- Height (inches)
- Neck circumference (cm)
- Chest circumference (cm)
- Abdomen circumference (cm)
- Hip circumference (cm)
- Thigh circumference (cm)
- Knee circumference (cm)
- Ankle circumference (cm)
- Biceps (extended) circumference (cm)
- Forearm circumference (cm)
- Wrist circumference (cm)
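
The first two variables are linked: Siri's (1956) equation converts body density into percent body fat as `495 / density - 450`. A minimal sketch of that conversion follows. The function name and the sample density value are illustrative assumptions, not code taken from the notebook.

```python
def siri_body_fat_percent(density: float) -> float:
    """Estimate percent body fat from body density (g/cm^3) with Siri's (1956) equation."""
    if density <= 0:
        raise ValueError("density must be positive")
    # Siri (1956): percent body fat = 495 / density - 450
    return 495.0 / density - 450.0


# Illustrative density value (an assumption chosen for this example).
print(round(siri_body_fat_percent(1.0708), 1))  # prints 12.3
```

Because body fat percentage is derived directly from density, a model given density as a feature can recover the target almost exactly, so it is worth checking which columns are used as predictors.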
The project covers the following steps:
- Loading and preprocessing of the dataset
- Statistical analysis of the data
- Exploratory Data Analysis to uncover patterns and insights
- Correlation Analysis to understand relationships between variables
- Utilizing linear and tree-based regression models to predict Body Fat percentage

The Root Mean Squared Error (RMSE) for each model on the test data was:
- Linear Regression: 0.622103
- Decision Tree Regression: 0.96897
- Gradient-Boosted Tree Regression: 0.891016

Lower RMSE means better predictions, so these results show that the Linear Regression model was the most effective at predicting Body Fat percentage, outperforming both the Decision Tree Regression and Gradient-Boosted Tree Regression models.
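
For readers unfamiliar with the metric, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it in plain Python. The `rmse` helper and the sample values are made up for illustration and are not the project's data or code. In Spark ML, the same metric is available through `RegressionEvaluator` with `metricName="rmse"`, though the notebook's exact evaluation code is not reproduced here.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical body fat percentages, chosen only to illustrate the calculation.
actual = [12.0, 20.0, 25.0]
predicted = [13.0, 18.0, 25.0]
print(round(rmse(actual, predicted), 3))  # prints 1.291
```

RMSE is expressed in the same units as the target, so a value of 0.62 corresponds to a typical error of roughly 0.62 percentage points of body fat.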