Here at Big Research Co.®, we love data so much that everyone wears a Fitbit - even our employees! We believe these watches are the next step in the Big Data industry and will enhance our current research, which spans fitness equipment, drug trials, and very ethical human experimentation.
Everybody makes mistakes, even here at B.R.Co.®! Someone in a lab coat mixed up the labels on our Fitbit data, and one dataset was left out. This Dr. Lab Coat handed me a USB drive containing the data in question. I need to determine the characteristics of the person wearing the Fitbit (did I mention it could be an employee or a test subject?). That's not all: Dr. Lab Coat obscurely asked for predictions on the next two weeks of data that will be missing. Missing because of the mix-up? Who knows - I just started here a week ago.
Deliverables
- Predictions.csv
  - a file of predictions for the missing two weeks of data
- Analysis.ipynb
  - notebook detailing the process used to obtain my predictions and conclusions
- Prepare.ipynb
  - notebook detailing the process to clean the data from raw to finished
- Summary of the data
  - what was the individual like?
- Presentation
  - two content slides
  - at least one visual
Are you dying to know what's on the USB drive yet? Here's the low-down below.
After running the Prepare.py module, the data frame will contain the columns described below. Only the first two tables from the files were included, as the majority of the food log table had no data.
Column Name | Description |
---|---|
date | yyyy-mm-dd, df index |
cals_burned | calories burned for the day |
steps | steps taken in the day |
dist | distance walked, possibly in miles |
floors | uncertain, possible floors walked up or down |
mins_sedentary | minutes of the day sedentary |
mins_lightly_active | minutes of the day lightly active |
mins_fairly_active | minutes of the day fairly active |
mins_very_active | minutes of the day very active |
activity_cals | uncertain, possibly calories burned due to active minutes |
month | month of the observation |
weekday | weekday of the observation |
Data on the USB drive came in eight separate files, each holding one month's worth of observations. To join the files, each was uploaded into a single Google Sheets document, with one file per sheet. The data was also combined into one sheet. The final sheet exported as a CSV file did not include the food log, as it contained more than 95% zero values. Columns with comma-formatted numbers (like 2,345) were converted to integers with no commas. To look at this Google Sheet, click here.
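The comma-to-integer conversion can also be reproduced in pandas with the `thousands` argument to `read_csv`; a minimal sketch on made-up rows (the actual cleanup was done in Google Sheets):

```python
import io

import pandas as pd

# Made-up excerpt of one monthly export; numeric columns use commas
# as thousands separators, as in the raw files.
raw = io.StringIO(
    'date,cals_burned,steps\n'
    '2018-04-01,"2,345","10,120"\n'
    '2018-04-02,"1,980","8,455"\n'
)

# thousands="," tells pandas to parse "2,345" as the integer 2345.
df = pd.read_csv(raw, thousands=",")
```

The eight monthly frames could then be stacked into one with `pd.concat` before exporting a single CSV.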
The function to prep the data is in the Prepare.py module. After running the function:
- the date column is converted to a datetime type.
- the index is set to the date column.
- additional columns hold the observation month and weekday.
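A sketch of what such a prep function might look like (the function name is an assumption; the real implementation lives in Prepare.py):

```python
import pandas as pd

def prep_fitbit(df):
    # Parse dates, index on them, and add month/weekday columns.
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    df = df.set_index("date").sort_index()
    df["month"] = df.index.month_name()
    df["weekday"] = df.index.day_name()
    return df

# Tiny made-up frame to show the effect.
raw = pd.DataFrame({"date": ["2018-04-02", "2018-04-01"], "steps": [8455, 10120]})
df = prep_fitbit(raw)
```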
A separate function in the Prepare.py module splits the data into:
- 50% train
- 30% validate
- 20% test
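Because this is time-series data, the split must be chronological rather than shuffled; a sketch of such a splitter (the real one is in Prepare.py):

```python
import pandas as pd

def split_time_series(df):
    # Earliest 50% trains, next 30% validates, latest 20% tests;
    # shuffling would leak future observations into training.
    n = len(df)
    train = df.iloc[: int(n * 0.5)]
    validate = df.iloc[int(n * 0.5) : int(n * 0.8)]
    test = df.iloc[int(n * 0.8) :]
    return train, validate, test

# Usage on a toy frame of ten rows.
df = pd.DataFrame({"x": range(10)})
train, validate, test = split_time_series(df)
```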
Additional prep was not necessary, as the raw data was fairly clean. There were no nulls, and the only column with possible outliers is floors. More features may be added later during the explore and model phases.
Univariate analysis was done on the whole dataframe to determine the distributions of individual features. Further time-series analysis was completed on the train df. Data was resampled into weekly, bi-weekly, and monthly periods to visualize any trends. Conclusions about the individual were drawn from this exploration.
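The resampling step looks roughly like this (toy series standing in for one feature of the train split):

```python
import numpy as np
import pandas as pd

# Toy daily series standing in for one feature of the train split.
idx = pd.date_range("2018-04-01", periods=56, freq="D")
steps = pd.Series(np.arange(56), index=idx)

# Mean value per weekly, bi-weekly, and monthly period; each of these
# can then be plotted to look for trend.
weekly = steps.resample("W").mean()
biweekly = steps.resample("2W").mean()
monthly = steps.resample("M").mean()
```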
Five different models were created and tested on the split dataframes:
- Simple average (the baseline)
  - predicts each future observation as the overall average
- Weekly rolling average
  - predicts each future observation as the most recent weekly average
- Monthly rolling average
  - predicts each future observation as the most recent monthly average
- Holt's Linear Trend
  - exponential smoothing applied to both the level and the trend
  - one model with optimized=True
  - one model with alpha = .1 and beta = .1
To determine which model was best for each feature (df column):
- Predicted Validate by fitting models on Train.
- Calculated RMSE. Created an evaluation df to hold the feature RMSE with the model name.
- Combined Train and Validate. Predicted Test by fitting models on the Train+Validate df.
- Calculated RMSE and added it as a new column in the evaluation df.
An example of the evaluation df:
Target Name | Model Name | rmse | test rmse |
---|---|---|---|
calories burned | simple average | 123.45 | 234.56 |
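The RMSE bookkeeping can be sketched like so (made-up numbers; the column names are assumptions based on the table above):

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted):
    # Root mean squared error between two aligned series.
    return float(np.sqrt(((actual - predicted) ** 2).mean()))

# One row per (feature, model) pair; test_rmse is filled in later.
eval_df = pd.DataFrame(columns=["target", "model", "rmse", "test_rmse"])

# Made-up validate values vs. a flat simple-average prediction.
actual = pd.Series([2300.0, 2400.0, 2500.0])
predicted = pd.Series([2345.0, 2345.0, 2345.0])

eval_df.loc[len(eval_df)] = ["cals_burned", "simple average", rmse(actual, predicted), np.nan]
```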
- Evaluated models for each feature with the eval_df, taking both RMSE values into consideration
- Once the models were chosen, predicted the next two weeks of unknown data with models fit on the whole df
- Combined the predictions into one dataframe to save as a .csv file
The final models chosen to predict the next two weeks of data were a rolling 7-day average and an optimized Holt model. These predictions are uploaded to this repo in the file titled Predictions.csv.
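A sketch of producing the prediction file with the rolling 7-day average (toy data; the real run fit on the full cleaned df and included every feature):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the full cleaned df.
idx = pd.date_range("2018-04-01", periods=30, freq="D")
df = pd.DataFrame({"steps": 8000 + np.arange(30)}, index=idx)

# Hold the most recent 7-day mean flat for the missing two weeks.
future_idx = pd.date_range(idx[-1] + pd.Timedelta(days=1), periods=14)
predictions = pd.DataFrame({"steps": df["steps"].tail(7).mean()}, index=future_idx)
predictions.index.name = "date"
predictions.to_csv("Predictions.csv")
```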
Create functions for modeling in a Model.py module to clean up the final notebook.
- Read this Readme
- Download Prepare.py and Analysis.ipynb into your working directory
- Run the notebook or do your own exploration and modeling
Bethany Thompson
Feel free to reach out to me with any questions, comments, or suggestions!