-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathAssignment 1, RMD.Rmd
346 lines (263 loc) · 21.2 KB
/
Assignment 1, RMD.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
---
title: "JHU5, Assignment 1"
author: "Bryan Murphy"
date: "2023-04-03"
output:
prettydoc::html_pretty:
theme: hpstr
highlight: github
---
```{r setup, include=TRUE}
# This is our initial setup block, named setup, with include = TRUE so it will show up.
# First we'll set any global variables we care about...
knitr::opts_chunk$set(echo = TRUE)
options(scipen=999) # This stops knitr from displaying 5 digit numbers as Scientific Notation.
# And then load our data.
actdata <- read.csv("repdata_data_activity/activity.csv")
# And the only library call we'll need:
library(tidyverse)
```
<h2 align = "center"> **Q1: What is the mean total number of steps taken per day? ** </h2>
In order to answer this question, we first have to manipulate our base dataset, which I've named `actdata` (for *"activity data"*) to group by date, then summarize this group date by the variable "steps". This will give us our daily step count output, such as might be tracked by a device like a Fitbit. We see the code to accomplish this transformation below. In this variable, I've also replaced all `NA` entries with the number $0$, to make it possible to use this daily step count data for subsequent transformations. I'll thus call our grouped, cleaned dataset `CleanDailySteps`. Notice this code uses the piping operator, made possible because of our call `library(tidyverse)` in the setup code chunk at the beginning of this RMD file.
```{r}
group_by(actdata, date) %>%
summarize(DailyStepCount = sum(steps)) %>%
replace_na(list(date = 0, DailyStepCount = 0)) -> CleanDailySteps
```
Looking at the structure of this file: `{r} ` - we can see that we have created a $61 by 2$ tibble with two columns, *date* and *DailyStepCount.* This lets us know that the dataset runs across 61 days, which will be important when calculating any averages of the step data.
```{r}
str(CleanDailySteps)
```
We can also look at the `head` of our dataset to see what it looks like:
```{r}
head(CleanDailySteps)
```
Let's look at some descriptive statistics for `DailyStepCount`, including, most importantly, **the mean total number of steps taken per day** across the 61 days in our dataset.
```{r}
mean(CleanDailySteps$DailyStepCount) #Mean total steps per day
sum(CleanDailySteps$DailyStepCount)/61 #Calculating mean manually
median(CleanDailySteps$DailyStepCount) #Median total steps per day
summary(CleanDailySteps)
```
We see that the mean number of steps per day is $9,354$, while the median is $10,395$, telling us that this data set is **skewed to the left**, with a number of low or zero step count days pulling down the mean of the column DailyStepCount.
#### **A histogram of the total number of steps taken each day.**
While we can make a histogram using the bar chart geometry with ggplot2, we're going to use `geom_histogram` to more clearly demonstrate that we are creating a histogram specifically. We're going to use `+theme` modifiers to remove all vertical gridlines and reformat the horizontal gridlines to make the chart a little easier to look at.
```{r warning=FALSE}
ggplot(CleanDailySteps, aes(x=DailyStepCount)) +
geom_histogram(bins = 20, fill = "navajowhite", color = "midnightblue") +
labs(title = "Histogram of Daily Step Counts, 20 Bins", y = "Count", x = "Total Steps Taken / Day") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(panel.background = element_rect(fill = "lightskyblue1"),
plot.background = element_rect(fill = "lightskyblue1"),
panel.ontop = FALSE) +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "red4",
size = 0.75,
linetype = 2),
panel.grid.minor.y = element_line(color = "red4",
size = 0.25,
linetype = 2))
```
<h2 align="center">**Q2: What is the average daily activity pattern?**</h2>
Just from the histogram above, we can see that the daily activity pattern follows a somewhat normal, albeit very left skewed, distribution, with relatively fewer unusually high and unusually low step count days, but a large number of 0 step count days (which could possibly represent something like days where the person from whom the data was being collected forgot to wear their Fitbit or other tracking device).
But we can visualize the daily activity pattern across the hours of an individual day, as well.
**Creating the dataset that will allow use to look at the activity level across the span of a single day**
In order to be able to look at the average activity level per 5-minute interval for an average day, we need to group our original dataset, `actdata`, by the variable `interval`, sum the total steps for each `interval` across the 61 days of the dataset, and then divide this sum by 61 to get the average number of steps taken during that 5-minute interval on an average day.
The code below shows the steps we need to take to accomplish this transformation,resulting in the creation of a data frame, `interval_avgs`, that we can plot as a time series. Note that the very first step is removing the *NAs* from `actdata` and replacing them with $0's$ so as to prevent breaking any subsequent operations.
We continue the practice of using the piping operator `%>%` to make the linear nature of the transformation operations more obvious, and to avoid creating unnecessary intermediate variables that don't actually need to exist permanently.
```{r}
CleanActData <- replace_na(actdata,list(steps = 0, date = 0, interval = 0 ))
CleanActData %>% select(steps, interval) %>%
group_by(interval) %>%
summarize(TotSteps = sum(steps)) %>%
mutate(AvgSteps = TotSteps/61) -> interval_avgs
head(interval_avgs)
```
Now we can use ggplot2 to plot `interval_avgs` with the 5-minute interval on the x-axis and the average number of steps taken during that interval on the y-axis.
```{r}
ggplot(interval_avgs, aes(x = interval, y = AvgSteps)) +
geom_line(color = "red2", linetype = 1) +
labs(title = "Average Steps Taken During Each \nFive Minute Interval Across All Days", y = "Average STeps Taken", x = "Time of 5-Minute Interval During the Day") +
theme(panel.background = element_rect(fill = "lavenderblush1"),
plot.background = element_rect(fill = "lavenderblush1"),
panel.ontop = FALSE) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(panel.grid.major.x = element_line(color = "lightsteelblue4"),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "lightsteelblue4",
size = 0.75,
linetype = 2),
panel.grid.minor.y = element_blank())
```
We can see that the user(s) this data was collected from tends to wake up a little bit after 5 A.M. each day,their activity tends to peak around 8 or 9 am, and they then remain fairly consistently active from 10 am until around 7 pm or so, and after 8 pm their activity drops dramatically, possibly indicating they tend to go to sleep within a few hours of that time.
The 5-minute interval, averaged across all the days in the dataset, that tends to contain the maximum number of steps is `r interval_avgs[interval_avgs$AvgSteps == max(interval_avgs$AvgSteps),1]` am, corresponding to `r round(interval_avgs[interval_avgs$AvgSteps == max(interval_avgs$AvgSteps),3])` steps taken during this 5 minute period, on average. Maybe this time period represents part of the participant(s) daily commute, for example.
<h1 align="center">**Imputing missing values** </h1>
This was the trickest part of the assignment, in my opinion.
Looking at the data, we can see that there are a significant number of observations in the dataset, exclusively in the "steps" column.
```{r}
sapply(actdata,function(x) sum(is.na(x)))
```
In total, there are `r sum(sapply(actdata,function(x) sum(is.na(x))))` missing values in the dataset, making up `r round(sum(sapply(actdata,function(x) sum(is.na(x)))) / nrow(actdata) * 100, 2)`% of the observations in our base dataset `actdata`.
In our calculations above, we simply replaced the missing values with $0's$. This might be a reasonable assumption at some points, such as for the intervals representing the times between 2 and 4 am, for example, but there are certainly missing observations during intervals where the source of the data probably was taking steps.
A more reasonable but still relatively simple method of filling in these missing values is by replacing any missing step values with the average number of steps taken during that 5-minute interval across the entire dataset. As a bonus, this will replace missing values with $0'$ during intervals where the average number of steps taken during that interval was $0$, such as during time periods when the data source was always asleep.
We can accomplish this with a `for` loop, creating a new dataset called `imputed`. See below.
```{r}
imputed <- actdata
for (i in 1:17568) {
if ( is.na(actdata[i,1]) == TRUE) {
imputed[i,1] <- interval_avgs[interval_avgs$interval == actdata[i,3],3]
}
}
```
If we look at our new dataset `imputed`, we can see that the number of missing values is now `r sum(sapply(imputed, function(x) is.na(x)))`, which is what we wanted. But if we look at the summary of `imputed` and the summary of our original `actdata`, we can see that while the mean and median number of steps taken in a 5-minute interval in the original data was *37.28* and *$0$*, respectively, in the new imputed dataset, the mean number of steps taken in a 5-minute interval is `r round(mean(imputed$steps),2)`, while the median number of steps taken is still `r median(imputed$steps)` steps. We can see these facts in the two outputs of a summary function call, below.
```{r}
summary(imputed$steps)
summary(actdata$steps)
```
This result seems confusing until we analyze it a little more closely. In both datasets, the median is 0, implying that on average, the source of the data does not move at all during any particular 5 minute interval. The means in both data sets are above 0, because the source of the data obviously does move at some point, but substituting imputed data in the place of missing observations brings down the mean of the `steps` observations, implying that many of the missing step values from `actdata` were replaced with low or zero step counts.
Looking at the histograms of the step counts of the original dataset alongside the imputed dataset makes the impact of filling in missing values with imputed values more obvious.
But before we can do that, we need to replicate the process we used to create `CleanDailySteps` to create an `ImputedDailySteps` data frame.
```{r}
group_by(imputed, date) %>%
summarize(DailyStepCount = sum(steps)) %>%
replace_na(list(date = 0, DailyStepCount = 0)) -> ImputedDailySteps
summary(ImputedDailySteps)
summary(CleanDailySteps)
```
Comparing the summary() outputs for `CleanDailySteps` and `ImputedDailySteps`, we can observe the facts.
1) In the original cleaned dataset,the mean number of daily steps taken was **`r round(mean(CleanDailySteps$DailyStepCount),2)` steps**, while the median number of daily steps taken was **`r median(CleanDailySteps$DailyStepCount)` steps.**
2) In the new dataset in which missing values were replaced by imputed values,the mean number of daily steps taken was **`r round(mean(ImputedDailySteps$DailyStepCount),0)` steps**, while the median number of daily steps taken was **`r median(ImputedDailySteps$DailyStepCount)` steps.**
+ Thus, by substituting imputed values for missing values and then calculating daily step counts, we can see that while the original data set was skewed left, with a mean lower than the median, in the imputed data set, the daily step counts are skewed right.
+ This tells us that including imputed values tends to increase the daily step count values.
We can clearly see the effect of including imputed data on our daily step counts by comparing the histogram of daily step counts for the original data frame, `CleanDailySteps`, with the histogram created using `ImputedDailySteps`. See below.
```{r warning=FALSE}
p1 <- ggplot(CleanDailySteps, aes(x=DailyStepCount)) +
geom_histogram(bins = 20, fill = "navajowhite", color = "midnightblue") +
labs(title = "Histogram of Daily Step Counts Using Cleaned Data, 20 Bins", y = "Count", x = "Total Steps Taken / Day") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(panel.background = element_rect(fill = "lightskyblue1"),
plot.background = element_rect(fill = "lightskyblue1"),
panel.ontop = FALSE) +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "red4",
size = 0.75,
linetype = 2),
panel.grid.minor.y = element_line(color = "red4",
size = 0.25,
linetype = 2))
p2 <- ggplot(ImputedDailySteps, aes(x=DailyStepCount)) +
geom_histogram(bins = 20, fill = "navajowhite", color = "midnightblue") +
labs(title = "Histogram of Daily Step Counts Using Imputed Data, 20 Bins", y = "Count", x = "Total Steps Taken / Day") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(panel.background = element_rect(fill = "lightskyblue1"),
plot.background = element_rect(fill = "lightskyblue1"),
panel.ontop = FALSE) +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "red4",
size = 0.75,
linetype = 2),
panel.grid.minor.y = element_line(color = "red4",
size = 0.25,
linetype = 2))
gridExtra::grid.arrange(p1, p2)
```
We can see that the **impact of imputing missing data on the estimates of the total daily number of steps** is to decrease the number of days in which the daily step count is $0$, and instead many of these formerly-$0$ days become days clustered approximately aroun the center of the data.
<h2 align="center">**Differences in activity patterns between weekdays and weekends**</h2>
First, we will create a new factor variable in our dataset with two levels, weekday and weekend, indicating whether a given date is a weekday or a weekend day.
First, looking at our dataset `imputed`, we see that the `date` column is currently being considered a character string column, and we want it as a dates column. Let's create a new dataset called imputed_dates in which `date` will be seen as date data.
```{r}
glimpse(imputed)
imputed_dates <- imputed
imputed_dates <- imputed_dates %>%
mutate_at(2,as.Date.character)
glimpse(imputed_dates)
```
Now that `date` is considered a date vector, we can create a vector of the same number of rows as actdata (n = 17,568) telling us the name of the day of the week associated with each row, then create a logical vector for TRUE if the day of the week is a weekday and FALSE if a weekend, convert that logical vector into a factor vector with ifelse(), and finally bind that factor vector to `imputed_dates`.
```{r}
daysoftheweek <- weekdays(imputed_dates[[2]])
day_type <- daysoftheweek %in% c("Monday","Tuesday","Wednesday","Thursday","Friday")
day_type <- as.factor(ifelse(day_type, "weekday", "weekend"))
imputed_dates <- cbind(imputed_dates, day_type)
glimpse(imputed_dates)
levels(imputed_dates$day_type)
```
Having done that, we now have to process our new raw dataset `imputed_dates` to collect and group the data by intervals, and finally we can make a panel plot containing a time-series plot of the 5-minute intervals on the x-axis and the average number of steps taken, averaged across all weekdays or weekendays, on the y-axis.
```{r}
imputed_dates %>% filter(day_type == "weekday") %>%
select(steps, interval) -> imputed_weekdays
imputed_dates %>% filter(day_type == "weekend") %>%
select(steps, interval) -> imputed_weekends
nrow(imputed_weekdays) + nrow(imputed_weekends) == nrow(actdata)
imputed_weekdays %>%
group_by(interval) %>%
summarize(TotSteps = sum(steps)) %>%
mutate(AvgSteps = TotSteps/61) -> imputed_weekdays_dailycounts
imputed_weekends %>%
group_by(interval) %>%
summarize(TotSteps = sum(steps)) %>%
mutate(AvgSteps = TotSteps/61) -> imputed_weekends_dailycounts
```
Now we can finally create our plots.
```{r}
p9 <- ggplot(imputed_weekdays_dailycounts, aes(x = interval, y = AvgSteps)) +
geom_line(color = "red2", linetype = 1) +
labs(title = "Average Steps Taken During Each \nFive Minute Interval Across All Days \n on *Weekdays*", y = "Average STeps Taken", x = "Time of 5-Minute Interval During the Day") +
theme(panel.background = element_rect(fill = "lavenderblush1"),
plot.background = element_rect(fill = "lavenderblush1"),
panel.ontop = FALSE) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(panel.grid.major.x = element_line(color = "lightsteelblue4"),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "lightsteelblue4",
size = 0.75,
linetype = 2),
panel.grid.minor.y = element_blank())
p10 <- ggplot(imputed_weekends_dailycounts, aes(x = interval, y = AvgSteps)) +
geom_line(color = "red2", linetype = 1) +
labs(title = "Average Steps Taken During Each \nFive Minute Interval Across All Days \n on *Weekends*", y = "Average STeps Taken", x = "Time of 5-Minute Interval During the Day") +
theme(panel.background = element_rect(fill = "lavenderblush1"),
plot.background = element_rect(fill = "lavenderblush1"),
panel.ontop = FALSE) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(panel.grid.major.x = element_line(color = "lightsteelblue4"),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "lightsteelblue4",
size = 0.75,
linetype = 2),
panel.grid.minor.y = element_blank())
gridExtra::grid.arrange(p9,p10)
```
To make the data a little easier to interpret visually, we can try replacing `geom_line` with `geom_smooth` with a low `span = ` setting.
```{r}
p11 <- ggplot(imputed_weekdays_dailycounts, aes(x = interval, y = AvgSteps)) +
geom_smooth(color = "red2", linetype = 1, span = 0.125) +
labs(title = "Average Steps Taken During Each \nFive Minute Interval Across All Days \n on *Weekdays*, Smoothed", y = "Average STeps Taken", x = "Time of 5-Minute Interval During the Day") +
theme(panel.background = element_rect(fill = "lavenderblush1"),
plot.background = element_rect(fill = "lavenderblush1"),
panel.ontop = FALSE) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(panel.grid.major.x = element_line(color = "lightsteelblue4"),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "lightsteelblue4",
size = 0.75,
linetype = 2),
panel.grid.minor.y = element_blank())
p12 <- ggplot(imputed_weekends_dailycounts, aes(x = interval, y = AvgSteps)) +
geom_smooth(color = "red2", linetype = 1, span = 0.125) +
labs(title = "Average Steps Taken During Each \nFive Minute Interval Across All Days \n on *Weekends*, Smoothed", y = "Average STeps Taken", x = "Time of 5-Minute Interval During the Day") +
theme(panel.background = element_rect(fill = "lavenderblush1"),
plot.background = element_rect(fill = "lavenderblush1"),
panel.ontop = FALSE) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(panel.grid.major.x = element_line(color = "lightsteelblue4"),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "lightsteelblue4",
size = 0.75,
linetype = 2),
panel.grid.minor.y = element_blank())
gridExtra::grid.arrange(p11,p12)
```
Looking at these plots, we can see that, on weekdays, the source of our data starts becoming active (at least insofar as they start taking measured steps) a little earlier, has a generally lower average level of activity during the day, and stops being active earlier. This intuitively matches with the concept that people might want to stay up later, potentially going out to various social or entertainment activities, on the weekends. In the future, it might make more sense for the sake of this analysis to also consider counting Friday as a weekend day, as the following day, Saturday, is also a weekend day, and we might expect Friday night's activity levels to be different from Monday through Thursday night's activity levels.