forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
174 lines (130 loc) · 6.47 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
---
title: "Reproducible Research: Peer Assessment 1"
output:
html_document:
keep_md: true
---
## Loading and preprocessing the data
The following libraries will be used to complete this analysis:
```{r libraries,message=FALSE}
library(dplyr)
library(lubridate)
library(ggplot2)
```
We begin by loading the tracker data. This data is in a clean CSV, and we load it
straight from the compressed data file provided. We also parse the dates into proper
dates at this point.
```{r load_data}
data <- read.csv(unz("activity.zip", "activity.csv"), stringsAsFactors = FALSE)
data$date <- ymd(data$date)
```
## What is mean total number of steps taken per day?
We are interested in the number of steps taken every day. In this case, it's acceptable to
ignore missing values, so we do so. For the days where there are NO values, we will still
count "zero steps per day." While this is less accurate than "ignoring those days entirely"
would be, it is also more stable, because the addition/change of a single data point is not
likely to perturn the final result.
We compute a basic data frame of days to steps-per-day, and show what the data look like in
a histogram. We also compute the median and the mean.
```{r steps_per_day}
# Function to compute steps-per-day aggregate
aggregateStepsPerDay <- function(raw_data) {
raw_data %>%
select(date, steps) %>%
group_by(date) %>%
summarise_each(funs(sum(., na.rm = TRUE)))
}
# Compute number of steps per day.
steps_per_day <- aggregateStepsPerDay(data)
# Compute the mean and median
mean_steps_per_day <- mean(steps_per_day$steps)
median_steps_per_day <- median(steps_per_day$steps)
# Plot histogram of number of steps per day.
ggplot(data = steps_per_day, aes(x=steps)) + geom_histogram(binwidth = 2000) +
labs(x = "Number of Steps", y = "Count") +
labs(title = "Histogram of Distribution of Steps Taken Per Day")
```
The median and mean are:
- Mean steps per day: `r format(mean_steps_per_day, scientific=FALSE)`
- Median steps per day: `r format(median_steps_per_day, scientific=FALSE)`
## What is the average daily activity pattern?
We now compute the average activity pattern across all days, by interval. This will show, on average,
at what time of the day our subject is most active.
```{r average_daily_activity_pattern}
# Compute average steps per interval
avg_steps_per_interval <- data %>%
select(interval, steps) %>%
group_by(interval) %>%
summarise_each(funs(mean(., na.rm = TRUE)))
# Compute interval with the max activity on average.
max_avg_activity_interval_value <- max(avg_steps_per_interval$steps)
max_avg_activity_interval <- avg_steps_per_interval[which.max(avg_steps_per_interval$steps), ]$interval
# Plot the result, showing average time over the day
ggplot(data = avg_steps_per_interval, aes(interval, steps)) + geom_line() +
labs(x = "5-Minute Time Interval", y = "Average Activity (steps)") +
labs(title = "Average Activity Over Time Intervals")
```
The interval with the max activity is `r max_avg_activity_interval`, with `r format(max_avg_activity_interval_value, digits=5)` steps.
## Imputing missing values
There are a number of missing values in the data set. In this section, we look into the situation, and see how robust our
previous analysis was, where we ignored all missing values.
The data set contains `r sum(is.na(data$steps))` missing values.
We will impute the missing values by substituting in the mean for that 5-minute interval, which we've
already calculated above.
```{r impute_missing}
# Impute the values, as follows:
# 1) Left-join with the average steps per interval data, on interval. This will split "steps" into
# "steps.x" and "steps.y"
# 2) Create a new value "steps", which will be equal to "steps.x" if it's defined, and "steps.y" if
# "steps.x" is NA.
# 3) Drop the interstitial "steps.x" and "steps.y" variables.
imputed_data <- left_join(data, avg_steps_per_interval, by = "interval") %>%
mutate(steps = ifelse(is.na(steps.x), steps.y, steps.x)) %>%
select(steps, date, interval)
```
Now, recalculate the total steps per day, along with the mean and median, so we can compare to the old results.
```{r imputed_steps_per_day}
# Compute number of steps per day.
imputed_steps_per_day <- aggregateStepsPerDay(imputed_data)
# Compute the mean and median
imputed_mean_steps_per_day <- mean(imputed_steps_per_day$steps)
imputed_median_steps_per_day <- median(imputed_steps_per_day$steps)
# Plot histogram of number of steps per day.
ggplot(data = imputed_steps_per_day, aes(x=steps)) + geom_histogram(binwidth = 2000) +
labs(x = "Number of Steps", y = "Count") +
labs(title = "Histogram of Distribution of Steps Taken Per Day (with imputed missing values)")
```
The new median and mean are:
- Imputed mean steps per day: `r format(imputed_mean_steps_per_day, scientific=FALSE)`
- Imputed median steps per day: `r format(imputed_median_steps_per_day, scientific=FALSE)`
Recall the original values, for comparison:
- Original mean steps per day (from earlier): `r format(mean_steps_per_day, scientific=FALSE)`
- Original median steps per day (from earlier): `r format(median_steps_per_day, scientific=FALSE)`
## Are there differences in activity patterns between weekdays and weekends?
To separate out weekends from weekdays, we add a factor to represent whether the day in question is a weekday.
```{r weekday_factor}
weekdayFactor <- function(date) {
as.factor(
ifelse(
weekdays(date) %in% c("Saturday", "Sunday"),
"Weekend",
"Weekday"
)
)
}
imputed_data$weekday <- weekdayFactor(imputed_data$date)
```
To quickly visualize the differences (if any) between weekday and weekend activity, we plot average activity
across all days, for each interval, split by weekday/weekend levels.
```{r average_daily_activity_pattern_weekday_weekend}
# Compute average steps per interval, by weekday
weekday_avg_steps_per_interval <- imputed_data %>%
select(interval, weekday, steps) %>%
group_by(interval, weekday) %>%
summarise_each(funs(mean))
# Plot the result, showing average time over the day
ggplot(data = weekday_avg_steps_per_interval, aes(interval, steps)) + geom_line() +
facet_grid(weekday ~ .) +
labs(x = "5-Minute Time Interval", y = "Average Activity (steps)") +
labs(title = "Average Activity Over Time Intervals")
```