options(scipen = 1, digits = 7)
library(data.table)
data <- read.csv("activity.csv")
data$day <- as.Date(data$date, "%Y-%m-%d")
data$weekday <- weekdays(data$day)
weekday.list <- unique(data$weekday)
day.type <- c(rep("Weekday", 5), rep("Weekend", 2))
data$dayType <- day.type[match(data$weekday, weekday.list)]
data.noNA <- data[!is.na(data$steps),]
Sum the steps for each day.
total.steps.by.day <- xtabs(steps ~ day, data=data.noNA)
Create a histogram of the mean number of steps per day.
hist(total.steps.by.day, main="Histogram of Total Number of Steps per Day", xlab="")
Calculate the mean of the number of steps per day.
mean(total.steps.by.day)
## [1] 10766.19
Calculate the median of the number of steps per day.
median(as.vector(total.steps.by.day))
## [1] 10765
Take the average number of steps by the interval across days.
library(plyr)
avgsteps <- ddply(data.noNA, .(interval), summarize, avg=mean(steps))
with(avgsteps, plot(interval, avg, type="l",
main="Average Daily Activity Pattern",
xlab="Interval (5 minute increment)",
ylab="Average Number of Steps"))
Interval with the maximum number of steps on average across the day.
avgsteps[avgsteps$avg == max(avgsteps$avg),"interval"]
## [1] 835
Total number of mission values in original data.
sum(is.na(data$steps))
## [1] 2304
For simplicity replace missing values with the average daily number of steps for that interval. Round the average number of steps to a whole value since there are not partial steps.
nareplaced <- merge(data, avgsteps, by="interval")
na.idx <- which(is.na(nareplaced$steps))
nareplaced$steps[na.idx] <- round(nareplaced[na.idx,"avg"],0)
Find the number of steps per day for new data set.
intervalsteps <- ddply(nareplaced, .(day), summarize, steps=sum(steps))
Histogram with the total number of steps taken per day
hist(intervalsteps$steps, main="Total Number of Steps per Day by 5-Minute Interval (NAs Replaced)", xlab="")
Calculate the mean of the number of steps per day.
mean(intervalsteps$steps)
## [1] 10765.64
Calculate the median of the number of steps per day.
median(intervalsteps$steps)
## [1] 10762
The overall shape of the distribution remained the same with a higher peak. The mean was almost the same which makes sense given that the rounded average of each interval was used to fill in the corresponding interval with a missing value. The median moved only slightly due to the addition of additional values since the mean and median of the observations with step values were very close.
Convert the day type variable created above during the initial processing to a factor. Compute average steps per interval using new data then plot the data to compare weekdays with weekends.
library(ggplot2)
nareplaced$dayType <- factor(nareplaced$dayType)
intervalsteps <- aggregate(steps ~ interval + dayType, data=nareplaced, FUN=mean)
ggplot(intervalsteps, aes(interval, steps)) + geom_line() + xlab("Interval") + ylab("Number of steps") + facet_grid(dayType ~ .)