---
title: 'Cyclistic and Data Visualization: "Advanced, Straightforward, and Peeled"
(Case Study)'
author: "Praveen Choragudi"
date: "`r Sys.Date()`"
output: html_document
---
## Cyclistic Bikes Full Year Analysis from Q4 of 2021 to Q3 of 2022
This analysis is based on Kevin Hartman's Divvy case study, "'Sophisticated, Clear, and Polished': Divvy and Data Visualization," which can be found at https://artscience.blog/home/divvy-dataviz-case-study. The goal of this script is to consolidate the downloaded Cyclistic data into a single dataframe and then conduct a simple analysis to help answer the key question: "How do members and casual riders use Cyclistic bikes differently?"
Welcome to the case study on the bike-sharing programme of Cyclistic, a fictional company. We will use the steps of the data analysis process (ask, prepare, process, analyse, communicate, and act) to answer the important business questions. The Case Study Roadmap tables, which include guiding questions and key tasks, can help you stay on track.
Load the required packages:
* tidyverse for data import and wrangling
* lubridate for date functions
* ggplot2 for visualization
```{r}
library(tidyverse) #helps wrangle data
library(lubridate) #helps wrangle date attributes
library(ggplot2) #helps visualize data
getwd() #displays your working directory
```
### Preparing quarterly data by merging multiple CSV files
#### Preparing the file for Q4 of 2021
```{r}
library(data.table)
setwd("~/Desktop/Cyclistic_Bikes/Cyclistic_Bike_Share/Data/Trips/2021")
files <- list.files(pattern = "\\.csv$") # match only files ending in .csv
temp <- lapply(files, fread, sep = ",")
data <- rbindlist(temp)
write.csv(data, file = "Cyclistic_Trips_2021_Q4.csv", row.names = FALSE)
```
#### Preparing the file for Q1 of 2022
```{r}
library(data.table)
setwd("~/Desktop/Cyclistic_Bikes/Cyclistic_Bike_Share/Data/Trips/2022/Q1")
files <- list.files(pattern = "\\.csv$") # match only files ending in .csv
temp <- lapply(files, fread, sep = ",")
data <- rbindlist(temp)
write.csv(data, file = "Cyclistic_Trips_2022_Q1.csv", row.names = FALSE)
```
#### Preparing the file for Q2 of 2022
```{r}
library(data.table)
setwd("~/Desktop/Cyclistic_Bikes/Cyclistic_Bike_Share/Data/Trips/2022/Q2")
files <- list.files(pattern = "\\.csv$") # match only files ending in .csv
temp <- lapply(files, fread, sep = ",")
data <- rbindlist(temp)
write.csv(data, file = "Cyclistic_Trips_2022_Q2.csv", row.names = FALSE)
```
#### Preparing the file for Q3 of 2022
```{r}
library(data.table)
setwd("~/Desktop/Cyclistic_Bikes/Cyclistic_Bike_Share/Data/Trips/2022/Q3")
files <- list.files(pattern = "\\.csv$") # match only files ending in .csv
temp <- lapply(files, fread, sep = ",")
data <- rbindlist(temp)
write.csv(data, file = "Cyclistic_Trips_2022_Q3.csv", row.names = FALSE)
```
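The four quarterly chunks above repeat the same merge-and-write pattern. As a sketch, they could be consolidated into a small helper function (shown here in base R; the function name and arguments are illustrative, and `data.table::fread` could be substituted for speed on large files):

```{r}
# Hypothetical helper: merge every CSV in `dir` into one data frame
# and write the result to `out_file`.
merge_quarter <- function(dir, out_file) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  data <- do.call(rbind, lapply(files, read.csv))
  write.csv(data, out_file, row.names = FALSE)
  invisible(data)
}

# Example call (the path is an assumption):
# merge_quarter("~/Desktop/Cyclistic_Bikes/Cyclistic_Bike_Share/Data/Trips/2022/Q1",
#               "Cyclistic_Trips_2022_Q1.csv")
```

Calling the helper once per quarter also avoids the repeated `setwd()` calls, which in any case only apply within a single knitr chunk.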
## STEP 1: COLLECT DATA
```{r}
# Load the merged quarterly datasets (adjust the paths if the merged
# files were written to a different directory in the chunks above)
q4_2021 <- read_csv("Cyclistic_Trips_2021_Q4.csv")
q1_2022 <- read_csv("Cyclistic_Trips_2022_Q1.csv")
q2_2022 <- read_csv("Cyclistic_Trips_2022_Q2.csv")
q3_2022 <- read_csv("Cyclistic_Trips_2022_Q3.csv")
```
## STEP 2: WRANGLING DATA AND COMBINING INTO A SINGLE FILE
#### Comparing the column names of each file. While the names don't have to be in the same order, they DO need to match perfectly before we can use a command to join the files into one.
```{r}
colnames(q4_2021)
colnames(q1_2022)
colnames(q2_2022)
colnames(q3_2022)
```
#### Inspecting the dataframes and looking for incongruencies
```{r}
str(q4_2021)
str(q1_2022)
str(q2_2022)
str(q3_2022)
```
#### Stacking the individual quarterly data frames into one big data frame
```{r}
all_trips <- bind_rows(q4_2021, q1_2022, q2_2022, q3_2022)
```
#### Removing the latitude and longitude fields, as they are not needed for this analysis
```{r}
all_trips <- all_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng))
```
## STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS
### Inspecting the new table that has been created
```{r}
colnames(all_trips) #List of column names
nrow(all_trips) #How many rows are in data frame?
dim(all_trips) #Dimensions of the data frame?
head(all_trips) #See the first 6 rows of data frame. Also tail(all_trips)
str(all_trips) #See list of columns and data types (numeric, character, etc)
summary(all_trips) #Statistical summary of data. Mainly for numerics
```
##### Adding columns that list the date, month, day, and year of each ride. This allows us to aggregate ride data by month, day, or year; before these operations we could only aggregate at the ride level. More on date formats in R can be found at this [link](https://www.statmethods.net/input/dates.html).
```{r warning=FALSE}
all_trips$date <- as.Date(all_trips$started_at) # the default format is yyyy-mm-dd
all_trips$month <- format(all_trips$date, "%m")
all_trips$day <- format(all_trips$date, "%d")
all_trips$year <- format(all_trips$date, "%Y")
all_trips$day_of_week <- format(all_trips$date, "%A")
```
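Since lubridate is already loaded, the same columns could also be derived with its accessor functions; a minimal sketch on a single timestamp (the example date is an assumption):

```{r}
library(lubridate)

ts <- ymd_hms("2022-07-04 10:30:00")
month(ts) # 7
day(ts)   # 4
year(ts)  # 2022
wday(ts, label = TRUE, abbr = FALSE) # full weekday name, as an ordered factor
```

One advantage: `wday(..., label = TRUE)` returns an ordered factor, so the manual reordering of `day_of_week` done later in Step 4 would not be needed.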
#### Adding a "ride_length" [calculation](https://stat.ethz.ch/R-manual/R-devel/library/base/html/difftime.html) to all_trips (in seconds)
```{r}
# units = "secs" guarantees the difference is reported in seconds rather
# than whatever unit difftime() picks automatically
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs")
```
### Inspecting the structure of the columns
```{r}
str(all_trips)
```
### Converting "ride_length" from difftime to numeric so we can run calculations on the data
```{r}
all_trips$ride_length <- as.numeric(all_trips$ride_length)
is.numeric(all_trips$ride_length)
```
#### Removing "bad" data. The [dataframe](https://www.datasciencemadesimple.com/delete-or-drop-rows-in-r-with-conditions-2/) includes entries with a zero or negative ride_length, for example when bikes were taken out of docks and checked for quality by Cyclistic. We will create a new version of the dataframe (v2) since data is being removed.
```{r}
all_trips_v2 <- all_trips[!(all_trips$ride_length <= 0), ]
```
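One subtlety of this base-R filter worth noting: logical subsetting keeps rows where the condition evaluates to NA (they come through as all-NA rows), so if ride_length could contain missing values they should be excluded explicitly. A small illustration with made-up values:

```{r}
# Toy data frame with one missing ride length
df <- data.frame(ride_length = c(-5, NA, 60))

# The pattern used above keeps the NA row (filled with NAs): 2 rows
nrow(df[!(df$ride_length <= 0), ])

# Excluding NAs explicitly keeps only genuinely positive rides: 1 row
nrow(df[!is.na(df$ride_length) & df$ride_length > 0, ])
```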
## STEP 4: CONDUCT DESCRIPTIVE ANALYSIS
### Descriptive analysis on ride_length (all figures in seconds)
```{r}
mean(all_trips_v2$ride_length) #straight average (total ride length / rides)
median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengths
max(all_trips_v2$ride_length) #longest ride
min(all_trips_v2$ride_length) #shortest ride
```
#### We can condense the four lines above to one line using summary() on the specific attribute
```{r}
summary(all_trips_v2$ride_length)
```
#### Comparing members and casual users
```{r}
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
```
#### We can see the average ride time by each day for members vs casual users
```{r}
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
```
#### Notice that the days of the week are out of order. Let's fix that.
```{r}
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
```
#### Now, let's run the average ride time by each day for members vs casual users
```{r}
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
```
```{r}
str(all_trips_v2)
```
#### Analyzing ridership data by type and weekday
```{r}
all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>% # groups by user type and weekday
  summarise(number_of_rides = n(), # calculates the number of rides
            average_duration = mean(ride_length)) %>% # calculates the average duration
  arrange(member_casual, day_of_week) # sorts
```
#### Let's visualize the number of rides by rider type
```{r}
all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, day_of_week) %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")
```
#### Let's create a visualization for average duration
```{r}
all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, day_of_week) %>%
  ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")
```
## STEP 5: EXPORT SUMMARY FILE FOR FURTHER ANALYSIS
```{r}
counts <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
write.csv(counts, file = "~/Desktop/Cyclistic_Bikes/avg_ride_length.csv", row.names = FALSE)
chart1 <- read.csv("~/Desktop/Cyclistic_Bikes/avg_ride_length.csv")
colnames(chart1) <- c("User_Type", "Day_of_the_Week", "Trip_Duration_in_Seconds")
colnames(chart1)
ggplot(data = chart1) +
  geom_point(mapping = aes(x = Day_of_the_Week, y = Trip_Duration_in_Seconds, color = User_Type, shape = User_Type)) +
  labs(title = "Usage by Members and Casual Riders",
       subtitle = "Average trip duration by user type and day of the week",
       caption = "Data is from Q4 2021 and Q1-Q3 2022")
```
###### THANK YOU