-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcyclistic_prj.Rmd
359 lines (279 loc) · 10.9 KB
/
cyclistic_prj.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
---
title: 'Cyclistic Bike Share: Capstone Project'
author: "Hoang Nguyen"
date: "2023-05-12"
output:
html_document:
toc: yes
number_sections: yes
---
# Background
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
# Ask
**Business Questions:**
How do annual members and casual riders use Cyclistic bikes differently?
**Key Stakeholders:**
* Lily Moreno
* Cyclistic marketing analytics team
* Cyclistic executive team
# Prepare
The data include information about Cyclistic's historical trip and customers' patterns in bike usage. It is obtained between April 2022 and March 2023 (12 months).
Link: [Data Source](https://divvy-tripdata.s3.amazonaws.com/index.html)
# Process
R will be the primary tool for data manipulation, analysis, and visualization.
## Data Preparation
**First, we need to import the tidyverse, lubridate, and ggplot2 packages to work with our data more efficiently.**
```{r}
library(tidyverse)
library(lubridate)
library(ggplot2)
```
**Set working directory:**
```{r}
setwd("/Users/hoangnguyen/Documents/Projects/Cyclistic_Project")
```
**Import the data from April 2022 to March 2023:**
```{r}
apr2022_trip <- read_csv('202204-divvy-tripdata.csv',show_col_types = FALSE)
may2022_trip <- read_csv('202205-divvy-tripdata.csv',show_col_types = FALSE)
jun2022_trip <- read_csv('202206-divvy-tripdata.csv',show_col_types = FALSE)
jul2022_trip <- read_csv('202207-divvy-tripdata.csv',show_col_types = FALSE)
aug2022_trip <- read_csv('202208-divvy-tripdata.csv',show_col_types = FALSE)
sep2022_trip <- read_csv('202209-divvy-publictripdata.csv',show_col_types = FALSE)
oct2022_trip <- read_csv('202210-divvy-tripdata.csv',show_col_types = FALSE)
nov2022_trip <- read_csv('202211-divvy-tripdata.csv',show_col_types = FALSE)
dec2022_trip <- read_csv('202212-divvy-tripdata.csv',show_col_types = FALSE)
jan2023_trip <- read_csv('202301-divvy-tripdata.csv',show_col_types = FALSE)
feb2023_trip <- read_csv('202302-divvy-tripdata.csv',show_col_types = FALSE)
mar2023_trip <- read_csv('202303-divvy-tripdata.csv',show_col_types = FALSE)
```
**Check column names before merging the data**
```{r}
colnames(apr2022_trip)
colnames(may2022_trip)
colnames(jun2022_trip)
colnames(jul2022_trip)
colnames(aug2022_trip)
colnames(sep2022_trip)
colnames(oct2022_trip)
colnames(nov2022_trip)
colnames(dec2022_trip)
colnames(jan2023_trip)
colnames(feb2023_trip)
colnames(mar2023_trip)
```
**Since all the data have the same column names, hence we can merge them together**
```{r}
trip_data = rbind(apr2022_trip, may2022_trip, jun2022_trip, jul2022_trip, aug2022_trip, sep2022_trip, oct2022_trip, nov2022_trip, dec2022_trip, jan2023_trip, feb2023_trip, mar2023_trip)
```
**Examine the data**
```{r}
head(trip_data)
str(trip_data)
nrow(trip_data)
```
## Data cleaning
**Let's create date, month, day, year, and day of week columns so that we can aggregate, analyze, and visualize our data later on.**
```{r}
trip_data$date <- as.Date(trip_data$started_at)
trip_data$month <- format(as.Date(trip_data$date), "%m")
trip_data$day <- format(as.Date(trip_data$date), "%d")
trip_data$year <- format(as.Date(trip_data$date), "%Y")
trip_data$day_of_week <- format(as.Date(trip_data$date), "%A")
```
**Remove start_lat, start_lng, end_lat, and end_lng from the dataframe since we do not need it for analysis.**
```{r}
trip_data <- trip_data %>%
select(-c(start_lat,start_lng,end_lat,end_lng))
```
**Count the total number of missing values**
```{r}
sum(is.na(trip_data))
```
*There are a total of 3471075 missing values.*
**Remove all the rows with missing values and assign to a new variable**
```{r}
trip_data_clean <- na.omit(trip_data)
```
*Before omitting all the rows with missing values, we have a total of 5803720 observations or rows. Now, we only have 4482362 rows.*
**Check for duplicated rows**
```{r}
nrow(trip_data_clean[duplicated(trip_data_clean), ])
```
*There are no duplicated rows*
**Create a ride length column in minutes and round to two decimal places**
```{r}
trip_data_clean$ride_length <- round(difftime(trip_data_clean$ended_at,trip_data_clean$started_at,units='mins'),2)
```
**Convert the data type of ride length to numeric for future analysis**
```{r}
trip_data_clean$ride_length <- as.numeric(as.character(trip_data_clean$ride_length))
```
**Count the number of negative ride length values**
```{r}
nrow(trip_data_clean[trip_data_clean$ride_length<0,])
```
**Order the day of week column**
```{r}
trip_data_clean$day_of_week <- ordered(trip_data_clean$day_of_week,
levels=c("Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday"))
```
**Remove rows with negative and zero ride length values**
```{r}
trip_data_clean <- trip_data_clean[!(trip_data_clean$ride_length<0 | trip_data_clean$ride_length==0),]
```
*After removing rows with negative and zero values, we now have 4482044 rows.*
# Analyze
Analyzing the data will help us to discover insights and trends on how annual and casual riders use bike differently.
## Descriptive Analysis
**Summarize the data (calculate the mean, median, min, max, first quartile, and third quartile)**
```{r}
summary(trip_data_clean$ride_length)
```
## Data aggregation
**Compare annual and casual riders in terms of average, lower, upper, and difference in ride lengths**
```{r}
trip_data_clean %>%
group_by(member_casual) %>%
summarise(lower_ride_length = min(ride_length), average_ride_length = mean(ride_length),
upper_ride_length = max(ride_length), difference_ride_length = max(ride_length) - min(ride_length))
```
*Casual riders, on average, ride bikes roughly 11 more minutes than annual riders.*
**Count each member type (annual vs. casual)**
```{r}
trip_data_clean %>%
group_by(member_casual) %>%
summarise(total_count = n())
```
**Count number of each bike type**
```{r}
trip_data_clean %>%
group_by(rideable_type) %>%
summarise(total_count = n())
```
**Count the number for each type of bike for each user type**
```{r}
trip_data_clean %>%
group_by(member_casual, rideable_type) %>%
summarise(total_count = n())
```
**Count number of rides by month and year**
```{r}
trip_data_clean %>%
group_by(month,year) %>%
summarise(total_count = n()) %>%
arrange(year)
```
**Count number of rides by weekday based on members type**
```{r}
trip_data_clean %>%
group_by(member_casual, day_of_week) %>%
summarise(total_count = n())
```
**Average ride length by bike type**
```{r}
trip_data_clean %>%
group_by(rideable_type) %>%
summarise(avg_ride_length = mean(ride_length))
```
**Average ride length by members type**
```{r}
trip_data_clean %>%
group_by(member_casual) %>%
summarise(avg_ride_length = mean(ride_length))
```
**Average ride length by members and bikes type**
```{r}
trip_data_clean %>%
group_by(member_casual, rideable_type) %>%
summarise(avg_ride_length = mean(ride_length))
```
**Average ride length by members type and day of week**
```{r}
trip_data_clean %>%
group_by(member_casual, day_of_week) %>%
summarise(avg_ride_length = mean(ride_length))
```
**Average ride length by members type and years**
```{r}
trip_data_clean %>%
group_by(member_casual,year) %>%
summarise(avg_ride_length = mean(ride_length))
```
# Share
Sharing the data will implement data visualizations to understand the data more effectively and gain insights visually.
## Data Visualizations
**Disable scientific notations**
```{r}
options(scipen=999)
```
**Create a new variable to calculate the percentage of annual and casual members for pie plot**
```{r}
member_perc <- trip_data_clean %>%
group_by(member_casual) %>%
summarise(count = n()) %>%
mutate(perc=count/sum(count)) %>%
ungroup()
```
**Create a pie chart that show the percentage distribution of member types**
```{r}
ggplot(member_perc, aes(x="", y=perc, fill=member_casual)) +
geom_bar(stat="identity", width=1,color="white") +
coord_polar("y", start=0) +
geom_text(aes(label = scales::percent(perc)),
position = position_stack(vjust = 0.5)) +
labs(title='Percentage Distribution of Annual and Casual Riders')+
theme_void()
```
**Total number of rides by bike and member types**
```{r}
trip_data_clean %>%
group_by(member_casual, rideable_type) %>%
summarise(total_count = n()) %>%
ggplot(aes(x=rideable_type, y = total_count, fill=member_casual)) +
geom_bar(position="dodge", stat="identity",color="black",width=0.7) +
labs(title = "Total Number of Rides by Bike and Member Types", x='Type of Bikes', y='Number of Rides')+
theme_classic()
```
**Total number of rides by member types and day of week**
```{r}
trip_data_clean %>%
group_by(member_casual, day_of_week) %>%
summarise(total_count = n()) %>%
ggplot(aes(x=day_of_week, y=total_count, fill=member_casual)) +
geom_bar(position="dodge", stat='identity',color="black",width=0.7) +
labs(title = "Total Number of Rides by Member Types and Day of Week",x='Day of Week',y='Number of Rides')+
theme_classic()
```
**Total number of rides by member types and months**
```{r}
trip_data_clean %>%
group_by(member_casual, month) %>%
summarise(total_count = n()) %>%
ggplot(aes(x=month, y = total_count, fill=member_casual)) +
geom_bar(position="dodge", stat="identity",color="black",width=0.7)+
labs(title = 'Total Number of Rides by Member Types and Months',x='Month',y='Number of Rides')+
theme_classic()
```
**Average ride duration spent by annual and casual members**
```{r}
trip_data_clean %>%
group_by(member_casual) %>%
summarise(avg_ride_length = mean(ride_length)) %>%
ggplot(aes(x=member_casual, y=avg_ride_length, fill=member_casual)) +
geom_bar(position="dodge", stat="identity",color="black",width=0.5) +
labs(title = 'Average Ride Duration Spent by Member Types',x='Type of Members',y='Average Ride Duration (mins)')+
theme_classic()
```
**Average ride duration spent on each bike between annual and casual members**
```{r}
trip_data_clean %>%
group_by(member_casual,rideable_type) %>%
summarise(avg_ride_length = mean(ride_length)) %>%
ggplot(aes(x=rideable_type, y=avg_ride_length, fill=rideable_type))+
geom_bar(position="dodge", stat="identity",color="black",width=0.5)+
facet_wrap(~member_casual)+
labs(title = 'Average Ride Duration Spent by Member and Bike Types', x='Type of Bikes', y='Average Ride Duration (mins)')+
theme_bw()
```