---
title: "Bike Sharing System Data Analytics"
author: "Yash Prakash"
date: "31 August 2019"
output: html_document
---
# Introduction
Bike sharing systems are a means of renting bicycles in which membership, rental, and bike return are automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different one on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city.
# The Problem
In this project, we combine historical usage patterns with weather data to perform exploratory data analysis and multivariate analysis (hypothesis testing) on bike rental demand, and to attempt to forecast that demand accurately from the same variables.
## Loading the Dataset
```{r, include = FALSE}
train <- read.csv("train.csv")
test <- read.csv("test.csv")
```
## Viewing the features of the data
The various columns of the record of bike-sharing data are as follows:
```{r cars}
head(train)
```
## Looking at the features of the data:

- datetime - hourly date + timestamp
- season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor a holiday
- weather
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals

The dimensions of the training set are:
```{r}
dim(train)
```
```{r}
head(test)
```
The dimensions of the test set are:
```{r}
dim(test)
```
## Combining test with train
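The test set has no casual, registered or count columns, so they are added as NA placeholders before the two sets are stacked. This lets the feature engineering below be applied to both sets at once, while the NA placeholders keep the test rows out of any summary or plot of the rental counts.
```{r}
# The targets are unknown for the test rows, so fill them with NA (not 0,
# which would be treated as real observations and pull the averages down)
test$registered <- NA
test$casual <- NA
test$count <- NA
data <- rbind(train, test)
```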
The dimensions of the dataframe 'data' are now:
```{r}
dim(data)
```
The structure of the data:
```{r}
str(data)
```
Here, the only dependent variables are casual, registered and count (the total number of rentals).
## Data Visualisations
Plotting the histogram for every independent variable
```{r fig.height = 7, fig.width = 13}
par(mfrow=c(4,2))
par(mar = rep(2, 4))
hist(data$season)
hist(data$weather)
hist(data$humidity)
hist(data$holiday)
hist(data$workingday)
hist(data$temp)
hist(data$atemp)
hist(data$windspeed)
```
### Analysing probability distributions
1. For weather variable
```{r}
prop.table(table(data$weather))
```
Thus, weather category 1 (mostly clear weather) accounts for the largest share of the hourly records.
2. For working day variable
```{r}
prop.table(table(data$workingday))
```
Thus, working days account for more of the hourly records than non-working days.
3. Holiday or not
```{r}
prop.table(table(data$holiday))
```
Thus, most of the hourly records fall on non-holidays.
4. Seasons variable
```{r}
prop.table(table(data$season))
```
Thus, the records are almost equally distributed across all 4 seasons.
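Note that prop.table(table(...)) gives the share of hourly records in each category, which is not the same as the number of bikes rented in that category. A minimal sketch on a made-up data frame (the name toy and its values are assumptions for illustration, not taken from train.csv) puts the two quantities side by side:

```r
# Made-up records: a weather code and a rental count per hourly record
toy <- data.frame(
  weather = c(1, 1, 1, 2, 2, 3),
  count   = c(120, 80, 100, 60, 40, 10)
)

# Share of hourly records that fall in each weather category
record_share <- prop.table(table(toy$weather))

# Mean number of rentals per record within each weather category
mean_rentals <- tapply(toy$count, toy$weather, mean)

print(record_share)
print(mean_rentals)
```

In this toy example category 1 holds half of the records and also has the highest mean count, but in general the two need not agree.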
### Converting the discrete variables into factor (season, weather, holiday, workingday)
```{r}
data$season<-as.factor(data$season)
data$weather<-as.factor(data$weather)
data$holiday<-as.factor(data$holiday)
data$workingday<-as.factor(data$workingday)
```
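Optionally, readable labels can be attached during the conversion. A small sketch on a made-up vector, following the season coding listed in the features section (the names season_code and season_fac are only for illustration):

```r
# Made-up season codes, coded as 1 = spring, 2 = summer, 3 = fall, 4 = winter
season_code <- c(1, 2, 3, 4, 1, 3)

# factor() maps each code in `levels` to the label at the same position
season_fac <- factor(season_code, levels = 1:4,
                     labels = c("spring", "summer", "fall", "winter"))

print(levels(season_fac))
print(table(season_fac))
```

The analysis below keeps the numeric codes as factor levels and adds the labels only in the plot legends.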
Viewing the structure of the new data:
```{r}
str(data)
```
Here:
The categorical variables are: season, holiday, workingday and weather. The datetime column is a timestamp, from which the hour and the day are extracted below.
The numerical variables are: temp, atemp, humidity and windspeed.
### Hypothesis Testing (multivariate analysis of the data)
#### Hypothesis 1. Hourly Trend
Demand should be high during office commuting hours. Early morning and late evening may follow a different trend (cyclists), and demand should be low from 10:00 pm to 4:00 am.
The hour is extracted from the datetime column:
```{r}
data$hour<-substr(data$datetime,12,13)
data$hour<-as.factor(data$hour)
```
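The substr() call relies on the hour sitting at characters 12 and 13 of each timestamp. A minimal sketch on two made-up timestamps (assumed to follow the same "YYYY-MM-DD HH:MM:SS" layout), together with an alternative that parses the string as a date-time first:

```r
# Made-up timestamps in the assumed "YYYY-MM-DD HH:MM:SS" layout
ts <- c("2011-01-01 00:00:00", "2011-01-20 17:00:00")

# Positional extraction: characters 12 and 13 hold the hour
hour_substr <- substr(ts, 12, 13)

# Alternative: parse to POSIXct, then format the hour back out
parsed <- as.POSIXct(ts, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
hour_parsed <- format(parsed, "%H")

print(hour_substr)
print(hour_parsed)
```

Both approaches return the hour as a two-character string. The parsed version does not depend on character positions, at the cost of an extra conversion.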
The combined data is split back by day of the month: days 1 to 19 form the training set and days 20 onwards form the test set.
```{r}
train<-data[as.integer(substr(data$datetime,9,10))<20,]
test<-data[as.integer(substr(data$datetime,9,10))>19,]
```
Visualising it:
```{r fig.height = 7, fig.width = 10}
par(mfrow = c(1,1))
boxplot(train$count~train$hour,xlab="hour", ylab="count of users")
```
Thus, demand follows this trend:

1. High: 7-9 and 17-19 hours
2. Average: 10-16 hours
3. Low: 0-6 and 20-23 hours

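The three demand bands read off the boxplot can be written down as a derived variable. A minimal sketch, assuming hours numbered 0 to 23 (the names hour_num and demand_band are hypothetical and are not used elsewhere in this analysis):

```r
# All hours of the day as numbers
hour_num <- 0:23

# Assign each hour to the band observed in the boxplot
demand_band <- ifelse(hour_num %in% c(7:9, 17:19), "High",
               ifelse(hour_num %in% 10:16, "Average", "Low"))

# Number of hours that fall in each band
print(table(demand_band))
```

Such a variable could be used as a coarser stand-in for the 24-level hour factor.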
#### Hypothesis 2. Weekdays Trend
To see whether registered users use the bike system more on weekdays than on weekends.
```{r}
date <- substr(data$datetime,1,10)
days<-weekdays(as.Date(date))
data$day=days
```
Visualising it:
```{r}
boxplot(data$registered ~ data$day, xlab = "day", ylab = "count of registered users")
```
```{r}
boxplot(data$casual ~ data$day, xlab = "day", ylab = "count of casual users")
```
Thus, it can be seen that the casual users prefer to use the bikes more on weekends while registered users do so more on weekdays.
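The weekday versus weekend comparison can also be made explicit with a single flag. A minimal sketch on three made-up dates, using the ISO weekday number from format(), which does not depend on the locale the way weekdays() does (the names d and is_weekend are only for illustration):

```r
# Made-up dates: a Saturday, a Sunday and a Monday
d <- as.Date(c("2011-01-01", "2011-01-02", "2011-01-03"))

# "%u" gives the weekday as a number, 1 = Monday ... 7 = Sunday
iso_day <- format(d, "%u")

# Saturday (6) and Sunday (7) count as the weekend
is_weekend <- iso_day %in% c("6", "7")

print(data.frame(date = d, iso_day = iso_day, is_weekend = is_weekend))
```

Grouping the registered and casual counts by such a flag would give a two-box version of the plots above.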
#### Hypothesis 3: 'Rain' Trend:
People prefer to rent bikes in clear weather rather than in rain. For total users:
```{r}
boxplot(data$count ~ data$weather, xlab= "Weather conditions", ylab = "Count of total users")
```
For registered users
```{r}
boxplot(data$registered ~ data$weather, xlab= "Weather conditions", ylab = "Count of registered users")
```
For casual users
```{r}
boxplot(data$casual ~ data$weather, xlab= "Weather conditions", ylab = "Count of casual users")
```
#### Hypothesis 4: The correlation table of continuous variables
Variables like windspeed, temperature and humidity are compared with the registered, casual and total user counts.
Making a subset of the data and computing the correlation table:
```{r}
# Columns for the correlation table (named cor_data to avoid masking base::subset)
cor_data <- train[, c("registered", "casual", "count", "temp",
                      "humidity", "atemp", "windspeed")]
cor_sub <- cor(cor_data)
```
Visualising the correlations:
```{r}
library(corrplot)
corrplot(cor_sub, method = 'color')
```
The inferences are:
1. Variable atemp is highly correlated with temp.
2. Variable temp is positively correlated with the dependent variables (more strongly with casual than with registered).
3. Windspeed has a lower correlation with the dependent variables than temp and humidity.
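Inferences of this kind can also be read straight from the correlation matrix instead of from the colour plot. A minimal sketch on a made-up data frame (all values are assumptions chosen for illustration, not taken from the dataset):

```r
# Made-up observations; atemp is an exact linear function of temp
toy <- data.frame(
  temp      = c(10, 15, 20, 25, 30),
  humidity  = c(90, 80, 70, 60, 50),
  windspeed = c(5, 12, 7, 11, 8),
  count     = c(20, 50, 90, 120, 160)
)
toy$atemp <- 1.1 * toy$temp + 1

# Pearson correlation between every pair of columns
cm <- cor(toy)

# Correlation of each predictor with count, strongest first
with_count <- cm[c("temp", "atemp", "humidity", "windspeed"), "count"]
print(round(with_count[order(-abs(with_count))], 3))
```

Sorting by absolute value puts strong negative correlations, such as humidity in this toy data, next to the strong positive ones.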
### Multi-Variable Analysis
#### Looking over the count of Bikes Rented by hour and season
```{r fig.height = 7, fig.width = 13}
library(sqldf)
library(ggplot2)
# Get the average count of bikes rented by season and hour
season_summary_by_hour <- sqldf('select season, hour, avg(count) as count from data group by season, hour')
ggplot(data, aes(x = hour, y = count, color = season)) +
  geom_point(data = season_summary_by_hour, aes(group = season)) +
  geom_line(data = season_summary_by_hour, aes(group = season)) +
  ggtitle("Bikes Rented by Season and Hour") +
  scale_colour_hue('Season', breaks = levels(data$season), labels = c('spring', 'summer', 'fall', 'winter'))
```
Here, we see that the most bikes are rented between 17 and 19 hours in fall and summer, and the fewest in spring.
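The grouped averages do not have to come from sqldf. A minimal sketch of the same kind of summary with base R's aggregate(), run on a made-up data frame (the name toy and its values are assumptions for illustration):

```r
# Made-up records with a season, an hour and a rental count
toy <- data.frame(
  season = factor(c(1, 1, 1, 1, 2, 2)),
  hour   = factor(c("08", "08", "17", "17", "08", "08")),
  count  = c(10, 30, 50, 70, 20, 40)
)

# Mean count for every season and hour combination present in the data
avg_by_group <- aggregate(count ~ season + hour, data = toy, FUN = mean)

print(avg_by_group)
```

With the formula interface, rows with a missing count are dropped by default before averaging.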
#### Looking over the bikes rented by weather and hour
```{r fig.height = 7, fig.width = 13}
# Get the average count of bikes rented by weather and hour
weather_summary_by_hour <- sqldf('select weather, hour, avg(count) as count from data group by weather, hour')
ggplot(data, aes(x = hour, y = count, color = weather)) +
  geom_point(data = weather_summary_by_hour, aes(group = weather)) +
  geom_line(data = weather_summary_by_hour, aes(group = weather)) +
  ggtitle("Bikes Rented by Weather and Hour") + scale_colour_hue('Weather', breaks = levels(data$weather), labels = c('Good', 'Normal', 'Bad', 'Very Bad'))
```
From this plot it is seen that:
1. People rent bikes the most when the weather is good and between 17 and 19 hours.
2. Bike rentals are lowest during the early hours of the morning and during bad weather.
#### Looking over the bikes rented by the weekdays and hour
```{r fig.height = 7, fig.width = 13}
# Get the average count of bikes rented by day and hour
day_summary_by_hour <- sqldf('select day, hour, avg(count) as count from data group by day, hour')
# Order the days Monday to Sunday so the legend matches the data (assumes English day names from weekdays())
day_summary_by_hour$day <- factor(day_summary_by_hour$day, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
ggplot(day_summary_by_hour, aes(x = hour, y = count, color = day, group = day)) + geom_point() + geom_line() +
  ggtitle("Bikes Rented by Day and Hour") + scale_colour_hue('Weekday')
```
Here we can see that:
1. More bikes are rented on weekdays during 7-9 hours in the morning and 17-19 hours in the evening.
2. On weekends, bike rentals are higher between 12 and 16 hours.