---
title: "Bike Sharing System Data Analytics"
author: "Yash Prakash"
date: "31 August 2019"
output: html_document
---
# Introduction
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people can rent a bike from one location and return it to a different one on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems make them attractive to researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city.


# The Problem 

In this project, we combine historical usage patterns with weather data to perform exploratory data analysis and multivariate analysis (hypothesis testing) on bike rental demand, and attempt to forecast that demand from the same variables.


## Loading the Dataset
```{r, include = FALSE}
train <- read.csv("train.csv")
test <- read.csv("test.csv")
```

## Viewing the features of the data 

The first few records of the bike-sharing data, with all of their columns, are as follows:

```{r cars}
head(train)
```

## Looking at the features of the data

- datetime - hourly date + timestamp
- season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor a holiday
- weather - one of four categories:
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals


The dimensions of the training set are:
```{r}
dim(train)
```


```{r}
head(test)
```

The dimensions of the test set are:
```{r}
dim(test)
```

## Combining test with train 
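The test set lacks the casual, registered and count columns, so they are added as placeholders before the two sets are stacked. Combining the sets lets the same feature transformations be applied to both.

```{r }
# Placeholder columns: NA marks the rental counts as unknown for the test rows,
# so they are left out of plots and averages instead of being counted as zeros
test$registered <- NA
test$casual <- NA
test$count <- NA
data <- rbind(train, test)
```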
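
Before stacking, it can help to confirm which columns the test set is missing. The following is a minimal sketch on two made-up data frames (`toy_train` and `toy_test` are illustrative names and values, not the real files). It pads the missing columns with `NA`, which is one possible placeholder choice.

```r
# Hypothetical miniature versions of the train and test sets
toy_train <- data.frame(temp = c(9.8, 9.0), casual = c(3L, 8L),
                        registered = c(13L, 32L), count = c(16L, 40L))
toy_test <- data.frame(temp = c(10.7, 11.5))

# Columns present in train but absent from test
missing_cols <- setdiff(names(toy_train), names(toy_test))

# Pad the missing columns with NA, then stack the two sets
toy_test[missing_cols] <- NA_integer_
toy_all <- rbind(toy_train, toy_test)
dim(toy_all)
```

Because the padded values are `NA` rather than a number, functions such as `boxplot()` or `mean(..., na.rm = TRUE)` leave the placeholder rows out.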


The dimensions of the dataframe 'data' are now:
```{r}
dim(data)
```

The structure of the data:
```{r}
str(data)
```


Here, the dependent variables are casual, registered and count (the total number of rentals).

## Data Visualisations

Plotting a histogram of every independent variable:

```{r fig.height = 7, fig.width = 13}
par(mfrow=c(4,2))
par(mar = rep(2, 4))
hist(data$season)
hist(data$weather)
hist(data$humidity)
hist(data$holiday)
hist(data$workingday)
hist(data$temp)
hist(data$atemp)
hist(data$windspeed)
```


### Analysing probability distributions

1. For the weather variable
```{r}
prop.table(table(data$weather))
``` 
 
Thus, weather category 1 (mostly clear weather) accounts for the largest share of the hourly records.


2. For the working day variable
```{r}
prop.table(table(data$workingday))
```

Thus, working days account for the larger share of the records.


3. Holiday or not
```{r}
prop.table(table(data$holiday))
```

Thus, most of the records fall on non-holidays.


4. Seasons variable
```{r}
prop.table(table(data$season))
```

Thus, the records are almost equally distributed across all 4 seasons.
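
Note that `prop.table(table(...))` reports the share of hourly records in each category, not the number of rentals. To compare rental volume across categories, the counts themselves have to be aggregated. A minimal sketch on invented numbers (`toy` and its values are made up purely for illustration):

```r
# Made-up hourly records: weather category and total rentals
toy <- data.frame(weather = c(1, 1, 1, 2, 2, 3),
                  count   = c(200, 180, 220, 150, 130, 40))

# Share of records per weather category (what prop.table shows)
record_share <- prop.table(table(toy$weather))

# Mean rentals per record in each weather category
mean_rentals <- tapply(toy$count, toy$weather, mean)

record_share
mean_rentals
```

The two results answer different questions: the first says how often each weather category occurs, the second how many bikes are rented when it does.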


### Converting the discrete variables into factors (season, weather, holiday, workingday)

```{r}
data$season<-as.factor(data$season)
data$weather<-as.factor(data$weather)
data$holiday<-as.factor(data$holiday)
data$workingday<-as.factor(data$workingday)
```
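
The integer codes can also be given readable labels at conversion time. A small sketch, using the season coding listed earlier; `season_codes` is an illustrative vector rather than the real column:

```r
# Illustrative season codes (1 = spring, 2 = summer, 3 = fall, 4 = winter)
season_codes <- c(1, 4, 2, 3, 1)

# levels fixes the order of the codes; labels names them in that order
season_f <- factor(season_codes, levels = 1:4,
                   labels = c("spring", "summer", "fall", "winter"))

levels(season_f)
table(season_f)
```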

Viewing the structure of the new data:
```{r}
str(data)
```

Here, the categorical variables are datetime, season, holiday, workingday and weather.

The numerical variables are temp, atemp, humidity and windspeed.


### Hypothesis Testing (multivariate analysis of the data)

#### Hypothesis 1: Hourly Trend
We expect high demand during office commuting hours. Early morning and late evening may follow a different trend (cyclists), and demand should be low between 10:00 pm and 4:00 am.

The hour is extracted from the datetime column:
```{r}
data$hour<-substr(data$datetime,12,13)
data$hour<-as.factor(data$hour)
```

Splitting the combined data back into train (days 1-19 of each month) and test (day 20 onwards):
```{r}
train<-data[as.integer(substr(data$datetime,9,10))<20,]
test<-data[as.integer(substr(data$datetime,9,10))>19,]
```
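
The substring positions used above depend on every timestamp having the form `YYYY-MM-DD HH:MM:SS`. An alternative sketch, assuming that same format, parses the timestamp once and derives the hour and the day of month from it. The `toy` data frame and the names `stamp`, `toy_train` and `toy_test` are invented for illustration.

```r
toy <- data.frame(datetime = c("2011-01-01 00:00:00", "2011-01-19 17:00:00",
                               "2011-01-20 08:00:00"),
                  stringsAsFactors = FALSE)

# Parse once; a fixed time zone keeps the result reproducible
stamp <- as.POSIXct(toy$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

toy$hour <- factor(format(stamp, "%H"))      # two-digit hour of day
toy$mday <- as.integer(format(stamp, "%d"))  # day of month

# Days 1-19 go to training, day 20 onwards to test
toy_train <- toy[toy$mday < 20, ]
toy_test  <- toy[toy$mday >= 20, ]
```

Parsing fails visibly (the result is `NA`) if a timestamp does not match the expected format, whereas `substr()` would silently return the wrong characters.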

Visualising it: 
```{r fig.height = 7, fig.width = 10}
par(mfrow = c(1,1))
boxplot(train$count~train$hour,xlab="hour", ylab="count of users")
```

Thus, usage follows this trend:

- High: 7-9 and 17-19 hours
- Average: 10-16 hours
- Low: 0-6 and 20-23 hours


#### Hypothesis 2: Weekday Trend

We check whether registered users use the bike system more on weekdays than on weekends.

```{r}
date <- substr(data$datetime,1,10)
days<-weekdays(as.Date(date))
data$day=days
```

Visualising it:

```{r}
boxplot(data$registered ~ data$day, xlab = "day", ylab = "count of registered users")
```

```{r}
boxplot(data$casual ~ data$day, xlab = "day", ylab = "count of casual users")
```

Thus, casual users rent bikes more on weekends, while registered users do so more on weekdays.
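
The boxplots give a visual impression only. One way such a difference could be checked formally is a two-sample Wilcoxon rank-sum test, sketched below on fabricated numbers. The data, the variable names and the choice of test are all assumptions made for illustration. The hourly records in the real data are not independent of one another, so a result obtained on them should be read with caution.

```r
# Fabricated daily totals of registered rentals (not the real data)
dates <- seq(as.Date("2011-01-03"), by = "day", length.out = 14)  # starts on a Monday
registered <- c(310, 325, 340, 330, 315, 120, 110,
                305, 335, 345, 320, 300, 130, 115)

# %u gives the ISO weekday number (1 = Monday ... 7 = Sunday), independent of locale
is_weekend <- as.integer(format(dates, "%u")) >= 6

# Two-sided rank-sum test of weekday totals against weekend totals
res <- wilcox.test(registered[!is_weekend], registered[is_weekend])
res$p.value
```

A small p-value here would indicate that weekday and weekend totals differ by more than chance alone would explain in this toy sample.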


#### Hypothesis 3: 'Rain' Trend

People prefer to rent bikes in clear weather rather than in rainy weather. For total users:

```{r}
boxplot(data$count ~ data$weather, xlab= "Weather conditions", ylab = "Count of total users")
```

For registered users

```{r}
boxplot(data$registered ~ data$weather, xlab= "Weather conditions", ylab = "Count of registered users")
```

For casual users

```{r}
boxplot(data$casual ~ data$weather, xlab= "Weather conditions", ylab = "Count of casual users")
```


#### Hypothesis 4: The correlation table of continuous variables

We examine how wind speed, temperature and humidity correlate with the registered, casual and total user counts.

Making a subset of the data and computing the correlation table:

```{r}
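# Select the dependent and continuous variables from the training rows
# (named cor_vars so that base R's subset() function is not masked)
cor_vars <- train[, c("registered", "casual", "count", "temp",
                      "humidity", "atemp", "windspeed")]
cor_sub <- cor(cor_vars)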
```

Visualising the correlations:

```{r}
library(corrplot)
corrplot(cor_sub, method = 'color')
```

The inferences are:

1. atemp is highly correlated with temp.
2. temp is positively correlated with the dependent variables (more strongly with casual than with registered).
3. windspeed has a lower correlation with the dependent variables than temp and humidity do.
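
The colour plot shows relative strength, but exact coefficients can be read directly from the matrix returned by `cor()`. A minimal sketch with invented values (`toy` is not the real data):

```r
toy <- data.frame(temp      = c(10, 14, 18, 22, 26),
                  atemp     = c(12, 16, 20, 24, 28),  # temp + 2, so perfectly correlated
                  windspeed = c(15, 5, 20, 0, 10))

cor_toy <- cor(toy)

# Look up a single pair by row and column name
cor_toy["temp", "atemp"]

# Absolute correlation of every variable with temp, strongest first
sort(abs(cor_toy[, "temp"]), decreasing = TRUE)
```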


### Multi-Variable Analysis

#### Looking over the count of Bikes Rented by hour and season

```{r fig.height = 7, fig.width = 13}
library(sqldf)
library(ggplot2)
# Get the average count of bikes rented by season, hour
season_summary_by_hour <- sqldf('select season, hour, avg(count) as count from data group by season, hour')

ggplot(data, aes(x = hour, y = count, color = season)) +
  geom_point(data = season_summary_by_hour, aes(group = season)) +
  geom_line(data = season_summary_by_hour, aes(group = season)) +
  ggtitle("Bikes Rented by Season and Hour") +
  scale_colour_hue('Season', breaks = levels(data$season),
                   labels = c('spring', 'summer', 'fall', 'winter'))
```

Here, we see that most bikes are rented between 17 and 19 hours in fall and summer, with the fewest rented in spring.
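
The same kind of grouped average can be computed without SQL. The sketch below uses base R's `aggregate()` on a small invented data frame (`toy`), and is meant as an equivalent formulation rather than a replacement for the query above.

```r
toy <- data.frame(season = factor(c(1, 1, 1, 2, 2, 2)),
                  hour   = factor(c("08", "08", "17", "08", "17", "17")),
                  count  = c(100, 140, 300, 200, 400, 420))

# Mean rentals for each season/hour combination
avg_by_season_hour <- aggregate(count ~ season + hour, data = toy, FUN = mean)
avg_by_season_hour
```

With the formula interface, rows containing `NA` in any of the named columns are dropped before averaging.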


#### Looking over the bikes rented by weather and hour

```{r fig.height = 7, fig.width = 13}
# Get the average count of bikes rented by weather, hour
weather_summary_by_hour <- sqldf('select weather, hour, avg(count) as count from data group by weather, hour')

ggplot(data, aes(x = hour, y = count, color = weather)) +
  geom_point(data = weather_summary_by_hour, aes(group = weather)) +
  geom_line(data = weather_summary_by_hour, aes(group = weather)) +
  ggtitle("Bikes Rented by Weather and Hour") +
  scale_colour_hue('Weather', breaks = levels(data$weather),
                   labels = c('Good', 'Normal', 'Bad', 'Very Bad'))
```


From this plot, we can see that:

1. People rent the most bikes when the weather is good and between 17 and 19 hours.
2. Bike rentals are lowest during the early hours of the morning and during bad weather.


#### Looking over the bikes rented by the weekdays and hour

```{r fig.height = 7, fig.width = 13}
# Get the average count of bikes rented by day, hour
day_summary_by_hour <- sqldf('select day, hour, avg(count) as count from data group by day, hour')

# Order the days Monday to Sunday (assumes weekdays() returned English day names);
# data$day is a character column, so levels(data$day) would be NULL
day_levels <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
day_summary_by_hour$day <- factor(day_summary_by_hour$day, levels = day_levels)

ggplot(day_summary_by_hour, aes(x = hour, y = count, color = day, group = day)) +
  geom_point() +
  geom_line() +
  ggtitle("Bikes Rented by Day and Hour") +
  scale_colour_hue('Weekday')
```

Here we can see that:

1. More bikes are rented on weekdays between 7 and 9 hours in the morning and between 17 and 19 hours in the evening.
2. More bikes are rented on weekends between 12 and 16 hours.