---
title: 'Cyclistic and Data Visualization: "Advanced, Straightforward, and Peeled"
(Case Study)'
author: "Praveen Choragudi"
date: "`r Sys.Date()`"
output: html_document
---
## Cyclistic Bikes Full Year Analysis from Q4 of 2021 to Q3 of 2022
This analysis is based on Kevin Hartman's Divvy case study, "'Sophisticated, Clear, and Polished': Divvy and Data Visualization," which can be found at https://artscience.blog/home/divvy-dataviz-case-study. The goal of this script is to consolidate the downloaded Cyclistic data into a single dataframe and then conduct a simple analysis to help answer the key question: "How do members and casual riders use Cyclistic bikes differently?"
Welcome to the case study on the bike-sharing programme of Cyclistic, a fictional company. We will use the steps of the data analysis process (ask, prepare, process, analyse, communicate, and act) to answer the important business questions. The Case Study Roadmap tables, which include guiding questions and key tasks, can help you stay on track.
Load the required packages:
* tidyverse for data import and wrangling
* lubridate for date functions
* ggplot2 for visualization
```{r}
library(tidyverse) #helps wrangle data
library(lubridate) #helps wrangle date attributes
library(ggplot2) #helps visualize data
getwd() #displays your working directory
```
### Preparing quarterly data by merging multiple CSV files
#### Preparing the file for Q4 of 2021
```{r}
library(data.table)
setwd("~/Desktop/Cyclistic_Bikes/Cyclistic_Bike_Share/Data/Trips/2021")
files <- list.files(pattern = "\\.csv$") # match only files ending in .csv
temp <- lapply(files, fread, sep = ",")
data <- rbindlist(temp)
write.csv(data, file = "Cyclistic_Trips_2021_Q4.csv", row.names = FALSE)
```
#### Preparing the file for Q1 of 2022
```{r}
library(data.table)
setwd("~/Desktop/Cyclistic_Bikes/Cyclistic_Bike_Share/Data/Trips/2022/Q1")
files <- list.files(pattern = "\\.csv$") # match only files ending in .csv
temp <- lapply(files, fread, sep = ",")
data <- rbindlist(temp)
write.csv(data, file = "Cyclistic_Trips_2022_Q1.csv", row.names = FALSE)
```
#### Preparing the file for Q2 of 2022
```{r}
library(data.table)
setwd("~/Desktop/Cyclistic_Bikes/Cyclistic_Bike_Share/Data/Trips/2022/Q2")
files <- list.files(pattern = "\\.csv$") # match only files ending in .csv
temp <- lapply(files, fread, sep = ",")
data <- rbindlist(temp)
write.csv(data, file = "Cyclistic_Trips_2022_Q2.csv", row.names = FALSE)
```
#### Preparing the file for Q3 of 2022
```{r}
library(data.table)
setwd("~/Desktop/Cyclistic_Bikes/Cyclistic_Bike_Share/Data/Trips/2022/Q3")
files <- list.files(pattern = "\\.csv$") # match only files ending in .csv
temp <- lapply(files, fread, sep = ",")
data <- rbindlist(temp)
write.csv(data, file = "Cyclistic_Trips_2022_Q3.csv", row.names = FALSE)
```
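The four quarterly chunks above repeat the same merge-and-write pattern. As a sketch, they could be consolidated into a small helper function (shown here in base R; the function name and arguments are illustrative, and `data.table::fread` could be substituted for speed on large files):

```{r}
# Hypothetical helper: merge every CSV in `dir` into one data frame
# and write the result to `out_file`.
merge_quarter <- function(dir, out_file) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  data <- do.call(rbind, lapply(files, read.csv))
  write.csv(data, out_file, row.names = FALSE)
  invisible(data)
}

# Example call (the path is an assumption):
# merge_quarter("~/Desktop/Cyclistic_Bikes/Cyclistic_Bike_Share/Data/Trips/2022/Q1",
#               "Cyclistic_Trips_2022_Q1.csv")
```

Calling the helper once per quarter also avoids the repeated `setwd()` calls, which in any case only apply within a single knitr chunk.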
## STEP 1: COLLECT DATA
```{r}
# Load the merged quarterly datasets (adjust the paths if the merged
# files were written to a different directory in the chunks above)
q4_2021 <- read_csv("Cyclistic_Trips_2021_Q4.csv")
q1_2022 <- read_csv("Cyclistic_Trips_2022_Q1.csv")
q2_2022 <- read_csv("Cyclistic_Trips_2022_Q2.csv")
q3_2022 <- read_csv("Cyclistic_Trips_2022_Q3.csv")
```
## STEP 2: WRANGLING DATA AND COMBINING INTO A SINGLE FILE
#### Comparing the column names of each file. While the names don't have to be in the same order, they DO need to match perfectly before we can use a command to join the files into one.
```{r}
colnames(q4_2021)
colnames(q1_2022)
colnames(q2_2022)
colnames(q3_2022)
```
#### Inspecting the dataframes and looking for incongruencies
```{r}
str(q4_2021)
str(q1_2022)
str(q2_2022)
str(q3_2022)
```
#### Stacking the individual quarterly data frames into one big data frame
```{r}
all_trips <- bind_rows(q4_2021, q1_2022, q2_2022, q3_2022)
```
#### Removing the latitude and longitude fields, as they are not needed for this analysis
```{r}
all_trips <- all_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng))
```
## STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS
### Inspecting the new table that has been created
```{r}
colnames(all_trips) #List of column names
nrow(all_trips) #How many rows are in data frame?
dim(all_trips) #Dimensions of the data frame?
head(all_trips) #See the first 6 rows of data frame. Also tail(all_trips)
str(all_trips) #See list of columns and data types (numeric, character, etc)
summary(all_trips) #Statistical summary of data. Mainly for numerics
```
##### Adding columns that list the date, month, day, and year of each ride. This allows us to aggregate ride data by month, day, or year; before these operations we could only aggregate at the ride level. More on date formats in R can be found at this [link](https://www.statmethods.net/input/dates.html).
```{r warning=FALSE}
all_trips$date <- as.Date(all_trips$started_at) # the default format is yyyy-mm-dd
all_trips$month <- format(all_trips$date, "%m")
all_trips$day <- format(all_trips$date, "%d")
all_trips$year <- format(all_trips$date, "%Y")
all_trips$day_of_week <- format(all_trips$date, "%A")
```
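Since lubridate is already loaded, the same columns could also be derived with its accessor functions; a minimal sketch on a single timestamp (the example date is an assumption):

```{r}
library(lubridate)

ts <- ymd_hms("2022-07-04 10:30:00")
month(ts) # 7
day(ts)   # 4
year(ts)  # 2022
wday(ts, label = TRUE, abbr = FALSE) # full weekday name, as an ordered factor
```

One advantage: `wday(..., label = TRUE)` returns an ordered factor, so the manual reordering of `day_of_week` done later in Step 4 would not be needed.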
#### Adding a "ride_length" [calculation](https://stat.ethz.ch/R-manual/R-devel/library/base/html/difftime.html) to all_trips (in seconds)
```{r}
# units = "secs" guarantees the difference is reported in seconds rather
# than whatever unit difftime() picks automatically
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs")
```
### Inspecting the structure of the columns
```{r}
str(all_trips)
```
### Converting "ride_length" from difftime to numeric so we can run calculations on the data
```{r}
all_trips$ride_length <- as.numeric(all_trips$ride_length)
is.numeric(all_trips$ride_length)
```
#### Removing "bad" data. The [dataframe](https://www.datasciencemadesimple.com/delete-or-drop-rows-in-r-with-conditions-2/) includes entries with a zero or negative ride_length, for example when bikes were taken out of docks and checked for quality by Cyclistic. We will create a new version of the dataframe (v2) since data is being removed.
```{r}
all_trips_v2 <- all_trips[!(all_trips$ride_length <= 0), ]
```
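One subtlety of this base-R filter worth noting: logical subsetting keeps rows where the condition evaluates to NA (they come through as all-NA rows), so if ride_length could contain missing values they should be excluded explicitly. A small illustration with made-up values:

```{r}
# Toy data frame with one missing ride length
df <- data.frame(ride_length = c(-5, NA, 60))

# The pattern used above keeps the NA row (filled with NAs): 2 rows
nrow(df[!(df$ride_length <= 0), ])

# Excluding NAs explicitly keeps only genuinely positive rides: 1 row
nrow(df[!is.na(df$ride_length) & df$ride_length > 0, ])
```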
## STEP 4: CONDUCT DESCRIPTIVE ANALYSIS
### Descriptive analysis on ride_length (all figures in seconds)
```{r}
mean(all_trips_v2$ride_length) #straight average (total ride length / rides)
median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengths
max(all_trips_v2$ride_length) #longest ride
min(all_trips_v2$ride_length) #shortest ride
```
#### We can condense the four lines above to one line using summary() on the specific attribute
```{r}
summary(all_trips_v2$ride_length)
```
#### Comparing members and casual users
```{r}
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
```
#### We can see the average ride time by each day for members vs casual users
```{r}
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
```
#### Notice that the days of the week are out of order. Let's fix that.
```{r}
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
```
#### Now, let's run the average ride time by each day for members vs casual users
```{r}
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
```
```{r}
str(all_trips_v2)
```
#### Analyzing ridership data by type and weekday
```{r}
all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>% # groups by user type and weekday
  summarise(number_of_rides = n(), # calculates the number of rides
            average_duration = mean(ride_length)) %>% # calculates the average duration
  arrange(member_casual, day_of_week) # sorts
```
#### Let's visualize the number of rides by rider type
```{r}
all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, day_of_week) %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")
```
#### Let's create a visualization for average duration
```{r}
all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, day_of_week) %>%
  ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")
```
## STEP 5: EXPORT SUMMARY FILE FOR FURTHER ANALYSIS
```{r}
counts <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
write.csv(counts, file = "~/Desktop/Cyclistic_Bikes/avg_ride_length.csv", row.names = FALSE)
chart1 <- read.csv("~/Desktop/Cyclistic_Bikes/avg_ride_length.csv")
colnames(chart1) <- c("User_Type", "Day_of_the_Week", "Trip_Duration_in_Seconds")
colnames(chart1)
ggplot(data = chart1) +
  geom_point(mapping = aes(x = Day_of_the_Week, y = Trip_Duration_in_Seconds, color = User_Type, shape = User_Type)) +
  labs(title = "Usage by Members and Casual Riders",
       subtitle = "Average trip duration by user type and day of the week",
       caption = "Data is from Q4 2021 and Q1-Q3 2022")
```
###### THANK YOU