---
title: "Group Project 2 - Exploratory Data Analysis"
author: "Mohamed Jalaly"
date: "3/15/2022"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(ggplot2)
library(tidyr)
library(dplyr)
library(kableExtra)
library(scales)
library(skimr)
library(magrittr)
animals = read_csv("dataset11.csv")
animals_with_date = read_csv("dataset11_with_date.csv")
```
# Exploratory Data Analysis
First, we look at the distribution of animal types in the shelter dataset. This shows whether a single animal type dominates the data and might therefore be driving the fit of a model.
```{r animaltypes}
# Counts and proportions by animal type
animals %>%
  group_by(animal_type) %>%
  summarize(counts = n()) %>%
  mutate(proportion = percent(counts / sum(counts))) %>%
  kable()

# Share of each animal type; after_stat() replaces the deprecated ..count.. notation
ggplot(animals, mapping = aes(x = animal_type, fill = factor(animal_type))) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = percent) +
  labs(title = "Frequency of Animal Type in Shelter Data",
       x = "Animal Type", y = "Percentage", fill = "Animal Type")
```
Because the Birds, Wildlife, and Livestock animal types have very few observations, we suggest combining these three categories into a single category called 'others' for the formal analysis.
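
The suggested recoding could look roughly like the sketch below. It runs on a small toy data frame rather than the shelter data, and the level names (`"BIRD"`, `"WILDLIFE"`, `"LIVESTOCK"`) and the new column `animal_group` are assumptions for illustration; the actual labels in `animal_type` should be checked first. Base R is used so the sketch has no package dependencies.

```r
# Toy data standing in for the shelter dataset (level names are assumed)
toy <- data.frame(
  animal_type = c("DOG", "CAT", "BIRD", "WILDLIFE", "LIVESTOCK", "DOG", "CAT"),
  stringsAsFactors = FALSE
)

# Hypothetical set of rare levels to merge; adjust to the real labels
rare_types <- c("BIRD", "WILDLIFE", "LIVESTOCK")

# Keep the common types as they are and relabel the rare ones as "others"
toy$animal_group <- ifelse(toy$animal_type %in% rare_types,
                           "others", toy$animal_type)

# Tabulate the collapsed variable
type_counts <- table(toy$animal_group)
print(type_counts)
```

Storing the result in a new column keeps the original `animal_type` available for the exploratory plots, while the collapsed variable can be used in the formal analysis.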
Next, we investigate how the animals are distributed by intake type, that is, by the way they arrived at the shelter.
```{r intaketype}
# animals$date = as.yearmon(paste(animals$year, animals$month), "%Y %m")
# Counts and proportions by intake type
animals %>%
  group_by(intake_type) %>%
  summarize(counts = n()) %>%
  mutate(proportion = percent(counts / sum(counts))) %>%
  kable()

# Share of each intake type
ggplot(animals, mapping = aes(x = intake_type, fill = factor(intake_type))) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = percent) +
  labs(title = "Frequency of Intake Type in Shelter Data",
       x = "Intake Type", y = "Percentage", fill = "Intake Type")

# Share of each intake type, broken down by animal type
ggplot(animals, mapping = aes(x = intake_type, fill = factor(animal_type))) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = percent) +
  labs(title = "Frequency of Intake Type by Animal Type in Shelter Data",
       x = "Intake Type", y = "Percentage", fill = "Animal Type")
```
We also want to look at the trend in animal arrivals at the shelter over time.
```{r arrivaldistribution, eval=FALSE, echo=TRUE}
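# Add a year-month date column.
# as.yearmon() comes from the zoo package, which is not loaded in the setup chunk,
# so it is called with an explicit namespace here.
animals_with_date = animals
animals_with_date$date = zoo::as.yearmon(paste(animals$year, animals$month), "%Y %m")

# Number of animals arriving per month
animals_with_date %>%
  group_by(date) %>%
  summarize(Animals_arriving = n()) %>%
  ggplot(mapping = aes(x = date, y = Animals_arriving)) +
  geom_line(col = "blue") +
  labs(title = "Animal Arrivals Time Series", x = "Date", y = "Animal Arrivals")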
```
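
Since the chunk above is not evaluated, here is a minimal sketch of the same monthly aggregation that avoids the `zoo` dependency. It represents each year-month by the first day of that month and uses base R only. The toy values are invented for illustration, and it assumes that `year` and `month` are stored as whole numbers.

```r
# Toy intake records; the year and month columns mirror those used above
toy <- data.frame(
  year  = c(2016, 2016, 2016, 2017, 2017),
  month = c(11, 11, 12, 1, 1)
)

# Represent each year-month by the first day of that month
toy$date <- as.Date(sprintf("%d-%02d-01",
                            as.integer(toy$year), as.integer(toy$month)))

# Count arrivals per month: one row per record, summed within each date
toy$n <- 1L
arrivals <- aggregate(n ~ date, data = toy, FUN = sum)
print(arrivals)
```

A `Date` column of this kind can be mapped to the x-axis of a `ggplot2` line chart in the same way as the `date` column above.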
Next, we look at the summary statistics and the distribution of time_at_shelter, which is our variable of interest.
```{r summstat}
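# Summary statistics for the time spent at the shelter
animals %>%
  select(time_at_shelter) %>%
  skim() %>%
  kable()

# Histogram of the time spent at the shelter (bins of width 3)
animals %>%
  select(time_at_shelter) %>%
  ggplot(mapping = aes(x = time_at_shelter)) +
  geom_histogram(col = "white", binwidth = 3)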
```
Finally, because outcome_type is only determined at the end of an animal's stay, it may not have a direct influence on time_at_shelter. It is still worth investigating whether the two variables are related.
```{r outcome}
# Boxplots of time at shelter, one per outcome type
# (ggplot ignores dplyr grouping, so no group_by() is needed here)
animals %>%
  ggplot(mapping = aes(y = time_at_shelter, color = factor(outcome_type))) +
  geom_boxplot()

# Mean, median, and maximum time at shelter by outcome type
animals %>%
  group_by(outcome_type) %>%
  summarize(mean.time.at.shelter = round(mean(time_at_shelter), 0),
            median.time.at.shelter = round(median(time_at_shelter), 2),
            max.time.at.shelter = round(max(time_at_shelter), 2)) %>%
  kable()
```