---
title: "Summaries"
date: "2019-03-18"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
library(tidyverse)
library(nycflights13)
```
The dplyr summarise() function can be used to calculate counts and proportions of logical values, for example sum(x > 10) or mean(y == 0).
When a logical vector is passed to a numeric function, TRUE is converted to 1 and FALSE to 0.
> This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.
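A minimal sketch of that conversion, using a made-up vector `x` (not taken from the flights data): comparing `x` with 10 gives a logical vector, and `sum()` and `mean()` then count and average the TRUE values.

```{r}
# Hypothetical example vector (an assumption for illustration only)
x <- c(3, 12, 25, 7, 10)

over_10 <- x > 10            # FALSE TRUE TRUE FALSE FALSE
n_over <- sum(over_10)       # number of values greater than 10
prop_over <- mean(over_10)   # proportion of values greater than 10

n_over
prop_over
```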
### Exercises
1. Using nycflights13 "flights" dataset, brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
```{r}
flights
```
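One possible way to approach the brainstorming question is to compute several summaries side by side. The sketch below is only an illustration: `delay_summary()` is a hypothetical helper, and `toy_delays` is made-up data standing in for flights, so that the numbers are easy to check by hand.

```{r}
library(dplyr)

# Hypothetical helper: several views of the "typical" arrival delay.
# Works on any (optionally grouped) data frame with an arr_delay column.
delay_summary <- function(df) {
  df %>%
    summarise(
      n            = n(),
      mean_delay   = mean(arr_delay, na.rm = TRUE),
      median_delay = median(arr_delay, na.rm = TRUE),
      sd_delay     = sd(arr_delay, na.rm = TRUE),
      q90_delay    = quantile(arr_delay, 0.9, na.rm = TRUE, names = FALSE),
      prop_late15  = mean(arr_delay >= 15, na.rm = TRUE)
    )
}

# Made-up data: flight "A" is 15 minutes early or late half the time,
# flight "B" is always 10 minutes late
toy_delays <- tibble(
  flight    = c("A", "A", "A", "A", "B", "B"),
  arr_delay = c(-15, 15, -15, 15, 10, 10)
)

toy_summary <- toy_delays %>%
  group_by(flight) %>%
  delay_summary()
toy_summary
```

The same helper could be applied to the real data with `flights %>% group_by(flight) %>% delay_summary()`. Note that flight "A" has a mean delay of zero even though it is never on time, which is why the spread and the proportion of late flights are worth looking at alongside the mean.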
Consider the following scenarios:
- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
```{r}
flights %>%
  group_by(flight) %>%
  summarize(
    n = n(),
    early15 = mean(arr_delay <= -15, na.rm = TRUE),
    late15 = mean(arr_delay >= 15, na.rm = TRUE)
  ) %>%
  filter(early15 == 0.5 & late15 == 0.5)
```
- A flight is always 10 minutes late.
```{r}
flights %>%
  group_by(flight) %>%
  summarize(
    n = n(),
    late10 = mean(arr_delay >= 10, na.rm = TRUE)
  ) %>%
  filter(late10 == 1)
```
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
```{r}
flights %>%
  group_by(flight) %>%
  summarize(
    n = n(),
    early30 = mean(arr_delay <= -30, na.rm = TRUE),
    late30 = mean(arr_delay >= 30, na.rm = TRUE)
  ) %>%
  filter(early30 == 0.5 & late30 == 0.5)
```
- 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
```{r}
flights %>%
  group_by(flight) %>%
  summarize(
    n = n(),
    ontime = mean(arr_delay == 0, na.rm = TRUE),
    late120 = mean(arr_delay >= 120, na.rm = TRUE)
  ) %>%
  # near() avoids exact floating-point comparison of the proportions
  filter(near(ontime, 0.99) & near(late120, 0.01))
```
- Which is more important: arrival delay or departure delay?
Arrival delay, because a late arrival can mean, for example, missing a connecting flight.
2. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
```{r}
not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
```
```{r}
not_cancelled %>% count(dest)

# Same result without count(): group, then count the rows in each group
not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())
```
```{r}
not_cancelled %>% count(tailnum, wt = distance)

# Same result without count(): a weighted count is the sum of the weights
not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))
```
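A quick way to convince yourself that the weighted count and the grouped sum agree is to compare them on a small table. This is a sketch on made-up data (`toy_flights` is a hypothetical stand-in for not_cancelled), not a check run on the real flights.

```{r}
library(dplyr)

# Made-up stand-in for not_cancelled
toy_flights <- tibble(
  dest     = c("ORD", "ORD", "LAX", "BOS"),
  tailnum  = c("N1", "N2", "N1", "N1"),
  distance = c(700, 700, 2500, 200)
)

by_count <- toy_flights %>% count(tailnum, wt = distance)
by_summary <- toy_flights %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))

# Compare the key column and the totals of the two approaches
same_keys <- identical(by_count$tailnum, by_summary$tailnum)
same_totals <- isTRUE(all.equal(by_count$n, by_summary$n))
same_keys
same_totals
```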
3. Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?
If a flight departs but does not arrive at its destination, it has not been cancelled (for example, it may have been diverted elsewhere for some reason). dep_delay is therefore the more important column for deciding whether a flight was cancelled.
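The distinction can be made explicit by labelling each row. The sketch below uses made-up delays (`toy_status` and the status labels are assumptions for illustration) and treats a missing dep_delay as a cancellation and a missing arr_delay alone as a flight that departed but has no recorded arrival.

```{r}
library(dplyr)

# Made-up data: NA marks a missing departure or arrival record
toy_status <- tibble(
  dep_delay = c(5, NA, 12, -3),
  arr_delay = c(9, NA, NA, -10)
)

status_counts <- toy_status %>%
  mutate(
    flight_status = case_when(
      is.na(dep_delay) ~ "cancelled",             # never departed
      is.na(arr_delay) ~ "departed, no arrival",  # e.g. diverted
      TRUE             ~ "completed"
    )
  ) %>%
  count(flight_status)
status_counts
```

Replacing `toy_status` with `flights` would show how many rows the original definition counts as cancelled even though the flight did depart.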
4. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
```{r}
cancelled_flights <-
  flights %>%
  mutate(cancelled = is.na(dep_delay)) %>%
  group_by(year, month, day) %>%
  summarise(
    prop_cancelled = mean(cancelled),
    avg_delay = mean(dep_delay, na.rm = TRUE)
  )
cancelled_flights
```
```{r}
ggplot(cancelled_flights, aes(x = avg_delay, y = prop_cancelled)) +
  geom_point(size = 0.5) +
  geom_smooth(se = FALSE)
```
On days with more cancelled flights, the delays are also larger.
5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
```{r}
flights %>%
  group_by(carrier) %>%
  summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(arr_delay))
```
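For the challenge part, one possible approach (an assumption, not the only way to do it) is to compare each flight's delay with the average delay of all flights to the same destination, and then average that difference per carrier. The sketch uses made-up data; `toy_routes`, `dest_mean` and `mean_vs_route` are hypothetical names.

```{r}
library(dplyr)

# Made-up data: two carriers flying to two destinations
toy_routes <- tibble(
  carrier   = c("AA", "AA", "UA", "UA", "AA", "UA"),
  dest      = c("ORD", "ORD", "ORD", "ORD", "LAX", "LAX"),
  arr_delay = c(10, 20, 30, 40, 0, 10)
)

relative_delay <- toy_routes %>%
  group_by(dest) %>%
  mutate(dest_mean = mean(arr_delay, na.rm = TRUE)) %>%  # baseline per destination
  ungroup() %>%
  group_by(carrier) %>%
  summarise(
    mean_delay    = mean(arr_delay, na.rm = TRUE),
    mean_vs_route = mean(arr_delay - dest_mean, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_vs_route))
relative_delay
```

This only separates the two effects for destinations served by more than one carrier. Where a single carrier flies a route, its delay and the destination's delay are the same number, which is what the hint about group_by(carrier, dest) points at.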
6. What does the sort argument to count() do? When might you use it?
It sorts the result by the count in descending order. Use it instead of a separate arrange() call.
```{r}
flights %>%
  count(flight, sort = TRUE)
```
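To illustrate the "instead of arrange()" point, the sketch below runs both forms on a made-up table (`toy_counts` is a hypothetical name) so the two results can be compared directly.

```{r}
library(dplyr)

# Made-up data: flight 3 appears three times, flight 2 twice, flight 1 once
toy_counts <- tibble(flight = c(1, 2, 2, 3, 3, 3))

sorted_count <- toy_counts %>% count(flight, sort = TRUE)
arranged_count <- toy_counts %>% count(flight) %>% arrange(desc(n))

sorted_count
arranged_count
```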