-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
255 lines (200 loc) · 13.5 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
---
title: "Retrospective strength of schedule in the NBA (shiny app)"
output: github_document
---
```{r load, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
# Clear Workspace
rm(list = ls())
options(scipen = 999)
options(stringsAsFactors = FALSE)
# load packages
library(formatR)
library(tidyverse)
library(ggthemes)
library(lubridate)
library(ggrepel)
library(knitr)
library(kableExtra)
library(patchwork)
library(stringi)
library(ggridges)
library(viridis)
local <- TRUE
# read data
if (local) {
eloDat <- read.table("538elo.csv", header = TRUE, dec = ".", sep = ";")
rm(local)
} else {
url538 <- "https://projects.fivethirtyeight.com/nba-model/nba_elo.csv"
eloDat <- read.table(url538, header = TRUE, sep = ",", dec = ".", na.strings = "")
eloDat$date <- ymd(eloDat$date)
}
# Double the data
eloDatRe <- eloDat
colnames(eloDatRe) <- stri_replace_all_fixed(colnames(eloDatRe),
c(1,2), c(2,1), mode = "first")
eloList <- list(eloDat, eloDatRe)
eloDat <- data.table::rbindlist(eloList, use.names=TRUE, idcol=TRUE)
# Make conference dat
confDat <- data.frame(team1 = sort(unique(eloDat$team1[eloDat$season >= 1977])), conference = c(rep("East",
8), "West", "West", "East", "West", "West", "East", "West", rep("West", 3), "East", "East", "West",
"East", "East", "West", "West", "East", "East", "West", "East", "East", rep("West", 6), "East", "West",
"West", "East", "East"))
# Pre-Calc data
tempElo <- eloDat %>% mutate(playoffgame = !is.na(playoff)) %>% group_by(season, team1) %>% summarise_if(is.logical,
sum) %>% right_join(eloDat, by = c("season", "team1")) %>% mutate(playoffteam = playoffgame > 0) %>%
filter(season >= 1977) %>% left_join(confDat, by = "team1") %>% mutate(playoffgame = !is.na(playoff))
```
[Link to shiny app](https://georgeblck.shinyapps.io/oppoElo/)
During the regular season a common measure of a teams remaining strength of schedule (SOS) is the combined winning percentage of
those opponents [(Example)](http://www.tankathon.com/remaining_schedule_strength). A high winning percentage means the opponents you still
have to face are probably tough and vice-versa.
There are two ideas I want to apply in this work that are in a similar vein:
1. Substitute the winning percentage with a more adequate measure of performance: the Elo score.
2. Instead of a *remaining* SOS, calculate a *retrospective* SOS.
The first addition is self explanatory and through 538s great work very easy to implement. The latter idea stems from my interest in questions like
* Who had the hardest or easiest season in 2019?
* Who had the hardest or easiest season *ever*?
* Who had the hardest playoff run ever?
All of these results depend on there being some amount of variation in opponents team strength (measured via elo). This is why it is important to first look at this variation.
## Results
```{r ridges, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
seasonData <- tempElo %>% filter(is.na(playoff)) %>% group_by(season, team1) %>%
summarise_at(vars(elo1_pre, elo2_pre), mean) %>% ungroup() %>% left_join(confDat, by = "team1") %>%
mutate(semidecade = season - season %% 5)
seasonData %>% filter(season >= 1980) %>% ggplot( aes(x = elo2_pre, y = factor(semidecade), fill = ..x..)) +
geom_density_ridges_gradient(scale = 0.95) +
scale_fill_viridis(name = "", option = "C")+
xlab("Average Opponent Elo") + ylab("Seasons starting from ...") + theme_tufte(base_size = 15)+facet_wrap(~conference)+
theme(legend.position="none")
ggplot( seasonData,aes(x=elo2_pre, y=conference, fill=factor(..quantile..))) +
stat_density_ridges(scale = 0.95,
geom = "density_ridges_gradient", calc_ecdf = TRUE,
quantiles = 4, quantile_lines = TRUE
) +
scale_fill_viridis(discrete = TRUE, name = "Quartiles")+theme_tufte(base_size = 15)+xlab("Average Opponent Elo")+
ylab("Conference")
```
### Variation in Opponents Elo
The variation in Opponents Elo is a lot smaller than the general Team Elo (i.e. the strength of teams and not their opponents).
That is to be expected as the strength of a team can only vary so much through a season. The highest amount of variation occurs in the 1999 season. As far as I can tell the reason is the strength of the westen conference.
```{r sd, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
tempElo %>% filter(is.na(playoff)) %>% group_by(season, team1) %>% summarise_at(vars(elo1_pre,
elo2_pre), mean) %>% ungroup() %>% group_by(season) %>% summarise_at(vars(elo1_pre, elo2_pre), sd,
na.rm = TRUE) %>% ungroup() %>% ggplot(aes(x = season, y = elo2_pre)) + geom_point() + geom_line() +
theme_minimal() + geom_rangeframe(sides = "l")+xlab("Season")+ylab("Variation in season-wise opponents Elo")
tempElo %>% filter(is.na(playoff)) %>% group_by(season, team1) %>% summarise_at(vars(elo1_pre,
elo2_pre), mean) %>% ungroup() %>% group_by(season) %>% summarise_at(vars(elo1_pre, elo2_pre), sd,
na.rm = TRUE) %>% ungroup() %>% ggplot(aes(x = season, y = elo1_pre)) + geom_point() + geom_line() +
theme_minimal() + geom_rangeframe(sides = "l")+xlab("Season")+ylab("Variation in season-wise Team Elo")
```
### Intra-Season Variation in Elo and oppoElo
Let us now look on the highest deviations in terms of Elo in the past NBA throught a single season.
We see that the 2011 Cavs had the highest difference in their own team Elo. This is due to Lebron James leaving during the offseason of 2010 and subsequent skill loss. The same thing happened to Chicago in 1999 after Michael Jordans second retirement.
The opposite happened to the 2008 Boston Celtics: Pierce, Allen & Garnett got them the championsip.
Summing up: Big changes in a teams Elo through a season are mainly due to player movements.
```{r delta1, echo=FALSE, message=FALSE, warning=FALSE, paged.print=TRUE}
eloDelta <- tempElo %>% group_by(season, team1) %>%
summarise(oppoMax = max(elo2_pre), oppoMin = min(elo2_pre), ownMax = max(elo1_pre), ownMin = min(elo1_pre)) %>%
mutate(oppoDiff = oppoMax - oppoMin, ownDiff = ownMax - ownMin) %>% ungroup()
eloDelta %>% select(team1, season, ownDiff, ownMax, ownMin)%>% arrange(-ownDiff) %>% top_n(10, ownDiff) %>%
kable("latex", digits=0, col.names = c("Team", "Season", "EloDiff", "Highest Elo", "Lowest Elo"), booktabs = T) %>%
kable_styling(latex_options = "striped") %>%
row_spec(1:3, color ="red") %>% as_image(width = 4, file = "README_files/figure-gfm/teamdelta.png")
```
Opponents Elo is not as informative:
Portland faced both the Warriors at their Elo apex and the pre-processed 76ers in 2016. Most of the other outliers are due to either GSW or Michael Jordan.
```{r delta2, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
eloDelta %>% select(team1, season, oppoDiff, oppoMax, oppoMin)%>% arrange(-oppoDiff) %>% top_n(10, oppoDiff) %>%
kable("latex", digits=0, col.names = c("Team", "Season", "EloDiff", "Highest Oppo Elo", "Lowest Oppo Elo"), booktabs = TRUE) %>%
kable_styling(latex_options = "striped") %>% row_spec(1, color ="red") %>% as_image(width = 4, file = "README_files/figure-gfm/oppodelta.png")
```
### Who had the hardest season/playoff run?
```{r hardestseason, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
kDat <- tempElo %>% filter(is.na(playoff)) %>% group_by(season, team1) %>%
summarise_at(vars(elo1_pre, elo2_pre, playoffteam), mean) %>% ungroup() %>% left_join(confDat, by = "team1") %>%
group_by(season) %>% mutate(elo1_z = as.numeric(scale(elo1_pre)), elo2_z = as.numeric(scale(elo2_pre))) %>% ungroup() %>%
arrange(-elo2_pre) %>% mutate(playoffteam = factor(playoffteam, levels = 0:1, labels = c("No", "Yes"))) %>% top_n(10, elo2_pre) %>%
select(-elo1_pre, -elo2_z)
kDat %>%
kable("latex", digits = 2, col.names = c("Season", "Team", "Avg Oppo Elo", "Reached Playoffs?", "Conference", "Teams relative strength"), booktabs = TRUE, caption = "Top 10 hardest NBA seasons") %>%
kable_styling(latex_options = "striped", font_size = 5) %>% row_spec(which(kDat$elo1_z < 0), bold = F, color = "white", background = "red") %>%
footnote(general = "Red rows signify teams that were weaker than average in the respective season") %>%
as_image(width = 4, file = "README_files/figure-gfm/hardseason.png")
kDat <- tempElo %>% filter(is.na(playoff)) %>% group_by(season, team1) %>%
summarise_at(vars(elo1_pre, elo2_pre, playoffteam), mean) %>% ungroup() %>% left_join(confDat, by = "team1") %>%
group_by(season) %>% mutate(elo1_z = as.numeric(scale(elo1_pre)), elo2_z = as.numeric(scale(elo2_pre))) %>% ungroup() %>%
arrange(elo2_pre) %>% mutate(playoffteam = factor(playoffteam, levels = 0:1, labels = c("No", "Yes"))) %>% top_n(10, -elo2_pre) %>%
select(-elo1_pre, -elo2_z)
kDat %>%
kable("latex", digits = 2, col.names = c("Season", "Team", "Avg Oppo Elo", "Reached Playoffs?", "Conference", "Teams relative strength"), booktabs = TRUE, caption = "Top 10 easiest NBA seasons") %>%
kable_styling(latex_options = "striped", font_size = 5) %>% row_spec(which(kDat$elo1_z < 0), bold = F, color = "white", background = "red") %>%
footnote(general = "Red rows signify teams that were weaker than average in the respective season") %>%
as_image(width = 4, file = "README_files/figure-gfm/easyseason.png")
```
The same for playoff runs
```{r hardestplayoff, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
zDat <- tempElo %>% filter(is.na(playoff)) %>% group_by(season, team1) %>%
summarise_at(vars(elo1_pre, elo2_pre, playoffteam), mean) %>% ungroup() %>%
group_by(season) %>% mutate(elo1_z = as.numeric(scale(elo1_pre)), elo2_z = as.numeric(scale(elo2_pre))) %>% ungroup() %>%
select(-playoffteam, -elo2_pre , -elo1_pre)
yo <- tempElo %>% filter(!is.na(playoff)) %>% add_count(season, team1, name = "playoffgames") %>% group_by(season, team1) %>%
group_by(season, team1) %>% mutate(count = n_distinct(team2)) %>% ungroup()
tempElo %>% filter(!is.na(playoff)) %>% add_count(season, team1, name = "playoffgames") %>% group_by(season, team1) %>%
summarise_at(vars(elo1_pre, elo2_pre, playoffgames), mean) %>% ungroup() %>% left_join(confDat, by = "team1") %>%
arrange(-elo2_pre) %>% top_n(10, elo2_pre) %>% left_join(zDat, by = c("season", "team1")) %>%
select( -elo1_pre, -elo2_z) %>%
kable("latex", digits = 2, col.names = c("Season", "Team", "Avg Oppo Elo", "#Games", "Conference", "Team strength during season"), booktabs = TRUE,
caption = "Top 10 hardest playoff runs") %>%
kable_styling(latex_options = "striped", font_size = 5) %>%
as_image(width = 4, file = "README_files/figure-gfm/hardplayoff.png")
tempElo %>% filter(!is.na(playoff)) %>% add_count(season, team1, name = "playoffgames") %>% group_by(season, team1) %>%
summarise_at(vars(elo1_pre, elo2_pre, playoffgames), mean) %>% ungroup() %>% left_join(confDat, by = "team1") %>%
arrange(elo2_pre) %>% top_n(10, -elo2_pre) %>% left_join(zDat, by = c("season", "team1")) %>%
select( -elo1_pre, -elo2_z) %>%
kable("latex", digits = 2, col.names = c("Season", "Team", "Avg Oppo Elo", "#Games", "Conference", "Team strength during season"), booktabs = TRUE,
caption = "Top 10 easiest playoff runs") %>%
kable_styling(latex_options = "striped", font_size = 5) %>%
as_image(width = 4, file = "README_files/figure-gfm/easyplayoff.png")
```
### Increased variation in seasonal performance
There seems to be an increase in seasonal variation per team elo.
```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
eloDelta %>% group_by(season) %>% summarise_at(vars(ownDiff),mean)%>%ggplot(aes(x=season,y=ownDiff))+geom_point()+geom_line()+theme_minimal()+geom_smooth(method = "lm", se = FALSE)+xlab("Season")+ylab("Average Range Difference in Team Elo (Max-Min)")
eloDelta %>% group_by(season) %>% summarise_at(vars(oppoDiff),mean)%>%ggplot(aes(x=season,y=oppoDiff))+geom_point()+geom_line()+theme_minimal()+geom_smooth(method = "lm", se = FALSE)+xlab("Season")+ylab("Average Range Difference in oppo Team Elo (Max-Min)")
```
## Data
* All games since 1977 season (NBA/ABA Merger)
* All taken from 538s Elo database. [link](https://projects.fivethirtyeight.com/nba-model/nba_elo.csv)
## Update Repo Oct 2020
* [x] Make better menu for choice of Raptor/Elo/Carmelo
* [x] Make better menu for choice of season/playoff/both
* [x] Subset data when Carmelo/Raptor is chosen
* [x] Make legend much bigger
* [x] Make other text bigger
* [x] Make axis labels with "strength"
* [ ] Write more explaining text?
* [x] Add colour for won finals
* [x] Update data but dont include 2020 season
* [x] Add Champion tag to table
## To-Do
* [x] Upload shiny to shinyapps.io
* [x] Add more seasons (by solving conferences, use elo website ) (no)
* [x] Add carmElo for the recent seasons (via button)
* [x] Make the graphic bigger in browser
* [x] Look into the max/min delta a team can undergo in a season
* [x] Visualize the variation in oppoElo and teamElo
* [x] Make multiple seasons selectable
* [x] Replace team names if it's only name change (via button)
* [ ] How similar are the different Elo scores (logreg and visual)
* [ ] Brush up text.
* [x] Save table info so it shows up in github
* [x] carmelo only when looking at single year? (no)
* [x] Split by season --> who had the hardest season/playoff
* [x] Add table for searchable hardest/easiest season/playoff run with nice kable-highlights
* [x] Table: Make oppoElo column fat
* [x] Table: filters for double variables to integer or round
* [x] Plot: Make labels bigger and change colour scheme to 538
* [x] Think about doubling and possible problems
* [x] Ridge plot for seasons West/East and over decades
* [x] Add teams till 1970. Problem with ABA and conferences