-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindex.Rmd
306 lines (206 loc) · 10.4 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
---
title: "EURO2016"
author: Tuomo Nieminen
date: "`r Sys.Date()`"
output:
html_document:
theme: cosmo
df_print: paged
code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = "")
```
![](em_quarters.png)
# Welcome
This GitHub page describes predicting the 2016 european soccer championships matches using ordinal regression. The GitHub repository related to this page is [here](https://github.com/TuomoNieminen/EURO2016).
Articles (in finnish) describing the analysis and the predictions for the EURO 2016 finals are available in tyyppiarvo.com:
- [Group stage prediction article](http://tyyppiarvo.com/2016/06/jalkapallossa-kannattaa-laukoa-ennustimme-em-turnauksen-alkuvaiheen-ottelut-ja-jatkoonmenijat/)
- [Round of sixteen prediction article](http://tyyppiarvo.com/2016/06/tyyppiarvo-tutki-32768-mahdollista-skenaariota-belgia-saksa-todennakoisin-em-finaali/)
- [Quarterfinal prediction article](http://tyyppiarvo.com/2016/06/paivitetty-em-futisennuste-islanti-kuusinkertaisti-mahdollisuutensa-euroopan-mestariksi/)
Note that all articles are in Finnish.
# 2008 & 2012 qualifier and final tournament data
I combined the statistics from the 2008 and 2012 qualifiers ("homeq" and "awayq" features) with the results of the EURO 2008 and 2012 final tournaments and studied how the qualifier statistics could be used to predict match outcomes.
```{r}
emq <- get(load("data/EUROq.Rda"))
emq <- emq[complete.cases(emq),]
emq
```
## Feature engineering
Since only the relative strength of two teams can be though to predict the outcome of a match, I studied the ratios and differences of the qualifier statistics as predictors for the final tournament match goal differences and -outcomes.
```{r}
# ratios and differences of team stats
homevars <- substr(names(emq),1,5)=="homeq"
awayvars <- substr(names(emq),1,5)=="awayq"
#fifa
homevars[7] <- T
awayvars[8] <- T
ratios <- emq[,homevars] / emq[,awayvars]
diffs <- emq[,homevars] - emq[,awayvars]
names(ratios) <- gsub("home", "", names(ratios))
names(diffs) <- names(ratios)
```
## Correlations of qualifier stats and final tournament goal difference
I studied the connections of hometeam vs. awayteam qualifier statistics ratios and differences with the final tournament match goal differences, using scatter plots and correlations. I did this to choose which statistics and which version (ratio or difference) I should use as predictors for the match outcomes.
```{r}
#correlations
corrs <- rbind(cor_ratios=apply(ratios, 2, cor, emq$goal_diff),
cor_diffs=apply(diffs, 2, cor, emq$goal_diff))
t(corrs)
```
*Correlations of home vs awayteam qualifier statistic ratios and differences versus final tournament match goal difference. The strongest linear connections are given by the ratios, except with corners were the difference in qualifier corner kicks has the strongest connection to final tournament match goal difference.*
<br>
## Ratios of qualifier stats versus tournament match goal differences
```{r, fig.cap ="Linear connections between ratios of qualifier statistics and tournament match goal differences. All the statistics related to shot attempts have positive linear connections with the match goal difference."}
library(reshape2)
library(ggplot2)
df <- melt(cbind(goaldiff = emq$goal_diff, ratios), id.vars = "goaldiff")
ggplot(df, aes(value, goaldiff)) + geom_point() +geom_smooth(method = "lm") + facet_wrap("variable", scales = "free") + ggtitle("Qualifier statistic ratios vs. tournament goal differences, EURO 2008 & 2012")
```
# Feature selection with regularized regression
To study the explanatory power of the qualifier statistics more closely, I fit an ordinal regression model with the final tournament match outcome as the target variable and a selection of promising features (those correlating with the match goal difference) as the explanatory variables. I used lasso regularisation to study which features best explain the final tournament match outcome.
```{r, message = FALSE}
features <- cbind(ratios[,1:4], diffs[,8])
colnames(features) <- c("fifa_r","ontarget_r","offtarget_r", "shot_r","corners_d")
outcome <- emq$outcome
library(glmnetcr)
netfit <- glmnet.cr(features, outcome, maxit=500)
```
## Bayesian information criteria and coefficient path (1)
```{r, fig.cap = "Akaike information criteria (AIC) and the value of coefficients by the amount of regularization used (step) in ordinal regression, where final tournament match outcomes are predicted by ratios or differences in qualifier match statistics. The best solution is the one where AIC is lowest. Only the offtarget shot ratio deviates from zero at this point."}
par(mfrow = c(2,1), mar=c(4,4,2,10), cex.axis=0.7)
plot(netfit, xvar="step", type="aic")
plot(netfit, xvar = "step", type = "coefficients")
```
<br>
I decided it didn't make sense that somewhow offtarget shots were a better predictor than ontarget shots and total shots. So I excluded the off and ontarget shots and used only the ratio of the total shots.
```{r, message = FALSE}
features2 <- cbind(ratios[,c(1,4)], diffs[,8])
colnames(features2) <- c("fifa_r","shot_r","corners_d")
netfit2 <- glmnet.cr(features2, outcome, maxit=500)
```
## Bayesian information criteria and coefficient path (2)
```{r, fig.cap = "Akaike information criteria (AIC) and the value of coefficients by the amount of regularization used in ordinal regression, where final tournament match outcomes are predicted by ratios or differences in qualifier match statistics. The best solution is the one where AIC is lowest. Only the shot ratio deviates from zero at this point."}
par(mfrow = c(2,1), mar=c(4,4,2,10), cex.axis=0.7)
plot(netfit2, xvar="step", type="aic")
plot(netfit2, xvar = "step", type = "coefficients")
```
<br>
## Qualifier shot ratio versus final tournament match goal difference
```{r}
# a scatter plot of shots ratio and goal difference
emq$shot_ratio <- ratios[,4]
plot(emq$shot_ratio, emq$goal_diff,
xlab="SR",
ylab="match goal difference",
main="Qualifier shot ratio (SR)
versus tournament match goal difference
EM 2008 & 2012",
col="green4", pch=20)
abline(lm(goal_diff~shot_ratio, data = emq), col="darksalmon")
legend("bottomright",
legend=paste("correlation:",round(cor(emq$shot_ratio, emq$goal_diff),2)),
bty="n")
```
<br>
## Ordinal regression with shot ratio
I an ordinal regression models using only the qualifiers shot ratio as explanatory variable. The table shows the prediction made by the model and the data used to train the model (EURO 2008 % 2012 tournament matches).
```{r}
library(MASS)
fit1 <- polr(outcome ~ shot_ratio, Hess=T, data = emq)
cbind.data.frame(emq[,c("hometeam","awayteam","outcome")], predict(fit1, type="probs"))
```
# Uefa ratings
I also studied the pre-tournament uefa rating as a predictor to the match outcomes. The definition of the uefa rating changed after 2008 so only the data from 2012 could be used in the analysis.
## Correlation of pre-tournament uefa ratio and tournament match goal difference
```{r}
emu <- read.csv2("data/em2012_uefa.csv", stringsAsFactors = F)
emu$uefa_ratio <- emu$uefa_koti / emu$uefa_vieras
emu$goal_diff <- emu$maalit_koti - emu$maalit_vieras
with(emu,cor.test(uefa_ratio, goal_diff))
```
*The correlation between ratios of pre-tournament uefa rating and tournament match goal difference is positive and statistically significant.*
<br>
## Scatter plot of uefa ratio versus match goals
```{r}
with(emu,{
plot(uefa_ratio,goal_diff,
xlab="UCR",
ylab="match goal difference",
main="Uefa coefficient ratio (UCR)
versus tournament match goal difference \n EM 2012.",
col="green4", pch=20)
abline(lm(goal_diff~uefa_ratio), col="darksalmon")
legend("bottomright",
legend=paste("correlation:",round(cor(uefa_ratio, goal_diff),2)),
bty="n")}
)
```
<br>
## Ordinal regression with uefa ratio
Pre-tournament uefa ratio correlated reasonably with the final tournament match goals so I fit a second ordinal regression model using the uefa ratio as a predictor. The table shows the predictions made by the model and the data used to train the model (EURO 2012 tournament matches).
```{r}
# model based on 2012 uefa ratings
# define the outcome
emu$outcome <- 1
emu$outcome[emu$goal_diff > 0] <- 3
emu$outcome[emu$goal_diff < 0 ] <- 0
emu$outcome <- factor(emu$outcome,ordered = T, levels=c(0,1,3),labels=c("loss","draw","win"))
fit2 <- polr(outcome ~ uefa_ratio, Hess=T, data=emu)
cbind.data.frame(emu[,c("kotijoukkue","vierasjoukkue","outcome")], predict(fit2, type="probs"))
```
<br>
# The 2016 matches and predictions
I averaged the predictions given by the two ordinal regression models to predict the matches of the 2016 tournament, using the ratio of qualifier shots and uefa ratings as the predicting features.
```{r}
# load 2016 matches
matches2016 <- get(load("data/matches2016.Rda"))
# use mass library to fit ordinal regression models
library(MASS)
# use the models to predict match outcome probabilities for 2016
probs1 <- predict(fit1, newdata=matches2016, type="probs")
probs2 <- predict(fit2, newdata=matches2016, type="probs")
# combine the predictions with the match data
predictions2016 <- 0.5*(probs1+probs2)[,3:1]
alkulohko <- cbind(matches2016, predictions2016)
# exclude the explanatory variables
alkulohko2016 <- alkulohko[,c("date","group","hometeam","awayteam","win","draw","loss", "homeuefa", "awayuefa", "uefa_ratio", "homeavrg_shots", "awayavrg_shots", "shot_ratio")]
alkulohko2016
```
<br>
## Expected points per match
```{r}
# calculate expected points per match
getEV <- function(p) {sum(c(3,1,0)*p)}
homePoints <- apply(predictions2016, 1, getEV)
awayPoints <- apply(predictions2016[,3:1], 1, getEV)
alkulohko2016 <- cbind(alkulohko2016[, 1:7], homePoints, awayPoints)
# rounding
alkulohko2016[,5:9] <- round(alkulohko2016[,5:9], 2)
alkulohko2016
```
<br>
## Expected points by group
```{r}
# calculate expected points by group
# function
get_total_points <- function(data, team) {
homedata <- data[data$hometeam==team,]
awaydata <- data[data$awayteam==team,]
homeEV <- sum(homedata$homePoints)
awayEV <- sum(awaydata$awayPoints)
return(homeEV+awayEV)
}
group <- alkulohko2016$group
# group predictions
by(alkulohko2016,group,FUN=function(d) {
teams <- unique(c(d$hometeam,d$awayteam))
points <- t(sort(sapply(teams,get_total_points,data=d),decreasing = T))
rownames(points) <- "points"
points
})
```
<br>
<hr>
Tuomo Nieminen 2017
<br>