I came across this soccer dataset and decided to explore it using R.
We begin by loading the data.
library(DBI)
library(RSQLite)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Database connection.
con <- dbConnect(SQLite(),"database.sqlite")
# Names of tables.
names <- as.data.frame(dbListTables(con))
# Tables.
player <- dbReadTable(con,'Player')
playerAttr <- dbReadTable(con,'Player_Attributes')
# Disconnecting from the database.
dbDisconnect(con)
Let’s arrange the players by rating and find out who were the 50 players with the highest FIFA ratings during this period.
# Returns the name of player p.
getName <- function(p) {filter(player,p$player_api_id==player_api_id)$player_name}
# Top 50 distinct player attributes ordered by FIFA ratings.
top50 <- distinct(na.omit(arrange(playerAttr,desc(overall_rating))),player_api_id,.keep_all=TRUE)[1:50,]
# Show the names of the 50 players with the highest rankings in descending order.
sapply(c(1:nrow(top50)), function(i) paste(i,"-",getName(top50[i,]),"\n")) %>% message
## 1 - Lionel Messi
## 2 - Cristiano Ronaldo
## 3 - Gianluigi Buffon
## 4 - Wayne Rooney
## 5 - Gregory Coupet
## 6 - Xavi Hernandez
## 7 - Alessandro Nesta
## 8 - Andres Iniesta
## 9 - Iker Casillas
## 10 - John Terry
## 11 - Ronaldinho
## 12 - Thierry Henry
## 13 - Arjen Robben
## 14 - David Trezeguet
## 15 - Francesco Totti
## 16 - Franck Ribery
## 17 - Frank Lampard
## 18 - Kaka
## 19 - Luis Suarez
## 20 - Manuel Neuer
## 21 - Neymar
## 22 - Radamel Falcao
## 23 - Ze Roberto
## 24 - Zlatan Ibrahimovic
## 25 - Adriano
## 26 - Carles Puyol
## 27 - Cesc Fabregas
## 28 - Cris
## 29 - David Villa
## 30 - Eden Hazard
## 31 - Julio Cesar
## 32 - Luca Toni
## 33 - Lucio
## 34 - Nemanja Vidic
## 35 - Petr Cech
## 36 - Robin van Persie
## 37 - Samuel Eto'o
## 38 - Steven Gerrard
## 39 - Andrea Pirlo
## 40 - Bastian Schweinsteiger
## 41 - Carlos Tevez
## 42 - David Silva
## 43 - Dida
## 44 - Didier Drogba
## 45 - Diego
## 46 - Fernando Torres
## 47 - Gerard Pique
## 48 - Hernan Crespo
## 49 - Jamie Carragher
## 50 - Joaquin
As expected, Messi and Ronaldo occupy positions 1 and 2.
Which of these players are over 190 centimeters tall?
# Show players over 190 cm. tall.
filter(player, player_api_id%in%(top50$player_api_id) & height>190)[,-c(1,4,5)]
## player_api_id player_name height weight
## 1 30728 David Trezeguet 190.50 176
## 2 30720 Dida 195.58 187
## 3 37482 Gerard Pique 193.04 187
## 4 30717 Gianluigi Buffon 193.04 201
## 5 30709 Luca Toni 193.04 194
## 6 27299 Manuel Neuer 193.04 203
## 7 30865 Nemanja Vidic 190.50 194
## 8 30859 Petr Cech 195.58 198
## 9 35724 Zlatan Ibrahimovic 195.58 209
Which of these players weigh under 160 pounds?
# Show players under 160 pounds.
filter(player, player_api_id%in%(top50$player_api_id) & weight<160)[,-c(1,4,5)]
## player_api_id player_name height weight
## 1 30731 Andrea Pirlo 177.80 150
## 2 30955 Andres Iniesta 170.18 150
## 3 38817 Carlos Tevez 172.72 157
## 4 37459 David Silva 170.18 148
## 5 30909 David Villa 175.26 152
## 6 30924 Franck Ribery 170.18 159
## 7 30981 Lionel Messi 170.18 159
## 8 19533 Neymar 175.26 150
## 9 22543 Radamel Falcao 177.80 159
## 10 30843 Robin van Persie 187.96 157
## 11 39854 Xavi Hernandez 170.18 148
## 12 38843 Ze Roberto 172.72 159
Surprisingly, Robin van Persie is among these players.
What is the average height and weight of the top 50 highest ranked players?
# Show mean height and weight.
summarise(filter(player,player_api_id%in%(top50$player_api_id)),mean(height),mean(weight))
## mean(height) mean(weight)
## 1 182.4228 175.02
What is the preferred foot ratio?
# Show number of left and right footed players.
count(top50,preferred_foot)
## # A tibble: 2 x 2
## preferred_foot n
## <chr> <int>
## 1 left 10
## 2 right 40
Everything looks reasonable so far, but to further evaluate the quality of the data lets test the idea that similar players play at similar positions by grouping these players into four clusters (representing the four main positions in the game: goalkeeper, defender, midfielder, and striker).
# K-means object.
clusters <- kmeans(top50[,10:42],centers=4,iter.max=20,nstart=100)
# Returns the names of the players of the nth cluster.
clusterNames <- function(n) {
clust <- as.data.frame(split(top50,clusters$cluster)[n])
names(clust) <- names(top50)
sapply(c(1:nrow(clust)), function(i) getName(clust[i,]))
}
# Matrix of names of players per cluster.
playersPerCluster <- sapply(c(1:length(clusters$size)),clusterNames)
# Show names of players per cluster.
sapply(c(1:length(clusters$size)), function(i) (playersPerCluster[i]))
## [[1]]
## [1] "Gianluigi Buffon" "Gregory Coupet" "Iker Casillas" "Manuel Neuer"
## [5] "Julio Cesar" "Petr Cech" "Dida"
##
## [[2]]
## [1] "Wayne Rooney" "Xavi Hernandez" "Andres Iniesta"
## [4] "Ronaldinho" "Thierry Henry" "Francesco Totti"
## [7] "Frank Lampard" "Kaka" "Ze Roberto"
## [10] "Cesc Fabregas" "David Villa" "Samuel Eto'o"
## [13] "Steven Gerrard" "Andrea Pirlo" "Bastian Schweinsteiger"
## [16] "Diego" "Joaquin"
##
## [[3]]
## [1] "Alessandro Nesta" "John Terry" "Carles Puyol" "Cris"
## [5] "Lucio" "Nemanja Vidic" "Gerard Pique" "Jamie Carragher"
##
## [[4]]
## [1] "Lionel Messi" "Cristiano Ronaldo" "Arjen Robben"
## [4] "David Trezeguet" "Franck Ribery" "Luis Suarez"
## [7] "Neymar" "Radamel Falcao" "Zlatan Ibrahimovic"
## [10] "Adriano" "Eden Hazard" "Luca Toni"
## [13] "Robin van Persie" "Carlos Tevez" "David Silva"
## [16] "Didier Drogba" "Fernando Torres" "Hernan Crespo"
We can see that two of these clusters are composed exclusively by goalkeepers and defenders, while the other two also seem to distinguish between midfielders and strikers.