---
title: "baseball_in_r"
author: "Keith Karani"
format: html
editor: visual
---
## Introduction
The Lahman package contains a database of pitching, hitting, and fielding statistics from Major League Baseball from 1871 to 2022. It covers the two current leagues (American and National), four other major leagues (American Association, Union Association, Players League, Federal League), and the National Association of 1871 to 1875.
### Data Dictionary
The data comprise the following main tables:
1. People - player names, dates of birth and death, and other biographical information
2. Batting - batting statistics
3. Pitching - pitching statistics
4. Fielding - fielding statistics
A collection of other tables is also provided:
Teams:

| Table | Description |
|------:|:------------|
| `Teams` | yearly stats and standings |
| `TeamsHalf` | split season data for teams |
| `TeamsFranchises` | franchise information |

Post-season play:

| Table | Description |
|------:|:------------|
| `BattingPost` | post-season batting statistics |
| `PitchingPost` | post-season pitching statistics |
| `FieldingPost` | post-season fielding data |
| `SeriesPost` | post-season series information |

Awards:

| Table | Description |
|------:|:------------|
| `AwardsManagers` | awards won by managers |
| `AwardsPlayers` | awards won by players |
| `AwardsShareManagers` | award voting for manager awards |
| `AwardsSharePlayers` | award voting for player awards |

Hall of Fame: links to People via `hofID`

| Table | Description |
|------:|:------------|
| `HallOfFame` | Hall of Fame voting data |

Information in the different tables relating to a player is tagged with his `playerID` and linked to names and birthdates in the People table.
Other tables:

- `AllstarFull` - All-Star game appearances
- `Managers` - managerial statistics
- `FieldingOF` - outfield position data
- `ManagersHalf` - split season data for managers
- `Salaries` - player salary data
- `Appearances` - data on player appearances
- `Schools` - information on schools players attended
- `CollegePlaying` - information on schools players attended, by player and year

Variable label tables are provided for some of the tables:
`battingLabels`, `pitchingLabels`, `fieldingLabels`
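
As a hedged illustration of the `playerID` linkage described above, the sketch below attaches player names from `People` to 2022 batting lines from `Batting`. The choice of season and of the `HR` column is an assumption made only for this example, and the object names `player_names` and `batting_named` are hypothetical.

```{r}
library(Lahman)
library(dplyr)

# Keep only the identifying columns from People
player_names <- People |>
  select(playerID, nameFirst, nameLast)

# Attach names to the 2022 batting lines through the shared playerID key
# (season and columns are illustrative assumptions)
batting_named <- Batting |>
  filter(yearID == 2022) |>
  left_join(player_names, by = "playerID") |>
  select(playerID, nameFirst, nameLast, teamID, HR) |>
  arrange(desc(HR))

head(batting_named)
```

The same pattern should carry over to `Pitching` and `Fielding`, since those tables are keyed by `playerID` as well.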
::: {.callout-note appearance="simple"}
### Source
Lahman, S. (2023) Lahman's Baseball Database, 1871-2022, Main page, <https://www.seanlahman.com/baseball-archive/statistics/>
:::
#### Load packages to use
```{r}
library(Lahman)
library(tidyr)
library(dplyr)
library(ggplot2)
library(readr)
library(caret)
```
Display the `Teams` table from the Lahman package. Its help page (`?Teams`) documents each column and serves as the data dictionary.
```{r}
Teams
#View(Teams)
teams <- Teams
head(teams)
```
Let's conduct some exploratory data analysis.
Baseball games are won by scoring more runs than the opponent, so our first exploration computes each team's average runs per game in every Major League Baseball season.
```{r}
teams_runs <- teams |>
  mutate(runs_game = R / (W + L))
head(teams_runs)
```
Next, we average runs per game across all teams within each season.
```{r}
runs_per_yr <- teams_runs |>
  group_by(yearID) |>
  summarize(mean_runs = mean(runs_game, na.rm = TRUE))
head(runs_per_yr)
# Graph this summary to observe how scoring changes over time
ggplot(runs_per_yr, aes(x = yearID, y = mean_runs)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average MLB Runs per Game by Year",
    x = "Year",
    y = "Mean runs per game",
    caption = "Source: https://www.seanlahman.com/baseball-archive/statistics/"
  ) +
  theme_minimal()
```
Which team scored the most runs in 2022?
```{r}
runs_teams <- Teams |>
  filter(yearID == 2022) |>
  select(name, R)
#head(runs_teams)
# Arrange the runs in descending order to see which team scored the most
arrange(runs_teams, desc(R))
# Plot, ordering the bars by runs scored
ggplot(runs_teams, aes(x = reorder(name, R), y = R)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Runs scored by each team",
    subtitle = "Year 2022",
    x = "Team",
    y = "Runs"
  ) +
  theme_minimal()
```
Which team hit the most home runs in 2022?
```{r}
homeruns <- Teams |>
  filter(yearID == 2022) |>
  select(name, HR)  # HR is home runs; H would be total hits
arrange(homeruns, desc(HR))
# Plot, ordering the bars by home runs
ggplot(homeruns, aes(x = reorder(name, HR), y = HR)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Home runs by each team",
    subtitle = "Year 2022",
    x = "Team",
    y = "Home runs"
  ) +
  theme_minimal()
```
How do different metrics compare across teams?
```{r}
# Restrict to AL and NL in the modern era (1901 onward)
teams <- Teams |>
  filter(yearID >= 1901 & lgID %in% c("AL", "NL")) |>
  drop_na() |>  # drops every team-season with a missing value in any column
  mutate(TB = H + X2B + 2 * X3B + 3 * HR,  # total bases
         WinPct = W / G,
         rpg = R / G,
         hrpg = HR / G,
         tbpg = TB / G,
         kpg = SO / G,
         k2bb = SO / BB,
         whip = 3 * (HA + BBA) / IPouts)  # WHIP uses hits and walks allowed
# ggplot by year for selected team stats
yrPlot <- function(yvar, label) {
  ggplot(teams, aes(x = yearID, y = .data[[yvar]])) +
    geom_point(size = 0.5) +
    geom_smooth(method = "loess") +
    labs(x = "Year", y = paste(label, "per game"))
}
# Example call: runs per game by year
yrPlot("rpg", "Runs")
```
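
The total-bases shortcut and the factor of 3 in the WHIP formula are easy to misread, so here is a small worked check. All numbers are made up for illustration and do not come from the Lahman data; the variable names are hypothetical.

```{r}
# Hypothetical season totals for one team (made-up numbers, not Lahman data)
H <- 1400; X2B <- 280; X3B <- 20; HR <- 200

# Definition: weight each hit by the number of bases it earns
singles <- H - X2B - X3B - HR
tb_definition <- singles + 2 * X2B + 3 * X3B + 4 * HR

# Shortcut: every hit counts once, extra-base hits add their extra bases
tb_shortcut <- H + X2B + 2 * X3B + 3 * HR
stopifnot(tb_definition == tb_shortcut)

# IPouts counts outs recorded, so innings pitched = IPouts / 3;
# dividing by innings is therefore the same as multiplying by 3 / IPouts
hits_allowed <- 1300; walks_allowed <- 450; IPouts <- 4350
whip_from_innings <- (hits_allowed + walks_allowed) / (IPouts / 3)
whip_from_outs <- 3 * (hits_allowed + walks_allowed) / IPouts
stopifnot(isTRUE(all.equal(whip_from_innings, whip_from_outs)))
```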
Plot of win percentage against run differential (R - RA)
```{r}
ggplot(teams, aes(x = R - RA, y = WinPct)) +
  geom_point(size = 0.75) +
  geom_smooth(method = "loess") +
  geom_hline(yintercept = 0.5, color = "red") +
  geom_vline(xintercept = 0, color = "orange") +
  labs(
    title = "Teams Win Percentage vs Run Differential",
    x = "Run differential",
    y = "Win percentage") +
  theme_minimal()
```
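
To put a number on the relationship in the plot, one option is a simple linear fit of win percentage on run differential. This is only a sketch: the restriction to seasons from 2000 onward and the names `rd_df` and `rd_fit` are assumptions made for the example, not part of the analysis above.

```{r}
library(Lahman)
library(dplyr)

# Team-seasons since 2000 (an illustrative cutoff), AL and NL only
rd_df <- Teams |>
  filter(yearID >= 2000, lgID %in% c("AL", "NL")) |>
  mutate(WinPct = W / G, RD = R - RA)

# Straight-line fit: the RD coefficient estimates the change in
# win percentage associated with one extra run of differential
rd_fit <- lm(WinPct ~ RD, data = rd_df)
coef(rd_fit)

# Pearson correlation between run differential and win percentage
cor(rd_df$RD, rd_df$WinPct)
```

A straight line is a rough approximation here. The loess curve in the plot is the better guide to whether the relationship bends at the extremes.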
Does fan attendance have an impact on game outcomes?
```{r}
teams |>
  filter(yearID >= 2000) |>
  ggplot(aes(x = WinPct, y = attendance / 1000)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "loess", se = FALSE) +
  facet_wrap(~ lgID) +
  labs(x = "Win percentage", y = "Attendance (1000s)")
```
Teams with over 4 million attendance in a season
```{r}
teams |>
  filter(attendance >= 4e6) |>
  select(yearID, lgID, teamID, Rank, attendance) |>
  arrange(desc(attendance))
head(teams)
# Season attendance over time, by league
ggplot(teams, aes(x = yearID, y = attendance / 1000)) +
  geom_point() +
  facet_wrap(~ lgID)
```
Average season home runs by park, post-2000
```{r}
teams %>%
  filter(yearID >= 2000) %>%
  group_by(park) %>%
  summarise(meanHRpg = mean((HR + HRA) / Ghome), nyears = n()) %>%
  filter(nyears >= 20) %>%
  arrange(desc(meanHRpg)) %>%
  head(10)
head(teams)
```
Of course, every baseball fan wants their team to win. Let's go ahead and create a model to predict wins by team.
```{r}
base_df <- teams |>
  drop_na() |>
  select(name, yearID, W, L, R, H, X2B, X3B, HR, SO, RA) |>
  filter(yearID >= 2009)
head(base_df)
```
Let's train a linear model on the variables selected above and find out which of them are statistically significant.
```{r}
lm1 <- lm(W ~ R + H + X2B + X3B + HR + SO + RA, data = base_df)
summary(lm1)
# Observation:
# triples (X3B), doubles (X2B), and strikeouts (SO) are not statistically significant
```
We can create another model with only the statistically significant variables and compare.
```{r}
lm2 <- lm(W ~ R + H + RA, data = base_df)
summary(lm2)
```
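
One hedged way to compare the two models formally is a nested-model F-test together with AIC. The sketch below rebuilds the data directly from `Teams` so that it runs on its own; the names `model_df`, `full_fit`, and `reduced_fit` are hypothetical, and the AL/NL filter is an assumption carried over from the earlier chunks.

```{r}
library(Lahman)
library(dplyr)

# Same predictors and season cutoff as the models above
model_df <- Teams |>
  filter(yearID >= 2009, lgID %in% c("AL", "NL")) |>
  select(W, R, H, X2B, X3B, HR, SO, RA) |>
  na.omit()

full_fit <- lm(W ~ R + H + X2B + X3B + HR + SO + RA, data = model_df)
reduced_fit <- lm(W ~ R + H + RA, data = model_df)

# F-test: do the dropped terms explain additional variance in wins?
anova(reduced_fit, full_fit)

# AIC: lower values indicate a better trade-off between fit and complexity
AIC(reduced_fit, full_fit)
```

A large p-value in the `anova()` output would suggest the smaller model loses little by dropping the extra terms.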
Using our second model, we can try predicting team wins.
```{r}
preds <- predict(lm2, base_df)
# Store the predicted values in a column to compare with the actual wins
base_df$pred <- preds
base_df
# Plot the results
base_df |>
  ggplot(aes(x = pred, y = W)) +
  geom_point() +
  geom_smooth() +
  labs(
    title = "Predicted wins against actual wins",
    x = "Predicted Wins",
    y = "Actual Wins"
  ) +
  theme_minimal()
```
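
The predictions above are made on the same rows the model was fitted to, so they say little about how the model would do on unseen seasons. A minimal hold-out sketch follows. The split year (2018) and the names `wins_df`, `train`, `test`, `fit`, and `rmse` are assumptions for illustration. Note also that the 2020 season was shortened to 60 games, which may affect the test-set error.

```{r}
library(Lahman)
library(dplyr)

wins_df <- Teams |>
  filter(yearID >= 2009, lgID %in% c("AL", "NL")) |>
  select(yearID, W, R, H, RA)

# Fit on earlier seasons, evaluate on later ones (split year is an assumption)
train <- filter(wins_df, yearID <= 2018)
test <- filter(wins_df, yearID > 2018)

fit <- lm(W ~ R + H + RA, data = train)
test$pred <- predict(fit, newdata = test)

# Root mean squared error of predicted wins on the held-out seasons
rmse <- sqrt(mean((test$W - test$pred)^2))
rmse
```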