-
Notifications
You must be signed in to change notification settings - Fork 0
/
Swing and Miss.Rmd
710 lines (590 loc) · 35 KB
/
Swing and Miss.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
---
title: "Curveball Swing and Miss"
output:
html_document:
df_print: paged
---
### Objective
Create a model to evaluate pitcher's ability to generate swing and miss(whiff) with curveballs.
### Range of the Dataset
2083 CBs thrown by the Braves' pitchers in 2018 and 2019 (not completely true data).
### Types of inputs
- Pitcher_Throws: The pitcher’s handedness
- Batter_Hits: The batter’s handedness
- release_speed: The pitch’s velocity (mph)
- x_movement: The pitch’s horizontal movement (inches)
- z_movement: The pitch’s vertical movement (inches)
- release_spin_rate: The pitch’s spin rate (rpm)
- spin_dir: The pitch’s spin axis (degrees)
- release_pos_x: The horizontal release point for that pitch (ft)
- release_pos_z: The vertical release point for that pitch (ft)
- release_extension: The release extension for that pitch (ft)
- plate_x: The horizontal location of the ball when it crosses home plate (ft)
- plate_z: The vertical location of the ball when it crosses home plate (ft)
There are also a few context features like innings, strikes, runners, outs in this dataset. However, in this project we will focus more on the above features.
### Steps
- Data Preprocessing
- Exploratory Data Analysis
- Initial Logistic Regression Models
- Model 1: if the CB will generate a whiff based on the features given
- Model 2: if the CB will generate a whiff given that the batter swings at the pitch based on the features given
- GAM Models
- GAM with Interaction between Smoothed and Factor Variables
- GAM with Interaction between Smoothed Variables
- Calibration Plot
- SMAA and Split Half Correlation
- Random Forest and XGBoost Models
- CB Swing and Miss True Talent
- Mixed Model
- Takeaways
- Limitations and Potential Future Steps
### Data Preprocessing
**Initializing Code**
```{r setup}
# Load packages
box::use(
dplyr[...],
ggplot2[...],
bbdbc[...],
readr[write_rds, read_rds],
tidyr[...],
thematic[thematic_on],
here[here],
mgcv[...],
modelr[add_predictions],
caret[...],
performance[...],
olsrr[...],
MASS[...],
ggrepel[...],
lme4[...]
)
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
**Load Data**
```{r filtering}
#Load the dataset
df_cb <- readRDS(here('df_cb.rds'))
```
```{r structure}
#Take a look at the structure of the dataset
str(df_cb)
```
**Data Type Conversion**
```{r type conversion}
#Rename the column ï..Pitcher_ID
df_cb <- df_cb %>%
rename(Pitcher_ID = ï..Pitcher_ID)
#Some data type conversion is needed
cols.factor <- c("Pitcher_ID","Pitcher_Throws", "Batter_Hits")
df_cb[cols.factor] <- sapply(df_cb[cols.factor],as.factor)
cols.num <- c("release_speed","x_movement", "z_movement", "release_spin_rate", "spin_dir", "release_pos_x", "release_pos_z", "release_extension", "plate_x", "plate_z")
df_cb[cols.num] <- sapply(df_cb[cols.num],as.numeric)
```
### EDA and Data Visualization
In Figure 1 and Figure 2, we will look at some distribution plots for the variables of interests.
```{r distribution plots}
#Distribution plots without splitting by the pitcher's handedness
plot_df <- df_cb %>%
pivot_longer(
cols = c(release_pos_z, plate_z, release_speed, release_spin_rate, z_movement, release_extension),
names_to = 'Variable',
values_to = 'value'
) %>% filter(!is.na(value))
ggplot(plot_df, aes(x = value, y = ..count..)) +
geom_histogram(bins = 30, color ='black', fill = 'lightblue') +
facet_wrap(vars(Variable), scales = 'free') +
ggtitle("Figure 1") +
theme_classic()
```
```{r distribution plots 2}
#Distribution plots splitting by the pitcher's handedness
plot_df2 <- df_cb %>%
pivot_longer(
cols = c(x_movement, plate_x, spin_dir, release_pos_x),
names_to = 'Variable',
values_to = 'value') %>%
filter(!is.na(value))
ggplot(plot_df2, aes(x = value, y = ..count..)) +
geom_histogram(bins = 30, fill = 'lightblue', color = 'black') +
facet_grid(cols = vars(Variable), rows = vars(Pitcher_Throws), scales = 'free') +
ggtitle("Figure 2") +
theme_classic()
```
The thing that stood out here is that standardizing horizontal release position(release_pos_x), spin direction(spin_dir) and horizontal movement(x_movement) for pitcher handedness(Pitcher_Throws) could be worth considering. However, since our dataset also includes batting side(Batter_Hits), I want to put both Pitcher_Throws and Batter_Hits into the logistic regression models later to take into account the platoon splits. Thus, I decided to not standardizing these three variables here. Keep in mind that the plate location plot in the whole project is from **pitcher's view**.
```{r scatterplot}
#Plate location for different pitch outcomes
ggplot(df_cb %>% filter(Pitch_Outcome %in% c('BallCalled', 'StrikeSwinging', 'StrikeCalled')),
aes(x= plate_x, y = plate_z)) +
geom_point(color ='black') +
facet_wrap(vars(Pitch_Outcome)) +
ggtitle("Figure 3: Plate Location from Pitcher's View") +
theme_classic()
```
From Figure 3 above, we can see that plate_z seems to be highly correlated with CB whiff rate. The whiffs were located mostly at the lower half of the strike zone or those lower and outside of the strike zone.
```{r pitch_outcome}
summary(as.factor(df_cb$Pitch_Outcome))
```
As from above, there are six types of pitch outcome in this dataset.
### Initial Logistic Regression Models
We will start with two logistic regression models with stepwise feature selection. The dependent variable of the first model will be a binary variable, the pitch outcome being StrikeSwinging(whiff) or the pitch outcome being one of BallCalled, FoulBall, InPlay or StrikeCalled. By this, we get to expect **if the pitch will generate a whiff based on the features given**. In Model 2, the dependent variable will also be a binary variable, the pitch outcome being StrikeSwinging(whiff) or the pitch outcome being one of Foul Ball or InPlay. In this second model, we get to expect **if the pitch will generate a whiff given that the batters swing at the pitch based on the features given**. Both models have their own intentions. Although traditionally swing and miss rate is defined as the pitches hitters swing and miss at divided by total pitches thrown by a pitcher(Model 2), here we will actually focus a bit more on Model 1 because Model 1 here has the higher accuracy as from below (0.89 in Model1 and 0.77 in Model2), and also intuitively, we could want our pitchers to throw a pitch that both gets the batter to swing and then miss.
**Logistic Regression Model 1**
```{r logistic regression model 1}
#Set the targeted outcome
df_train1 <- df_cb %>%
filter(Pitch_Outcome != 'HitByPitch') %>%
mutate(
SwingAndMiss = case_when(
Pitch_Outcome == 'StrikeSwinging' ~ 1,
T ~ 0)) %>%
drop_na()
#80/20 Train test split for test accuracy
set.seed(3456)
trainIndex <- createDataPartition(df_train1$SwingAndMiss, p = .8,
list = FALSE,
times = 1)
Train <- df_train1[ trainIndex,]
Test <- df_train1[-trainIndex,]
#Stepwise feature selection
full.model <- glm(SwingAndMiss ~ Pitcher_Throws + Batter_Hits +
release_speed + x_movement + z_movement +
release_spin_rate + spin_dir + release_pos_x + release_pos_z +
release_extension + plate_x + plate_z, data = Train, family = binomial)
step.model <- full.model %>% stepAIC(trace = FALSE)
#Summary of the model
summary(step.model)
#Test accuracy
Test$model_prob <- predict(step.model, Test, type = "response")
Test <- Test %>% mutate(model_pred = 1*(model_prob > .5) + 0)
Test <- Test %>% mutate(accurate = 1*(model_pred == SwingAndMiss))
print(paste0("Test Accuracy = ", sum(Test$accurate)/nrow(Test)))
```
- After stepwise feature selection, the variables that stood out were release_speed, x_movement, release_spin_rate, release_pos_z, plate_x and plate_z.
- The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable. For example, for a one unit increase in plate_z, the log odds of the pitch being a swinging strike decreases by 0.47.
- Vertical release position and vertical plate location both significantly, and negatively, affected the probability of a whiff. Just like what we saw from the exploratory data analysis, throwing at the lower part of the strike zone or even lower than the strike zone could generate more whiffs. And perhaps that can be done by adjusting the vertical release position.
- Release speed and spin rate can affect the probability of whiff positively, which is quite intuitive.
- From plate_x and x_movement, it seems like locating the CBs slightly toward left handed batter side could be more effective than toward right handed batter side.
**Logistic Regression Model 2**
```{r logistic regression model 2}
#Set the targeted outcome
df_train2 <- df_cb %>%
filter(Pitch_Outcome %in% c('StrikeSwinging', 'FoulBall', 'InPlay')) %>%
mutate(
SwingAndMiss = case_when(
Pitch_Outcome == 'StrikeSwinging' ~ 1,
T ~ 0)) %>%
drop_na()
#80/20 Train test split for test accuracy
set.seed(3456)
trainIndex <- createDataPartition(df_train2$SwingAndMiss, p = .8,
list = FALSE,
times = 1)
Train <- df_train2[ trainIndex,]
Test <- df_train2[-trainIndex,]
#Stepwise feature selection
full.model <- glm(SwingAndMiss ~ Pitcher_Throws + Batter_Hits +
release_speed + x_movement + z_movement +
release_spin_rate + spin_dir + release_pos_x + release_pos_z +
release_extension + plate_x + plate_z, data = Train, family = binomial)
step.model <- full.model %>% stepAIC(trace = FALSE)
#Summary of the model
summary(step.model)
#Test accuracy
Test$model_prob <- predict(step.model, Test, type = "response")
Test <- Test %>% mutate(model_pred = 1*(model_prob > .5) + 0)
Test <- Test %>% mutate(accurate = 1*(model_pred == SwingAndMiss))
print(paste0("Test Accuracy = ", sum(Test$accurate)/nrow(Test)))
```
In Model 2, although the accuracy was a bit lower(0.77) than in Model 1(0.89), it is still promising to find out that similar variables affected the probability of whiff significantly. Based on the test accuracy and as mentioned before, the fact that we could want our pitchers to throw a pitch that both gets the batter to swing and then miss, we will focus more on building different types of models to compare with Model 1 in the next steps.
### GAM Models
**GAM with Interaction Between Smoothed and Factor Variables**
The results of the baseline logistic regression models are okay, but there could still be non-linear relationships between the variables of interests and whiff rate. Some methods to tackle non-linear relationships include transforming linear terms (polynomial) and flexible modeling techniques like GAM (fitting smooth-ish data), Boosted Trees. Here I will focus on GAM. Since plate locations are the two main variables contributing to my logistic regression models, I'll focus on these two variables as well as examine interaction effects to determine whether the non-linear smoother s(plate_x) varies across different levels of Pitch and Bat Hand. There will be four levels(LL, LR, RL, RR) indicating the pitch and bat hand, respectively.
```{r gam initial model}
#GAM model
df_train1 <- df_train1 %>%
mutate(PitchAndBatSide = as.factor(paste0(Pitcher_Throws,',',Batter_Hits)))
gam_mod <- gam(SwingAndMiss ~ s(plate_x, by = PitchAndBatSide) + s(plate_z),
family = 'binomial',
data = df_train1)
summary(gam_mod)
```
This summary covers smooth terms. For smooths coefficients are not printed. This is because each smooth has several coefficients - one for each basis function. Instead, the first column reads edf, which stands for effective degrees of freedom. This value represents the complexity of the smooth. An edf of 1 is equivalent to a straight line. An edf of 2 is equivalent to a quadratic curve, and so on, with higher edfs describing more wiggly curves. Similar to what we saw from the previous models, the plate location (both horizontal and vertical) were significant in the model.
```{r gam plot}
plot(gam_mod)
```
Above are partial effect plots for the variables smooths. Similar to what we saw from the previous models, vertical location being down and/or outside of the strike zone could generate more swing and miss. As for horizontal location, after controlling for the pitch and bat handedness, it seems like throwing at the middle of the plate could be most effective, and slightly toward LHH is more effective than toward RHH.
Here's also a look at the summary of the fitted values of the model.
```{r gam model fitted values}
summary(gam_mod$fitted.values)
```
**Explore Validation**
To better understand the gam model with these variables, I came up with plots, as shown in Figure 4 below, showing the predicted probability of a whiff based on the plate location. For vertical location the outcome is basically the same. But for horizontal location, for left handed pitchers, in the top two plots, it seems like throwing to the outside of the batter would be more effective. As for the horizontal location for right handed pitchers, throwing to the left handed batter side could be more effective regardless of the actual batter handedness.
```{r pred_prob plot}
df_grid <- expand.grid(plate_z = seq(-2, 6, .1), plate_x = seq(-4,4,.1), PitchAndBatSide = c("L,L", "L,R", "R,L", "R,R")) %>%
add_predictions(gam_mod, var = 'pred_prob', type = 'response')
ggplot(df_grid, aes(x = plate_x, y = plate_z, fill = pred_prob, z = pred_prob)) +
geom_raster() +
geom_contour() +
facet_wrap(vars(PitchAndBatSide)) +
ggtitle("Figure 4") +
theme_classic()
```
**GAM with Interaction Between Smoothed Variables**
However, there's a decent chance that plate_x is related to plate_z in a way. Thus I'll create a smooth for the interaction of the two plate location variables, and then examine interaction effects to determine whether the non-linear smoother s(plate_x, plate_z) varies across different levels of Pitch and Bat Hand.
```{r interaction gam}
#GAM with interaction terms
gam_mod_xy <- gam(SwingAndMiss ~ s(plate_x, plate_z, by = PitchAndBatSide, k = 50),
family = 'binomial', discrete = T,
data = df_train1)
summary(gam_mod_xy)
```
The first 3D plot below illustrated a non-linear interaction effect between plate_x and plate_z.
```{r interaction gam plot}
#The plot for GAM with interaction terms
vis.gam(gam_mod_xy, view = c("plate_x", "plate_z"),
theta = 50, n.grid = 50, lwd = 0.4)
```
Another interesting finding here in Figure 5 is that after considering the interaction between horizontal and vertical location and splitting it by pitcher/batter handedness, for LHP vs. RHB specifically, throwing down and out is the most effective. And for the other three combinations, it's like what we saw from previous sections: vertically throwing at the lower half of the strikezone or out of the strike zone and horizontally throwing at the left handed batter side is more effective.
```{r interaction game 2}
df_grid <- expand.grid(plate_z = seq(-2, 6, .1), plate_x = seq(-4,4,.1), PitchAndBatSide = c("L,L", "L,R", "R,L", "R,R")) %>%
add_predictions(gam_mod_xy, var = 'pred_prob', type = 'response')
ggplot(df_grid, aes(x = plate_x, y = plate_z, fill = pred_prob, z = pred_prob)) +
geom_raster() +
geom_contour() +
coord_fixed() +
facet_wrap(vars(PitchAndBatSide)) +
ggtitle("Figure 5") +
theme_classic()
```
One thing though, is that the model with the interaction between s(plate_x) and s(plate_z) has a higher AIC, which means including this interaction does not improve our model’s performance. But still, it was close and reasonable to think that some non-linear interaction effect could actually be the real case in baseball and help with our prediction going forward.
```{r comparison between gam and gam_xy}
#Check for AIC to compare gam_mod with gam_mod_xy
AIC(gam_mod, gam_mod_xy)
```
**Calibration Plot**
In Figure 6 below I will bin the predicted probabilities from the **gam_mod_xy** into groups, check the number of instances in each group and calculate the actual swing and miss rate for each group, which is the concept of a calibration plot. A good model would have the dots close to the dashed line(slope = 1, intercept = 0), which is almost the case here. The dots are further away from the dashed line when the predicted/true probability goes higher. The reason behind this could be the small sample size there. Perhaps robust regression could help with this matter in future explorations. Also, the fact that the predicted probability tends to be higher than observed probability in low probability instances and lower in high probability instances tells us that there may be underfitting issues, which is why in later section we will fit more complex models like random forest model and xgboost model too.
```{r calibration}
#Calibration plot
df_cal <- df_train1 %>%
add_predictions(gam_mod_xy, var = 'pred_prob', type = 'response')
df_cal <- df_cal %>%
mutate(ProbBin = cut(df_cal$pred_prob, seq(0, 0.18, .01)))
mid_point <- seq(0.005, 0.175, 0.01)
cal_summary <- df_cal %>%
group_by(ProbBin) %>%
summarize(SM_Prob = mean(as.numeric(SwingAndMiss)),
n = n()) %>%
mutate(BinPoint = mid_point)
ggplot(cal_summary, aes(x = BinPoint, y = SM_Prob, size = n)) +
geom_point() +
xlab("Predicted Probability") +
ylab("True Probability") +
geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
ggtitle("Figure 6") +
theme_classic()
```
**SMAA and Split Half Correlation**
Below Figure 7 shows how much more or less swing and misses the pitcher generated compared to what the model predicted based on plate location and pitcher/batter handedness. SMAA means the swing and miss above average. Here we can see how this skill changes from 2018 to 2019 for the pitchers in the dataset.
```{r player SMAA}
df_player <- df_train1 %>%
add_predictions(gam_mod_xy, var = 'pred_prob', type = 'response') %>%
mutate(SMAboveAvg = SwingAndMiss - pred_prob)
df_summary <- df_player %>%
group_by(Pitcher_ID, Pitcher, Season) %>%
summarize(SMAA = sum(SMAboveAvg, na.rm = T),
n = n(),
SMAAPerPitch = SMAA/n)
df_split_half <- df_summary %>%
mutate(Yr_plus1 = Season + 1) %>%
inner_join(df_summary, by = c('Yr_plus1' = 'Season', 'Pitcher_ID', 'Pitcher'))
ggplot(df_split_half, aes(x= SMAAPerPitch.x, y = SMAAPerPitch.y)) +
geom_point(alpha = 0.2) +
xlab("SMAAPerCB in 2018") +
ylab("SMAAPerCB in 2019") +
geom_text_repel(size = 3, aes(label = Pitcher), force_pull = 0.1) +
ggtitle("Figure 7") +
theme_classic()
```
Then To check how reliable this SMAA (swing and miss above average) metric is, we will also do some split half correlations between the two seasons. However, the dataset here only has 2083 instances. Thus the split half correlations are not that intuitive or significant.
```{r split half correlation}
split_half_correlation <- function(df, n, metric, playerIdCol = "PlayerId", seed = 1){
set.seed(seed)
df_shc <- df %>%
dplyr::filter(is.na(!!as.name(metric)) == FALSE) %>%
group_by(!!as.name(playerIdCol), Season) %>%
mutate(num = n()) %>%
dplyr::filter(num >= n * 2)
if(nrow(df_shc) > 0){
df_shc <- df_shc %>%
sample_n(n * 2) %>%
arrange(!!as.name(playerIdCol), Season) %>%
mutate(group = c(rep(1, n), rep(2, n))) %>%
group_by(!!as.name(playerIdCol), Season, group) %>%
summarize(Metric = mean(!!as.name(metric))) %>%
ungroup() %>%
pivot_wider(id_cols = c(playerIdCol, 'Season'),
names_from = c('group'),
values_from = c('Metric')) %>%
summarize(cor_Metric = cor(`1`, `2`))
} else {
df_shc <- data.frame(cor_Metric = NA)
}
df_shc$Skill <- metric
df_shc$n <- n
return(df_shc)
}
```
```{r search for min sample size}
find_min_sample_sizes <- function(df_shc, benchmark = 0.7, min_acceptable = 0.5, print = TRUE){
min_sample_sizes <- df_shc %>%
filter(cor_Metric >= min_acceptable) %>%
group_by(Skill) %>%
mutate(maxCor = max(cor_Metric),
hasAboveThreshold = ifelse(maxCor >= benchmark, 1, 0),
isAboveThreshold = ifelse(cor_Metric >= benchmark, 1, 0)) %>%
filter(!(hasAboveThreshold == 1 & isAboveThreshold == 0)) %>%
group_by(Skill, hasAboveThreshold, isAboveThreshold) %>%
summarize(minN = min(n),
maxN = max(n)) %>%
ungroup() %>%
mutate(min_sample_size = ifelse(hasAboveThreshold == 0, maxN, minN)) %>%
dplyr::select(Skill, min_sample_size) %>%
tibble::column_to_rownames(var = 'Skill')
if(print == TRUE){
print(df_shc %>%
ggplot(aes(x=n, y=cor_Metric, color=Skill)) +
geom_line(size=1) +
geom_hline(yintercept = .7) +
theme_classic())
}
return(min_sample_sizes)
}
```
As from below, the correlation plot does look a bit weird. Normally it would go from zero and then gradually increases. But again the number of instances are probably just way too small in this project. It's still good exercise though.
```{r find min sample size}
# Iterate through metrics/split-halves
list_n <- seq(10, 80, 10)
list_shc <- list()
i <- 1
metric <- 'SMAboveAvg'
for(n in list_n) {
message(paste0(metric, ' - ', n))
tmp_shc <- split_half_correlation(df = df_player,
n = n,
metric = metric,
playerIdCol = "Pitcher_ID")
list_shc[[i]] <- tmp_shc
i <- i+1
}
# Remove data points where the SHC is too unstable
df_shc <- as.data.frame(data.table::rbindlist(list_shc))
# Find, visualize (for validation purposes)
min_sample_sizes <- find_min_sample_sizes(df_shc = df_shc,
benchmark = 0.7,
min_acceptable = 0.5)
```
### Random Forest and XGBoost Model
Then we will move on to building random forest and xgboost models, both using plate location and pitcher/batter handedness as independent variables to compare with the gam model with interaction terms. The reason I chose random forest is because its tree structure is ideal for capturing interactions between features in the data. Random Forest builds trees independently, and the results are aggregated into a single result at the end. The concept of collective intelligence here is another advantage. As for xgboost, the decision trees are built additively. Each tree is built one after another. It aggregates the results of each decision tree along the way to calculate the final result. It is less prone to overfitting because of its advanced regularization too.
```{r rf and xgboost}
#Rename the stepwise logistic regression model
init_mod <- step.model
df_train1 <- df_train1 %>%
mutate(SwingAndMiss_Factor = as.factor(SwingAndMiss))
#Build the random forest model
caret_rf <- caret::train(
SwingAndMiss_Factor ~ plate_x * plate_z + PitchAndBatSide,
data = df_train1,
method = 'rf',
family = 'binomial',
verbosity = 0
)
#Build the xgboost model
caret_xgb <- caret::train(
SwingAndMiss_Factor ~ plate_x * plate_z + PitchAndBatSide,
data = df_train1,
method = 'xgbTree',
family = 'binomial',
verbosity = 0
)
```
```{r rf and xgboost df}
df_grid <- expand.grid(plate_z = seq(-2, 6, .1), plate_x = seq(-3,3,.1), PitchAndBatSide = c("L,L", "L,R", "R,L", "R,R"))
# Random Forest prediction
pred_rf <- predict(caret_rf, newdata = df_grid, type = 'prob') %>%
rename(pred_prob = `1`)
df_grid_rf <- cbind(df_grid, pred_rf)
# XGB prediction
pred_xgb <- predict(caret_xgb, newdata = df_grid, type = 'prob') %>%
rename(pred_prob = `1`)
df_grid_xgb <- cbind(df_grid, pred_xgb)
# GAM prediction
df_grid_gam <- df_grid %>%
add_predictions(gam_mod_xy, var = 'pred_logit') %>%
add_predictions(gam_mod_xy, var = 'pred_prob', type = 'response')
```
From Figure 8, 9, 10 below you will be interested in the relationship between cb swing and miss prediction and plate locations by different handedness and models (GAM, Random Forest and xgboost, respectively). In Figure 11, you'll see the actual swing and miss rate based on plate location. Also in Figure 11, we could solidify our belief that plate_x and plate_z are indeed related. Generally from these plots, it's still **vertically throwing at the lower half of the strikezone or out of the strike zone** and **horizontally throwing at the left handed batter side** being more effective. Also, **LHP throwing further down and out againt RHB** could have surprising good effect in generating swings and misses.
```{r rf and xgboost plots}
ggplot(df_grid_gam, aes(x = plate_x, y = plate_z, fill = pred_prob, z = pred_prob)) +
geom_raster() +
geom_contour() +
coord_fixed() +
labs(title = 'GAM') +
facet_wrap(vars(PitchAndBatSide)) +
ggtitle("Figure 8") +
theme_classic()
```
One thing interesting to point out here is that the prediction from random forest and xgboost are both somewhat discrete. It could be because of their tree structure. Trees fail to deal with linear relationships. Any linear relationship between an input feature and the outcome has to be approximated by splits, creating a step function. This is not efficient. This goes hand in hand with lack of smoothness. Slight changes in the input feature can have a big impact on the predicted outcome, which is usually not desirable.
```{r rf plot}
ggplot(df_grid_rf, aes(x = plate_x, y = plate_z, fill = pred_prob, z = pred_prob)) +
geom_raster() +
geom_contour() +
coord_fixed() +
labs(title = 'Random Forest Model') +
facet_wrap(vars(PitchAndBatSide)) +
ggtitle("Figure 9") +
theme_classic()
```
```{r xgboost plot}
ggplot(df_grid_xgb, aes(x = plate_x, y = plate_z, fill = pred_prob, z = pred_prob)) +
geom_raster() +
geom_contour() +
coord_fixed() +
labs(title = 'XGBoost Tree') +
facet_wrap(vars(PitchAndBatSide)) +
ggtitle("Figure 10") +
theme_classic()
```
```{r actual swing and miss rate}
ggplot(df_train1 %>% filter(!is.na(plate_z), !is.na(plate_x)),
aes(x = plate_x, y = plate_z, z = SwingAndMiss)) +
stat_summary_2d() +
xlim(-3,3) +
ylim(-2,6) +
coord_fixed() +
scale_fill_gradient() +
labs(title = 'Data Summary') +
facet_wrap(vars(PitchAndBatSide)) +
ggtitle("Figure 11") +
theme_classic()
```
Then below you'll be interested in the estimates of the gam, random forest and xgboost models, respectively. These estimates can be used for calculating the true talent of a pitcher generating swing and miss with his curveballs compared to others throwing at the same location. Basically what we're doing here is that we're breaking down the prediction into components, one of which is true talent and the other is regression to the mean. So If we have a lot of CB thrown we can be much more confident that our prediction is a reflection of your true skill and rely less on regression to the mean, whereas if we don't have a lot of data we have to essentially add average data to get that prediction, which means it's heavily relying on regression to the mean. We will focus on the estimate in the Random Forest model since the p-value is lowest here. At the end of this section, there is a leaderboard for the true talent of this skill.
```{r rf and xgboost prediction}
df_pred <- df_train1 %>% filter(!is.na(plate_z), !is.na(plate_x))
# Random Forest Prediction
pred_rf <- predict(caret_rf, newdata = df_pred, type = 'prob') %>%
rename(prob_rf = `1`) %>%
dplyr::select(prob_rf)
df_pred <- cbind(df_pred, pred_rf)
# XGB Prediction
pred_xgb <- predict(caret_xgb, newdata = df_train1, type = 'prob') %>%
rename(prob_xgb = `1`) %>%
dplyr::select(prob_xgb)
df_pred <- cbind(df_pred, pred_xgb)
# GAM Prediction
df_pred <- df_pred %>%
add_predictions(gam_mod_xy, var = 'prob_gam', type = 'response')
```
```{r rf and xgboost SMAA}
#rf and xgboost SMAA
df_player <- df_pred %>%
group_by(Pitcher_ID, Pitcher, Season) %>%
summarize(
n = n(),
ActualSM = sum(SwingAndMiss),
expSM_gam = sum(prob_gam),
expSM_xgb = sum(prob_xgb),
expSM_rf = sum(prob_rf),
.groups = 'drop')
df_player <- df_player %>%
mutate(
SMAA_gam = ActualSM - expSM_gam,
SMAA_xgb = ActualSM - expSM_xgb,
SMAA_rf = ActualSM - expSM_rf
)
```
```{r rf and xgboost estimates}
set.seed(246)
#rf and xgboost estimates
df_player_sh <- df_pred %>%
mutate(
SMAA_gam = SwingAndMiss - prob_gam,
SMAA_xgb = SwingAndMiss - prob_xgb,
SMAA_rf = SwingAndMiss - prob_rf,
sh = rbinom(nrow(df_pred), 1, 0.5)
) %>%
filter(!is.na(SMAA_gam)) %>%
group_by(Pitcher_ID, Season, sh) %>%
summarize(
SMAA_gam = sum(SMAA_gam, na.rm=T),
SMAA_xgb = sum(SMAA_xgb, na.rm=T),
SMAA_rf = sum(SMAA_rf, na.rm=T),
n = n()
) %>%
mutate(other_sh = 1-sh) %>%
ungroup()
df_player_sh_train <- df_player_sh %>%
filter(sh ==0) %>%
filter(n > 50) %>%
inner_join(df_player_sh, by=c('Pitcher_ID', 'Season', 'sh'='other_sh'), suffix=c('.sh1','.sh2')) %>%
mutate(eff_wt = pmin(n.sh1, n.sh2))
regression_mod <- nls(
I(SMAA_gam.sh1 / n.sh1) ~ (SMAA_gam.sh2 + 0 * ballast)/(n.sh2 + ballast),
data = df_player_sh_train,
start = list(ballast=1000),
lower = list(ballast=10),
weights = eff_wt
)
summary(regression_mod)
regression_mod <- nls(
I(SMAA_rf.sh1 / n.sh1) ~ (SMAA_rf.sh2 + 0 * ballast)/(n.sh2 + ballast),
data = df_player_sh_train,
start = list(ballast=1000),
lower = list(ballast=10),
weights = eff_wt
)
summary(regression_mod)
regression_mod <- nls(
I(SMAA_xgb.sh1 / n.sh1) ~ (SMAA_xgb.sh2 + 0 * ballast)/(n.sh2 + ballast),
data = df_player_sh_train,
start = list(ballast=1000),
lower = list(ballast=10),
weights = eff_wt
)
summary(regression_mod)
```
```{r gam true talent}
#rf true talent
df_player %>%
mutate(
SM_Talent = SMAA_rf/(n+50),
Regression = 50/(50+n)) %>%
dplyr::select(Pitcher, Season, SM_Talent) %>%
arrange(-SM_Talent)
```
### Linear Mixed Models
In additional to pitch and bat side, individual pitcher could affect the models too. In this dataset, data could violate the assumption of being independent and identically distributed. Thus, here we will try a mixed model with Pitcher_Throws, Pitcher_ID and Batter_Hits being random effects, also plate location being fixed effects. From below, we can see that the variance for Pitcher_ID:Pitcher_Throws = 1.104e-01. It explains most of the variation from the random effects, which is the variance that’s “left over” after the variance explained by our fixed effects. This implies that who the pitcher is could affect the prediction, which lines up with why we came up with the true talent leaderboard. As for plate location, the estimates being negative still tells us the similar outcome as previous: Throwing at the lower part of the strike zone or even lower than the strike zone vertically and throwing close to the left handed batter side could generate more whiffs.
```{r lme}
glmer <- glmer(SwingAndMiss ~ plate_x*plate_z + (1 | Pitcher_Throws/Pitcher_ID) + (1 | Batter_Hits),
data = df_train1, family = binomial, nAGQ=1)
summary(glmer)
```
### Takeaways
1. From nearly all of the models, plate location contributes a lot to the prediction of a curveball generating swing and miss or not.
2. Generally, throwing at the lower part of the strike zone or even lower than the strike zone vertically and throwing close to the left handed batter side horizontally could generate more whiffs.
3. LHP throwing further down and out against RHB could have surprising good effect in generating swing and misses.
4. Who the pitcher is affects how strong this phenomenon is. We can see the true talent for this "skill" for the pitchers in the dataset in the leaderboard. However, the sample size is small here. More instances should give us more confidence in this.
5. A summary here about how we go from logistic regression to xgboost: We started with a simple and interpretable algorithm in logistic regression. The result was fine, and we got some valuable interpretation. However, there could be non-linear relationship. Thus, we went to GAM. We captured some valuable non-linear relationship and interaction. However, there are still some underfitting issues, which brought us to more complex models in random forest and xgboost. The prediction we got from these two complex models are a bit discrete, which is representative of their tree structure. However, the random forest model was still most powerful in helping us with the true talent calculations. Lastly, we used the mixed model to confirm our findings and justify the action of calculating the true talent of this skill. Hopefully with these models, we not only had some good interpretations, but also accurate predictions.
### Some Limitations and Potential Future Steps
- CBs could be less effective at generating swing and misses than say SLs, so maybe it is more important for some pitchers to be able to locate that pitch in the zone. Focusing on the swing and miss of curveballs could be misleading since in some cases pitchers probably aren't really trying to get a swing and miss, they're only trying to loop that thing in the zone for a called first strike.
- Add more instances to the dataset.
- Add more predictors to gam or mixed model and more explorations/visualizations.
- More All encompassing models (Random forest, XGB, LightGBM) and Interpretation.
- Global Model-Agnostic Methods
- The partial dependence plot: a feature effect method.
- Accumulated local effect plots: another feature effect method that works when features are dependent.
- Permutation feature importance: it measures the importance of a feature as an increase in loss when the feature is permuted.
- Local Model-Agnostic Methods
- Individual conditional expectation curves: they are the building blocks for partial dependence plots and describe how changing a feature changes the prediction.
- Local surrogate models (LIME): explain a prediction by replacing the complex model with a locally interpretable surrogate model.
- Counterfactual explanations: explain a prediction by examining which features would need to be changed to achieve a desired prediction.
- Shapley values: are an attribution method that fairly assigns the prediction to individual features.