-
Notifications
You must be signed in to change notification settings - Fork 0
/
report.Rmd
958 lines (740 loc) · 42.1 KB
/
report.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
---
title: "STAT-225 (Nonparametric Statistics) Project Report"
subtitle: "Predicting Diamond Price"
author: "Group 8: Nicole Frontero, Anna Ballou, Alex Russell"
date: "April 22, 2020"
output:
pdf_document:
html_document:
fig_height: 3
fig_width: 5
word_document:
fig_height: 3
fig_width: 5
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(warn = -1)
library(GGally)
library(tidyverse)
library(olsrr)
library(Rfit)
library(stats)
library(gam)
library(gridExtra)
library(kableExtra)
```
# Introduction and Exploratory Analyses
We aim to identify the best metric for predicting the price of a diamond. Diamonds vary in many ways from one another, for example, in various measurements of size, as well as in color and cut quality. Our question of interest is: Are certain qualities of diamonds better predictors for diamond price? In other words, are certain characteristics of diamonds more strongly associated with the price of a diamond than others?
The data is from the `diamonds` dataset from the `ggplot2` package in R. The data can be found on the official `ggplot2` webpage.\footnote{https://github.com/tidyverse/ggplot2/blob/master/data-raw/diamonds.csv} This dataset contains the prices and other attributes of nearly 54,000 diamonds. Each observation in this dataset represents a unique diamond. We randomly sampled 500 observations from the overall dataset.
```{r, include = F}
data(diamonds)
```
```{r, include = F}
set.seed(122)
data <- diamonds[sample(dim(diamonds)[1], 500), ]
head(data)
```
The observational units in our dataset are diamonds. Since we took a random sample from the larger dataset (`diamonds`), we would expect that our data is representative of the approximately 54,000 diamonds in `diamonds`. We are assuming that the observations in `diamonds` are representative of the world diamond population.
Below is brief introduction to our data:
* The response variable is price in US dollars, which ranges between \$326 - \$18,823.
* The explanatory variables are as follows:
- Carat: measure of mass (weight) of the diamond. Ranges between (0.2ct - 5.01ct). Note that 1 ct = 200 mg.
- length: length in mm. Ranges between (0mm - 58.9mm).
- width: width in mm. Ranges between (0mm - 58.9mm).
- depth: depth in mm. Ranges between (0mm - 31.8mm).
- depth: total depth percentage, which is calculated as $\frac{z}{\text{mean}(x,y)} = \frac{2z}{x+y}$, and ranges from 43% - 79%.
- table: the width of the top of the diamond relative to the widest point in mm. Ranges between 43mm - 95mm.
- cut: quality of the cut. Includes fair, good, very good, premium and ideal.
- clarity: a measurment of how clear the diamond is. Ranges between "I1" (worst) to "IF" (best).
- color: diamond color. Ranges between D (best) to J (worst.
```{r, include = F}
cat_data <- data %>%
mutate(length = x, width = y, depth_percent = depth, depth = z, price = log(price),
cut = as.character(cut), color = as.character(color),
clarity = as.character(clarity)) %>%
filter(depth != 0) %>%
select(-x, -y, -z)
```
When initially examining `price`, we noticed that it was heavily right skewed. As a result, we decided to log transform price (base $e$). A comparison of the non-transformed and transformed distribution of price can be seen in plots 1 and 2 below.
```{r, include = F}
#non transformed
non_transformed <- ggplot(data = data, aes(x = price)) +
geom_density(fill="blue") +
ggtitle("Distribution of price", subtitle = "gaussian kernel, nrd bandwidth") +
labs(caption = "plot 1")
# transformed
transformed <- ggplot(data = cat_data, aes(x = price)) +
geom_density(kernel = "rectangular", bw = "sj", fill="red") +
ggtitle("Distribution of log(price)", subtitle = "rectangular kernel, SJ bandwidth") +
labs(caption = "plot 2", x = "log(price)")
```
```{r, fig.height=4.5, fig.width=8, echo = F}
grid.arrange(non_transformed, transformed, ncol = 2)
```
We utilized `ggpairs()` to examine the quantitative variables in our data.
```{r, echo = F}
ggpairs(dplyr::select(cat_data, length, width, depth, carat, depth_percent, price))
```
As seen above, we noticed that `length`, `width` and `depth` appear to have a strong correlation with log(`price`) and all follow a very similar trend (upward sloping and linear). Additionally, these three variables are also strongly correlated with `carat.` `depth_percent` does not appear to be correlated with log(`price`) (Pearson's correlation of -0.0615). Lastly, `carat` appeared to have the most non-linear relationship with log(`price`).
Because of this observed non-linear relationship between `carat` and `price`, we hypothesized that the correlation reported in the `ggpairs()` output above underestimated the actual correlation. `ggpairs()` reports correlation using a Pearson Coefficient, which is ideal for linear relationships, but often underestimates non-linear correlation. Thus, we performed a Spearman's test for correlation on `carat` and `price`. Spearman's method is ideal for non-linear relationships. The results of our Spearman's correlation test (outlined below) give a correlation of 0.965 (which is larger than the reported correlation of 0.925).
**Hypotheses:**
$H_0: \rho = 0; H_A: \rho \neq 0$
Let $\alpha = 0.05$ and $\rho$ represent the correlation between length and price.
**Assumptions:**
We will assume that all pairs of observations are independent from each other.
**Test Results:**
$p \approx 0$, $\rho = 0.965$
```{r, include = F}
cor.test(x = cat_data$carat, y = cat_data$price, data = cat_data, method = "spearman")
```
**Conclusion:**
Because $p \approx 0 < \alpha = 0.05$ for the Spearman's's test for correlation, we reject our null hypotheses. Thus, carat is correlated with log(price).
# Methods
## OLS MLR
We aimed to build a more precise model that eliminated predictors that did not enhance predictive power. We used forward stepwise regression, which systematically adds variables to a model until the adjusted $R^2$ (i.e. how well the model fits data) fails to increase. We utilized forward stepwise linear regression to examine all possible predictors (carat, depth, table, length, width, color, clarity, and cut). The results of the forward stepwise regression are depicted in Table 1.
```{r, include = F}
cat_and_quant_lm <- lm(price ~ carat + depth + table + length + width + depth_percent
+ color + clarity + cut, data = cat_data)
stepwise_output <- ols_step_forward_p(cat_and_quant_lm)
stepwise_predictors <- stepwise_output$predictors
stepwise_adjr <- round(stepwise_output$adjr, 3) # rounding adjr
stepwise_aic <- round(stepwise_output$aic, 3)
steps <- c(1, 2, 3, 4, 5, 6, 7)
stepwise_df <- data.frame(cbind(steps, stepwise_predictors, stepwise_adjr, stepwise_aic))
colnames(stepwise_df) <- c("Step", "Predictors", "\\(R^2_\\text{adj.}\\)", "AIC")
table_stepwise <- knitr::kable(stepwise_df, format = "latex", booktabs = TRUE, escape = FALSE)%>%
kableExtra::kable_styling(font_size = 10, position = "center", row_label_position = "c", full_width = F) %>%
add_header_above(c("Table 1: Stepwise regression results" = 4))
```
```{r, echo = F}
table_stepwise
```
We noticed that forward stepwise regression excluded the `depth_percent` and `table` variables.
After stepwise regression, we built a multiple regression model to predict log(`price`) from the predictors included in forward stepwise regression. We exlcuded cut as it was the last categorical variable added and was largely explained by length and width. Each catgorical and quantitative variable with their respective coefficients and p-values can be seen below in Table 2.
```{r, include = F}
final_lm <- lm(price ~ width + clarity + color + carat + depth +
length, data = cat_data)
summary_final_lm <- summary(final_lm)
# get the coefficients
final_lm_coeffs <- data.frame(summary_final_lm$coefficients)
# get the names of the predictors
predictors_final_lm <- rownames(final_lm_coeffs)
# get the r-squared adjusted
final_lm_adjr <- summary_final_lm$adj.r.squared
# make the final df
final_lm_df <- cbind(predictors_final_lm, final_lm_coeffs)
final_lm_df <- select(final_lm_df, predictors_final_lm, Estimate, Pr...t..)
colnames(final_lm_df) <- c("Predictors", "Estimate", "P.value")
# final_lm_df <- mutate(final_lm_df, indicator = ifelse(P.value < 0.0001, 1, 0))
final_lm_df <- mutate(final_lm_df, indicator = ifelse(P.value < 0.0001, "< 0.0001", "6"))
final_lm_df[1,4] <- round(0.5915609, 5)
final_lm_df[2,4] <- round(0.0131907, 5)
final_lm_df[10,4] <- round(0.3061189, 5)
final_lm_df[11,4] <- round(0.0003385, 5)
final_lm_df <- final_lm_df %>% select(-P.value)
colnames(final_lm_df) <- c("Predictors", "Estimate", "P-value")
top_final_lm_df <- final_lm_df[0:9, ]
bottom_final_lm_df <- final_lm_df[10:18, ]
whole_df <- cbind(top_final_lm_df, bottom_final_lm_df)
# making table for kable
table_final_lm <- knitr::kable(whole_df, format = "latex", booktabs = TRUE, escape = FALSE) %>%
kableExtra::kable_styling(font_size = 10, position = "center", row_label_position = "l") %>%
add_header_above(c("log(price) ~ width + clarity + color + carat + depth + length" = 6)) %>%
add_header_above(c("Table 2: OLS model summary results" = 6)) %>%
column_spec(column = 3, border_right = TRUE)
```
```{r, echo = FALSE}
table_final_lm
```
Note: $R^2_\text{adj} = 0.983$
## Is OLS the best or should we use JHM (nonparametric approach)?
In order to decide if we were able to use an OLS model to predict log(`price`), we needed to examine the model's residuals for normality. Plot 3 below depicts the density of the OLS MLR model's residuals.
```{r, include = FALSE}
residuals <- final_lm$residuals
residuals <- as.data.frame(residuals)
colnames(residuals) <- c("residuals")
residual_model_plot <- ggplot(data = residuals, aes(x = residuals)) +
geom_density(fill = "orange") +
ggtitle("Distribution of OLS residuals") +
labs(caption = "plot 3")
```
```{r, echo = FALSE, fig.height=3.25, fig.width=3.25, fig.align="center"}
residual_model_plot
```
At first glance, the distribution looks relatively normal. We formally tested the residuals for normality using a Kolmogorov-Smirnov test. The details of the test are outline below:
**Hypotheses:**
$H_0:F(t)=F^{star}(t); H_A:F(t) \neq F^{star}(t)$ for at least one $t$, where $F^{star}(t)$ is the normal distribution and $F(t)$ is the observed distribution of price.
**Assumptions:**
Data come from a continuous distribution.
**Test:**
$p \approx 0$
```{r, warning = FALSE, include = FALSE}
ks.test(x = residuals$residuals, y = pnorm, alternative = c("two.sided"))
```
The empirical CDF of price compared to the normal CDF with observed mean and standard deviation is below (plot 4):
```{r, include = F}
norm_cdf <- data.frame(cbind(c(seq(from = min(residuals$residuals), to = max(residuals$residuals), by = 0.01)), pnorm(seq(from = min(residuals$residuals), to = max(residuals$residuals), by = 0.01), mean = mean(residuals$residuals), sd = sd(residuals$residuals))))
ecdf_plot <- ggplot() +
geom_point(inherit.aes = F, data = norm_cdf, aes(x = X1, y = X2), size = 1, color = "#F8766D") +
stat_ecdf(data = residuals, aes(x = residuals)) +
scale_x_continuous(name = "residuals") +
scale_y_continuous(name = "Empirical CDF") +
ggtitle("Empirical CDF of OLS residuals", subtitle = "red = normal CDF") +
labs(caption = "plot 4")
```
```{r, echo = FALSE, fig.height=3.25, fig.width=3.25, fig.align="center"}
ecdf_plot
```
**Conclusion:**
Because $p \approx 0 < \alpha = 0.05$, we reject our null hypotheses. We conclude that the distribution of the OLS residuals is not normal.
### JHM - best model
Since the OLS model residuals do not follow a normal distribution, we will create an JHM (rank based) regression model. Our JHM model included the same predictors as our OLS MLR. Each categorical and quantitative variable with their respective coefficients and p-values can be seen below in Table 3.
```{r, include = F}
rfit_final <- rfit(price ~ width + clarity + color + carat + depth
+ length, data = cat_data)
# data.frame(summary(rfit_final)$coefficients) %>% select(p.value)
summary_rfit_lm <- summary(rfit_final)
# get the coefficients
rfit_lm_coeffs <- data.frame(summary_rfit_lm$coefficients)
# get the names of the predictors
predictors_rfit_lm <- rownames(rfit_lm_coeffs)
# get the r-squared adjusted
rfit_lm_adjr <- summary_rfit_lm$R2
# make the final df
rfit_lm_df <- cbind(predictors_rfit_lm, rfit_lm_coeffs)
rfit_lm_df <- select(rfit_lm_df, predictors_rfit_lm, Estimate, p.value)
colnames(rfit_lm_df) <- c("Predictors", "Estimate", "P.value")
rfit_lm_df <- mutate(rfit_lm_df, indicator = ifelse(P.value < 0.0001, 1, 0))
rfit_lm_df <- mutate(rfit_lm_df, indicator = ifelse(P.value < 0.0001, "< 0.0001", "FALSE"))
rfit_lm_df[1,4] <- 0.93893
rfit_lm_df[2,4] <- 0.01848
rfit_lm_df[10,4] <- 0.29299
rfit_lm_df[11,4] <- 0.00054
rfit_lm_df <- rfit_lm_df %>% select(-P.value)
colnames(rfit_lm_df) <- c("Predictors", "Estimate", "P-value")
top_rfit_lm_df <- rfit_lm_df[0:9, ]
bottom_rfit_lm_df <- rfit_lm_df[10:18, ]
rfit_whole_df <- cbind(top_rfit_lm_df, bottom_rfit_lm_df)
# making table for kable
table_rfit_lm <- knitr::kable(rfit_whole_df, format = "latex", booktabs = TRUE, escape = FALSE) %>%
kableExtra::kable_styling(font_size = 10, position = "center", row_label_position = "l") %>%
add_header_above(c("log(price) ~ width + clarity + color + carat + depth + length" = 6)) %>%
add_header_above(c("Table 3: JHM model summary results" = 6)) %>%
column_spec(column = 3, border_right = TRUE)
```
```{r, echo = F}
table_rfit_lm
```
The plots below represent the JHM model fit for each quantitative variable. The red line presents the slope for that particular predictor in the presence of others. The blue dashed line indicates the mean of log(`price`). We noticed that all four variables follow a similar trend. This is potentially indicative of multicollinearity.
```{r, include = FALSE}
#rfit_final
mean_y = mean(cat_data$price)
mean_width = mean(cat_data$width)
mean_carat = mean(cat_data$carat)
mean_depth = mean(cat_data$depth)
mean_length = mean(cat_data$length)
coef_width = summary(rfit_final)$coefficients[2]
coef10_width = mean_y - coef_width*mean_width
coef2_width = summary(rfit_final)$coefficients[2]
coef20_width = mean_y - coef2_width*mean_carat
coef3_width = summary(rfit_final)$coefficients[2]
coef30_width = mean_y - coef3_width*mean_depth
coef4_width = summary(rfit_final)$coefficients[2]
coef40_width = mean_y - coef4_width*mean_length
rfit_plot1 <- ggplot(data = cat_data, aes(x = width, y = price)) +
geom_point() +
geom_abline(intercept = coef10_width, slope = coef_width, color = "red", size = 1) +
geom_hline(yintercept = mean_y, color = "blue", size = .5, lty = 2) +
ggtitle("log(price) vs. width") +
ylab("log(price)") +
labs(caption = "Plot 5a")
rfit_plot2 = ggplot(data = cat_data, aes(x = carat, y = price)) +
geom_point() +
geom_abline(intercept = coef20_width, slope = coef2_width, color = "red", size = 1) +
geom_hline(yintercept = mean_y, color = "blue", size = .5, lty = 2) +
ggtitle("log(price) vs. carat") +
ylab("log(price)") +
labs(caption = "Plot 5b")
rfit_plot3 <- ggplot(data = cat_data, aes(x = depth, y = price)) +
geom_point() +
geom_abline(intercept = coef30_width, slope = coef3_width, color = "red", size = 1) +
geom_hline(yintercept = mean_y, color = "blue", size = .5, lty = 2) +
ggtitle("log(price) vs. depth") +
ylab("log(price)") +
labs(caption = "Plot 5c")
rfit_plot4 <- ggplot(data = cat_data, aes(x = length, y = price)) +
geom_point() +
geom_abline(intercept = coef40_width, slope = coef4_width, color = "red", size = 1) +
geom_hline(yintercept = mean_y, color = "blue", size = .5, lty = 2) +
ggtitle("log(price) vs. length") +
ylab("log(price)") +
labs(caption = "Plot 5d")
```
```{r, echo = FALSE}
grid.arrange(rfit_plot1, rfit_plot2, rfit_plot3, rfit_plot4)
```
The plots below represent each categorical variable in the JHM model. The height of each point represents the coefficient ($\beta$ value) for that level of the categorical variable. The blue dashed line represents the baseline indicator level for each variable.
```{r, include = F}
#plotting categorical variables
#color
coefs_color <- rfit_final$coefficients[10:15]
coefs_color_df <- as.data.frame(coefs_color)
colnames(coefs_color_df) <- c("coefficients_vals")
color_options <- c("colorE", "colorF", "colorG", "colorH", "colorI", "colorJ")
color_options <- as.data.frame(color_options)
color_df <- cbind(color_options, coefs_color_df)
color_plot <- ggplot(data = color_df,
aes(x = color_options, y = coefficients_vals)) +
geom_point() +
ggtitle("Coefficients for color", subtitle = "dashed line = reference level") +
xlab("color") +
ylab("coefficient") +
ylim(-0.5, 0.1) +
geom_hline(yintercept = 0, lty = 2, color = "blue") +
theme(axis.text.x = element_text(angle = 90)) +
labs(caption = "Plot 6a")
#clarity plot
coefs_clarity <- rfit_final$coefficients[3:9]
coefs_clarity_df<- as.data.frame(coefs_clarity)
colnames(coefs_clarity_df) <- c("coefficients_vals_cl")
clarity_options <- c("clarityIF", "claritySI1", "claritySI2", "clarityVS1", "clarityVS2", "clarityVVS1", "carityVVS2")
clarity_options <- as.data.frame(clarity_options)
clarity_df <- cbind(clarity_options, coefs_clarity_df)
clarity_plot <- ggplot(data = clarity_df,
aes(x = clarity_options, y = coefs_clarity)) +
geom_point() +
ggtitle("Coefficients for clarity", subtitle = "dashed line = reference level") +
xlab("clarity") +
ylab("coefficient") +
ylim(-0.1, 1.1) +
geom_hline(yintercept = 0, lty = 2, color = "blue") +
theme(axis.text.x = element_text(angle = 90)) +
labs(caption = "Plot 6b")
```
```{r, echo=FALSE, fig.height=3.25, fig.width=7, fig.align="center"}
grid.arrange(color_plot, clarity_plot, ncol = 2)
```
## GAM
We were also interested in examining how well the same set of predictors estimate price in a Gereralized Additive Model (GAM). First, we examined the quantitative predictors (`length`, `width`, `depth` and `carat`) for any obvious relationships with log(`price`).
```{r, include = FALSE}
length_plot <- ggplot(data = cat_data, aes(x = length, y = price)) +
geom_point() +
ggtitle("log(price) vs. length") +
ylab("log(price)") +
labs(caption = "Plot 7a")
width_plot <- ggplot(data = cat_data, aes(x = width, y = price)) +
geom_point() +
ggtitle("log(price) vs. width") +
ylab("log(price)") +
labs(caption = "Plot 7b")
depth_plot <- ggplot(data = cat_data, aes(x = depth, y = price)) +
geom_point() +
ggtitle("log(price) vs. depth") +
ylab("log(price)") +
labs(caption = "Plot 7c")
carat_plot <- ggplot(data = cat_data, aes(x = carat, y = price)) +
geom_point() +
ggtitle("log(price) vs. carat") +
ylab("log(price)") +
labs(caption = "Plot 7d")
```
```{r, echo=FALSE}
grid.arrange(length_plot, width_plot, depth_plot, carat_plot, ncol = 2)
```
We noticed that length, width and depth all appear to follow a similar pattern. Because of this, we hypothesized that a smoother would most likely not be necessary for all three of these predictors. We also noted that `carat` is most likely co-linear with `length`, `width` and `depth`, so we noted that we may be able to explain carat's variability using other predictors.
Our first step in building the GAM incorporated comparing a SLR and a smoothing spline for each quantitative predictor. We utilized adjusted $R^2$ as a metric for model fit (calculated using the function below).
```{r}
#function for calculating adjusted r-squared for gam
gam_adjusted <- function(model){
rsq_gam = 1 - model$deviance/model$null.deviance
adjrsq_gam = 1 - (1 - rsq_gam)*(model$df.null/model$df.residual)
return(adjrsq_gam)
}
```
```{r, warning=F, echo = FALSE}
#WIDTH
smoothing_width <- gam(price ~ s(width, df = 6), data = cat_data)
slr_width <- lm(price ~ width, data = cat_data)
#gam_adjusted(smoothing_width)
#LENGTH
smoothing_length <- gam(price ~ s(length, df = 6), data = cat_data)
slr_length <- lm(price ~ length, data = cat_data)
#gam_adjusted(smoothing_length)
#CARAT
smoothing_carat <- gam(price ~ s(carat, df = 6), data = cat_data)
slr_carat <- lm(price ~ carat, data = cat_data)
#gam_adjusted(smoothing_carat)
#DEPTH
smoothing_depth <- gam(price ~ s(depth, df = 6), data = cat_data)
slr_depth <- lm(price ~ depth, data = cat_data)
#gam_adjusted(smoothing_depth)
```
A summary of the adjusted $R^2$ values for each model type can be seen in Table 4 below.
```{r, include = F}
adjrsquared_slr <- c(0.931, 0.929, 0.856, 0.870)
adjrsquared_gam <- c(0.943, 0.942, 0.941, 0.935)
predictors <- c("width", "length", "carat", "depth")
#create data table
decision_table <- cbind(predictors, adjrsquared_slr, adjrsquared_gam)
# colnames(decision_table) <- c("Predictor", "SLR", "Smooth")
decision_table <- knitr::kable(decision_table, "latex", booktabs = TRUE, escape = FALSE, col.names = c("Predictor", "SLR", "Smooth")) %>%
kableExtra::kable_styling(font_size = 10, position = "center", row_label_position = "l", full_width = F) %>%
# add_header_above(c("Table 4: SLR vs. smooth \\(R^2_\\text{adj}\\)" = 3))
add_header_above(c("Table 4: SLR vs. smooth $R^2_{\\\\text{adj}}$" = 3), escape = FALSE)
```
```{r, echo = FALSE}
decision_table
```
As seen in Table 4, for all 4 of the quantitative predictors, the $R^2_\text{adj}$ for the smoother was greater than the SLR.
We next built several potential GAMs for these 4 quantitative predictors. We included both categorical variables (clarity and color) for all models. The following is a summary and rationale for each model:
1. gam_full: a smoothing spline on a 4 quantitative predictors. We started with this GAM as our tests comparing smoothing to SLR suggested smoothing was as better metric for each variable.
2. gam_mod2: a smoothing spline on `carat` and `depth` and linear relationships for `length` and `width`. We chose to make `length` and `width` linear because a plot of their relationship showed a linear relationship with `log(price)`. We recongized that using smoothers on all 4 predictors in a GAM tends to lead to overfitting.
3. gam_mod3: a smoothing spline on `width` and linear relationship for `depth` and `carat.` We noticed that, in the presence of other predictors, `carat` was essentially linear. Switching `carat` from a smoothing spline to linear relationship would decrease the degrees of freedom and could improve the performance of our model. In this model, we also tried removing `length` as it seemed colinear with `width, depth` and `carat.`
4. gam_mod4: linear predictors on all variables. The general trend of the data appeared to be very close to linear for all predictors.
5. gam_mod5: smoothers on `length` and `width` and linear relationships for `depth` and `carat.` We decided to include all 4 quantitative variables in this model as removing one resulted in an increase in AIC.
```{r, include = F, warning = F}
#with all quantitative predictors
gam_full <- gam(price ~ clarity + color +
s(length, df = 6) +
s(width, df = 6) +
s(depth, df = 6) +
s(carat, df = 6),
data = cat_data)
gam_mod2 <- gam(price ~ clarity + color + length + width +
s(carat, df = 6) +
s(depth, df = 6),
data = cat_data)
gam_mod3 <- gam(price ~ clarity + color + depth + carat +
s(width, df = 6),
data = cat_data)
gam_mod4 <- gam(price ~ clarity + color + depth + carat + width,
data = cat_data)
gam_mod5 <- gam(price ~ clarity + color + depth + carat +
s(length, df = 6) +
s(width, df = 6),
data = cat_data)
AIC(gam_full, gam_mod2, gam_mod3, gam_mod4, gam_mod5)
```
A summary of each GAM's performance is summarized in table 5, below.
```{r, include = F}
aics <- c(-649.94, -612.54, -596.29, -545.24, -652.97)
modelNames <- c("color + clarity + s(length) + s(width) + s(depth) + s(carat)",
"color + clarity + length + width + s(depth) + s(carat)",
"color + clarity + depth + carat + s(width)",
"color + clarity + depth + carat + width + length",
"color + clarity + depth + carat + s(width) + s(length)")
#create df
choose_gam_table <- data.frame(cbind(modelNames, aics))
colnames(choose_gam_table) <- c("Model", "AIC")
choose_gam_table <- knitr::kable(choose_gam_table, "latex", booktabs = TRUE, escape = FALSE) %>%
kableExtra::kable_styling(font_size = 10, position = "center", row_label_position = "l", full_width = F) %>%
add_header_above(c("Table 5: Comparison of AIC between GAM models" = 2))
```
```{r, echo = F}
choose_gam_table
```
As seen in table 5, `gam_mod5` had the lowest AIC. Plots of this GAM's performance in comparison to a smoothing spine are below.
```{r, include = F}
#plot gam 5
gam_5_c = predict(gam_mod5, type = "terms")
gam_5_y = fitted(gam_mod5)
diamonds_subset = select(.data = cat_data, clarity, color, depth, carat, length, width, price)
gam_5_plots = cbind(diamonds_subset, gam_5_c, gam_5_y)
cnn = c(colnames(diamonds_subset),"clarity_pred", "color_pred", "depth_pred",
"carat_pred", "length_pred", "width_pred", "price_pred")
price_m = mean(diamonds_subset$price)
colnames(gam_5_plots) = cnn
#length plot
plotl5 <- ggplot(data = cat_data, aes(x = length, y = price)) +
geom_point() +
geom_hline(yintercept = price_m, linetype = 2, color = "blue") +
geom_smooth(color = "red", size = 1.5) +
geom_line(inherit.aes = F,
size = 1.5, color = "gold",
data = gam_5_plots,
aes(x = length, y = length_pred + price_m)) +
ggtitle("log(price) vs. length") +
ylab("log(price)") +
labs(caption = "Plot 8a")
#width plot
plotw5 <- ggplot(data = cat_data, aes(x = width, y = price)) +
geom_point() +
geom_hline(yintercept = price_m, linetype = 2, color = "blue") +
geom_smooth(color = "red", size = 1.5) +
geom_line(inherit.aes = F,
size = 1.5, color = "gold",
data = gam_5_plots,
aes(x = width, y = width_pred + price_m)) +
ggtitle("log(price) vs. width") +
ylab("log(price)") +
labs(caption = "Plot 8b")
#depth plot
plotd5 <- ggplot(data = cat_data, aes(x = depth, y = price)) +
geom_point() +
geom_hline(yintercept = price_m, linetype = 2, color = "blue") +
geom_smooth(color = "red", size = 1.5) +
geom_line(inherit.aes = F,
size = 1.5, color = "gold",
data = gam_5_plots,
aes(x = depth, y = depth_pred + price_m)) +
ggtitle("log(price) vs. depth") +
ylab("log(price)") +
labs(caption = "Plot 8c")
#carat plot
plotc5 <- ggplot(data = cat_data, aes(x = carat, y = price)) +
geom_point() +
geom_hline(yintercept = price_m, linetype = 2, color = "blue") +
geom_smooth(color = "red", size = 1.5) +
geom_line(inherit.aes = F,
size = 1.5, color = "gold",
data = gam_5_plots,
aes(x = carat, y = carat_pred + price_m)) +
ggtitle("log(price) vs. carat") +
ylab("log(price)") +
labs(caption = "Plot 8d")
#resids
plotr5 <- ggplot() +
geom_point(aes(x=gam_mod5$fitted.values, y=gam_mod5$residuals)) +
labs(x = "log(price)", y = "GAM Residuals") +
ggtitle("Residuals of gam_mod4") +
labs(caption = "Plot 8e")
```
```{r, echo = F, warning = F, comment = F, message = F}
grid.arrange(plotl5, plotw5, plotd5, plotc5, plotr5, ncol = 3)
```
### Explaining the roles of `carat` and `width` in the model
As seen in plots 8b and 8d, carat and width have very peculiar GAM lines. We hypothesized that these lines explain the variability and error introduced into the model by the other predictors. When examining `carat`, a plot of the residual error of the model with `depth`, the smoother on `width`, and the smoother on `length` against `carat` demonstrates that the GAM for carat directly explains the variability of the model caused by the 3 remaining quantitative variables. This is evident in plot 9a. Similarly, when examining `width`, a plot of the residual error of the model with `depth`, `carat`, and the smoother on `length` shows that the GAM for `width` also explains the variability of the model caused by the other 3 quantitative predictors. This is depicted on plot 9b.
```{r, include = F}
mod <- cat_data$price-(mean(cat_data$price) + predict(gam_mod5, type = "terms")[,5] + predict(gam_mod5, type = "terms")[,6] + 0.69298443*cat_data$depth)
model_resids_plot <- ggplot() +
geom_point(aes(x = cat_data$carat, y = mod)) +
geom_line(inherit.aes = F,
size = 1.5, color = "gold",
data = gam_5_plots,
aes(x = carat, y = carat_pred - (1/3)*price_m)) +
ggtitle("Error explained by carat") +
labs(caption = "Plot 9a") +
ylab("Y - Y-bar + B(depth) + s(width) + s(length)") +
xlab("carat")
```
```{r, include = F}
mod_w <- cat_data$price-(mean(cat_data$price) + predict(gam_mod5, type = "terms")[,5] + -0.25247515*cat_data$carat + 0.69298443*cat_data$depth)
model_resids_plot_w <- ggplot() +
geom_point(aes(x = cat_data$width, y = mod_w)) +
geom_line(inherit.aes = F,
size = 1.5, color = "gold",
data = gam_5_plots,
aes(x = width, y = width_pred - (1/3)*price_m)) +
ggtitle("Error explained by width") +
labs(caption = "plot 9b") +
ylab("Y - Y-bar + B(depth) + B(carat) + s(length)") +
xlab("carat")
```
```{r, echo = FALSE, fig.width=7, fig.height=4, fig.align="center"}
grid.arrange(model_resids_plot, model_resids_plot_w, ncol = 2)
```
# Discussion
## Proof of our models: examining residuals
```{r, include = F, warning=FALSE}
jhm_resids <- as.data.frame(rfit_final$residuals)
colnames(jhm_resids) <- c("resids")
y_bar_model <- lm(price ~ 1, data = cat_data)
y_bar_resids <- as.data.frame(y_bar_model$residuals)
colnames(y_bar_resids) <- c("resids")
gam_resids <- as.data.frame(gam_mod5$residuals)
colnames(gam_resids) <- c("resids")
ols_resids <- as.data.frame(final_lm$residuals)
colnames(ols_resids) <- c("resids")
#oneplot
all_resids_plot <- ggplot() +
stat_ecdf(inherit.aes = F,
data = jhm_resids, aes(x = resids, color = "blue"), geom = "step", position = "identity") +
stat_ecdf(inherit.aes = F,
data = y_bar_resids, aes(x = resids, color = "red"), geom = "step", position = "identity") +
stat_ecdf(inherit.aes = F,
data = gam_resids, aes(x = resids, color = "green"), geom = "step", position = "identity") +
stat_ecdf(inherit.aes = F,
data = ols_resids, aes(x = resids, color = "orange"), geom = "step", position = "identity") +
scale_x_continuous(name = "residuals") +
xlim(-0.5, 0.5) +
scale_y_continuous(name = "Empirical CDF") +
ggtitle("Comparing GAM, JHM, OLS & null-model residuals") +
labs(caption = "plot 10") +
scale_color_identity(name = "Model",
breaks = c("blue", "red", "green", "orange"),
labels = c("JHM", "Null", "GAM", "OLS"),
guide = "legend") +
xlab("residuals")
```
```{r, warning = FALSE, echo=FALSE, message=FALSE, comment=FALSE, fig.height=4, fig.width=5.5, fig.align="center"}
suppressWarnings(all_resids_plot)
```
As seen in the plot 10 above, the disribution of the GAM, OLS and JHM residuals are all very similar. The null is clearly performing the worst as the eCDF of the residuals is clearly not close to matching a normal CDF. The more vertical the eCDF, the more variability explained in the model (which is indicative of a more efffective model). Because we already found that using a parametric approach is not possible (the residuals are not normally distributed), we will focus primarily on the GAM and JHM.
As seen in plot 11, while both the JHM and GAM residuals follow a fairly normal distribution, the JHM is slighly less vertical than the GAM. This suggests that the GAM might be a slightly better model.
## Using cross validation to assess model fit
We also used a cross-validation appraoch to estimate the adjusted $R^2$, and L1-proportion for each of the our 3 models (OLS, JHM, GAM). Table 6 below depicts the $R^{2}$, $R^2_\text{adj}$ and L1-proportion on the original data as well as $R^{2}$, $R^2_\text{adj}$ and L1-proportion using cross validation.
```{r, include = F}
#Fit statistics for OLS
fit_ols = function(model) {
yy = model$residuals + model$fitted.values
rsq = 1 - sum(model$residuals^2)/sum((yy - mean(yy))^2)
nn = length(yy)
adjrsq = 1 - (1 - rsq)*((nn - 1)/(nn - length(model$coefficients)))
propL1 = 1 - sum(abs(model$residuals))/sum(abs(yy - mean(yy)))
return(cbind(rsq = rsq, adjrsq = adjrsq, propL1 = propL1))
}
#Fit statistics for JHM
fit_jhm = function(model) {
rsq = 1 - sum(model$residuals^2)/sum((model$y - mean(model$y))^2)
nn = length(model$y)
adjrsq = 1 - (1 - rsq)*((nn - 1)/(nn - length(model$coefficients)))
propL1 = 1 - sum(abs(model$residuals))/sum(abs(model$y - mean(model$y)))
return(cbind(rsq = rsq, adjrsq = adjrsq, propL1 = propL1))
}
#Fit statistics for GAM
fit_gam = function(model) {
rsq = 1 - model$deviance/model$null.deviance
adjrsq = 1 - (1 - rsq)*(model$df.null/model$df.residual)
propL1 = 1 - sum(abs(model$residuals))/sum(abs(model$y - mean(model$y)))
return(cbind(rsq = rsq, adjrsq = adjrsq, propL1 = propL1))
}
#General fit statistics
fit_gen = function(y, res, df){
rsq = 1 - sum(res^2)/sum((y - mean(y))^2)
nn = length(y)
adjrsq = 1 - (1 - rsq)*((nn - 1)/(nn - df))
propL1 = 1 - sum(abs(res))/sum(abs(y - mean(y)))
return(cbind(rsq = rsq, adjrsq = adjrsq, propL1 = propL1))
}
#My cross-validation function for this project
cv_rmc = function(dat, ols_mod, jhm_mod, gam_mod, k = 5, m = 10){
#(Some) error checking
if(class(ols_mod) != "lm") stop('ols_mod should come from the lm() function')
if(class(jhm_mod) != "rfit") stop('jhm_mod should come from the rfit() function')
if(class(gam_mod)[1] != "Gam") stop('gam_mod should come from the gam() function')
#Create model call character strings with subsetted data; uses stringr f()s
dat.name = paste0("data = ", deparse(substitute(dat)))
ols_call = capture.output(ols_mod$call)
ols_call = str_replace(ols_call, dat.name, "data = dat[-part[[i]], ]")
jhm_call = capture.output(jhm_mod$call)
jhm_call = str_replace(jhm_call, dat.name, "data = dat[-part[[i]], ]")
gam_call = paste(str_trim(capture.output(gam_mod$call)), sep="", collapse="")
gam_call = str_replace(gam_call, dat.name, "data = dat[-part[[i]], ]")
#Set up objects
ols_fit = matrix(nrow = m, ncol = 3)
jhm_fit = ols_fit; gam_fit = ols_fit
yy = jhm_mod$y
nn = dim(as.data.frame(dat))[1]
oos_lmres = vector(length = nn)
oos_jhres = oos_lmres; oos_gares = oos_lmres
df_ols = length(ols_mod$coefficients)
df_jhm = length(jhm_mod$coefficients)
df_gam = nn - gam_mod$df.residual
#Repeat k-fold cross-validation m times
for(j in 1:m) {
#Split data into k equal-ish parts, with random indices
part = suppressWarnings(split(sample(nn), 1:k))
#Execute model calls for all k folds; %*% is matrix multiplication
for(i in 1:k){
lm_mod = eval(parse(text = ols_call))
5
pred = predict(object = lm_mod, newdata = dat[part[[i]],])
oos_lmres[part[[i]]] = yy[part[[i]]] - pred
jh_mod = eval(parse(text = jhm_call))
subdat = select(.data = dat, colnames(jh_mod$x)[-1])[part[[i]],]
subdat = cbind(1, as.matrix.data.frame(subdat))
pred = subdat %*% jh_mod$coefficients
oos_jhres[part[[i]]] = yy[part[[i]]] - pred
ga_mod = eval(parse(text = gam_call))
pred = predict(object = ga_mod, newdata = dat[part[[i]],])
oos_gares[part[[i]]] = yy[part[[i]]] - pred
}
ols_fit[j, ] = fit_gen(y = yy, res = oos_lmres, df = df_ols)
jhm_fit[j, ] = fit_gen(y = yy, res = oos_jhres, df = df_jhm)
gam_fit[j, ] = fit_gen(y = yy, res = oos_gares, df = df_gam)
}
#Manage output -- average fit statistics
outtie = rbind(colMeans(ols_fit), colMeans(jhm_fit), colMeans(gam_fit))
colnames(outtie) = paste0("cv.", colnames(fit_ols(lm_mod)))
row.names(outtie) = c("OLS", "JHM", "GAM")
return(outtie)
}
```
```{r, echo = F}
fit_final = rbind(fit_ols(final_lm), fit_jhm(rfit_final), fit_gam(gam_mod4))
rownames(fit_final) = c("OLS", "JHM", "GAM")
```
```{r, include = F}
#create temp dataset to pass into dat
cat_data_temp <- cat_data %>%
mutate(clarityIF = ifelse(clarity == "IF", 1, 0),
claritySI1 = ifelse(clarity == "SI1", 1, 0),
claritySI2 = ifelse(clarity == "SI2", 1, 0),
clarityVS1 = ifelse(clarity == "VS1", 1, 0),
clarityVS2 = ifelse(clarity == "VS2", 1, 0),
clarityVVS1 = ifelse(clarity == "VVS1", 1, 0),
clarityVVS2 = ifelse(clarity == "VVS2", 1, 0)) %>%
mutate(colorE = ifelse(color == "E", 1, 0),
colorF = ifelse(color == "F", 1, 0),
colorG = ifelse(color == "G", 1, 0),
colorH = ifelse(color == "H", 1, 0),
colorI = ifelse(color == "I", 1, 0),
colorJ = ifelse(color == "J", 1, 0)) %>%
select(-color, -clarity)
#build new jhm model using temp dataset
rfit_final_temp <- rfit(price ~ width + carat + depth + length + clarityIF +
claritySI1 + claritySI2 + clarityVS1 + clarityVS2 +
clarityVVS1 + clarityVVS2 + colorE + colorF + colorG
+ colorH + colorI + colorJ,
data = cat_data_temp)
ols_final_temp <- lm(price ~ width + carat + depth + length + clarityIF +
claritySI1 + claritySI2 + clarityVS1 + clarityVS2 +
clarityVVS1 + clarityVVS2 + colorE + colorF + colorG
+ colorH + colorI + colorJ,
data = cat_data_temp)
gam_mod5_temp <- gam_mod5 <- gam(price ~ clarityIF + claritySI1 + claritySI2 +
clarityVS1 + clarityVS2 +
clarityVVS1 + clarityVVS2 +
colorE + colorF + colorG
+ colorH + colorI + colorJ +
depth + carat +
s(length, df = 6) +
s(width, df = 6),
data = cat_data_temp)
```
```{r, include = F, warning=F}
out10 <- cv_rmc(dat = cat_data_temp, ols_mod = ols_final_temp, jhm_mod = rfit_final_temp, gam_mod = gam_mod5_temp)
kable(round(out10,4)) %>% kable_styling(position = "center")
non_cv_res <- fit_final
cv_res <- out10
model_fit_df <- cbind(non_cv_res, cv_res)
model_fit_df <- round(model_fit_df, 4)
types <- c("OLS", "JHM", "GAM")
model_fit_df <- cbind(types, model_fit_df)
model_fit_df <- data.frame(model_fit_df)
model_fit_df <- model_fit_df[ , -1]
colnames(model_fit_df) <- c("\\(R^2\\)",
"\\(R^2_\\text{adj}\\)",
"\\(L1_{\\text{prop}}\\)",
"\\(R^2\\)",
"\\(R^2_\\text{adj}\\)",
"\\(L1_{\\text{prop}}\\)")
# making table for kable
table_model_fit <- knitr::kable(model_fit_df, format = "latex", booktabs = TRUE, escape = FALSE) %>%
kableExtra::kable_styling(font_size = 10, position = "center", row_label_position = "c", full_width = FALSE) %>%
add_header_above(c("Regular approach" = 4, "Cross-validation approach" = 3)) %>%
add_header_above(c("Table 6: Results from cross-validation" = 7)) %>%
column_spec(column = 4, border_right = TRUE)
```
```{r, echo = F}
table_model_fit
```
Because the 3 models were generated from the same dataaset on which the 3 measures of fit ($R^{2}$, $R^2_\text{adj}$ and L1-proportion) were calculated, the non-cross validation approach is potentially overfit to the data and the values are not as accurate. Cross validation, however, attempts to guard against overfitting. Thus, the values of the right side of Table 7 are how we will compare models.
All 3 models have relatively similar $R^2_\text{adj}$ values. The GAM, however, has an $R^2_\text{adj}$ of 0.9838, which is slightly higher than the OLS and JHM. In examining the $R^2$ values, OLS fails to explain 1.84% of the variability, while GAM fails to explain 1.53% of the variability. Thus, using the GAM model over OLS results in a 16% decrease in unexplained variability. It is important to note that we cannot fairly use $R^2$ for JHM because of its ability to "ignore" outliers. The L1 proportion, however, denotes the absolute value of the residuals over the absolute value of the deviation of the mean (i.e. proportion of error explained by the model). Notably, the L1 proportion value is also higher for the GAM. Based on our cross-validation results, the GAM is the best-performing model.
# Limitations and Challenges
There are two main limitations in our work. Our main statistical concern was the high level of multi-colinearity between `carat`, `length`, `width`, and `depth`. This overlap between predictors made it difficult to create a model that didn't incorporate variables that explained the same variation in log(`price`). This was most evident in creating our GAM. We were able to include SLR fits on two of these inter-related variables as a smoother was really only necessary on one - it explained the slight non-linear variability of the 3 remaining co-linear variables.
Our second concern arose from the dataset itself. We did not know what year (or group of years) the diamonds from within the dataset arose. Because of this, we will not be adjust for inflation when predicting diamond price in alternative years.
Overall, the main challenge we faced was dealing with the multicolinearity between our 4 main quantitative predictors. We found it difficult to create models using our standard techniques as the results of our models were almost always riddled with impacts of multicollinearity. When building our GAM, for example, while we knew `length, width, depth and carat` were likely co-linear, removing any of them from the model resulted in an increase in AIC (indicating we couldn't remove them). Thus, our main challenge was finding ways to balance between using multicollinear variables and creating highly predictive models.
# Conclusion
We suggest using a generalized additive model to predicting diamond price. We cannot use a parametric appraoch (i.e. OLS MLR) because the residuals of such a model are not normally distributed (as seen in the K-S test). The JHM model had a lower L1-proportion (from cross validation) when compared to the GAM. Because of this, we propose using a GAM to predict log(`price`) from d`epth, length, width, carat, color, and clarity` (with a smoothing spline on length and width).