---
title: 'Linear Regression'
author: 'Victoria Okereke'
output:
  pdf_document: default
  html_document:
    df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
#importing libraries
library(faraway)
library(visdat)
library(olsrr)
library(lmtest)
library(caret)
library(kernlab)
library(ipred)
#setting seed
set.seed(123)
```
Aim: To predict wage from the uswages dataset in the faraway library.
Data Exploration
```{r}
#reading in dataset
data("uswages")
#viewing data structure
str(uswages)
#viewing first 6 rows of data
head(uswages)
#viewing the pattern of missingness
vis_miss(uswages)
```
There is no missing data, so we do not need to worry about missingness.
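As a quick numeric confirmation of the visual check (a minimal base-R sketch):
```{r}
#counting missing values per column as a numeric cross-check
colSums(is.na(uswages))
```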
A careful review of the data shows that the columns ne, mw, so, and we appear to be dummy codings of the same categorical variable (region), so we will drop one of them from the model.
```{r}
#dropping the 'we' variable
uswages_reduced = uswages[-c(9)]
#fitting the linear regression model
uswages_reg = lm(wage ~., data = uswages_reduced)
#getting a summary statistics
summary(uswages_reg)
```
Some variables are not significant. Let's use a variable selection method to retain only significant variables in the model.
```{r}
#performing a stepwise both ways variable selection
kstep_both = ols_step_both_p(uswages_reg,pent=0.1,prem=0.05)#,details = TRUE)
kstep_both
#fitting the selected model
uswages_reg_final = lm(wage~ educ + exper + pt + smsa + race,data = uswages_reduced)
summary(uswages_reg_final)
```
From the summary statistics above, we see that all variables in the model are now significant. We also notice a low R-squared value of 0.1977 and an Adjusted R-squared of 0.1957.
Regression function:
yhat = -243.4879 + 48.6616(educ) + 9.0798(exper) - 336.9503(pt) + 115.5466(smsa) - 124.9292(race)
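Rather than copying these estimates by hand, they can also be pulled directly from the fitted object (a small sketch):
```{r}
#extracting the estimated coefficients from the fitted model
round(coef(uswages_reg_final), 4)
```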
Now let's check to see if all the linear regression assumptions are met.
Checking for Multicollinearity
```{r}
vif(uswages_reg_final)
```
All VIFs are below 10, so there is no evidence of multicollinearity in the data.
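A programmatic version of this rule-of-thumb check (a minimal sketch):
```{r}
#verifying that every VIF is below the common threshold of 10
all(vif(uswages_reg_final) < 10)
```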
Checking for normality assumption
```{r}
#Obtaining the standardized residuals
stdres = rstandard(uswages_reg_final)
#Normal probability plot of the standardized residuals
qqnorm(stdres)
qqline(stdres)
```
The QQ plot above shows a heavy upper tail, which suggests that the model may be violating the normality assumption. Let's confirm with the Shapiro-Wilk test.
```{r}
shapiro.test(uswages_reg_final$residuals)
```
Ho: residuals are normally distributed
Ha: residuals are not normally distributed
The p-value is < 2.2e-16, so we reject the null hypothesis and conclude that the residuals are not normally distributed. This confirms that the model fails the normality assumption.
Let's check for constant variance
```{r}
#obtaining the residual
ei = uswages_reg_final$residuals
Y_hat = uswages_reg_final$fitted.values
#scatter plot of the residuals against fitted values Y
plot(Y_hat,ei,main = "Residuals vs. Fitted Values Y",
xlab = "Fitted Values Y",ylab = "Residuals")
```
The plot above shows that the error variance is not constant. We also notice some outliers. The plot does, however, suggest that the relationship is linear.
```{r}
#conducting the Breusch-Pagan test to confirm
bptest(uswages_reg_final, studentize = FALSE)
```
Ho: Error variance is constant
Ha: Error variance is not constant
From the results above, the p-value (< 2.2e-16) is less than 0.05, so we reject the null hypothesis and conclude that the error variance is not constant. Therefore the model also violates the constant variance assumption.
Let's investigate the outliers
```{r}
#Checking for outliers
ols_plot_resid_stud_fit(uswages_reg_final)
ols_plot_resid_stud(uswages_reg_final)
```
From the plots above, we notice a lot of outlying observations.
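To list the specific observations behind these flags, one option (a sketch, using the common |studentized residual| > 3 cutoff) is:
```{r}
#flagging observations whose studentized residuals exceed 3 in absolute value
stud_res = rstudent(uswages_reg_final)
which(abs(stud_res) > 3)
```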
Let's try to improve the R-squared of our model by transforming the data. To determine the type of transformation needed, we use the Box-Cox method.
```{r}
library(MASS)
par(mfrow=c(1,1))
boxcox(uswages_reg_final,lambda=seq(-1,1,by=.1))
```
The Box-Cox plot suggests a lambda close to zero, which indicates a log transformation of the outcome variable.
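To read off the numeric lambda rather than eyeballing the plot, the boxcox output can be captured (a small sketch; plotit = FALSE suppresses a duplicate plot):
```{r}
#locating the lambda that maximizes the profile log-likelihood
bc = boxcox(uswages_reg_final, lambda = seq(-1, 1, by = 0.1), plotit = FALSE)
bc$x[which.max(bc$y)]
```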
```{r}
#fitting the model with a log-scale of the response variable
uswages_reg_trans = lm(log(wage) ~., data = uswages_reduced)
#performing a stepwise both ways variable selection
kstep_both_trans = ols_step_both_p(uswages_reg_trans,pent=0.1,prem=0.05)#,details = TRUE)
kstep_both_trans
#refitting the selected model
uswages_reg_trans_final = lm(log(wage)~ educ + exper + pt + smsa + race,data = uswages_reduced)
summary(uswages_reg_trans_final)
```
Regression function:
log(yhat) = 4.725711 + 0.086566(educ) + 0.016037(exper) - 1.098583(pt) + 0.174543(smsa) - 0.211327(race)
Comparing the output from the transformed and untransformed models, we see that after transforming the response variable, the Adjusted R-squared increased substantially from 0.1957 to 0.3827. We also see from the plot of residuals against fitted values below that the pattern looks more random than in the untransformed model.
```{r}
plot(uswages_reg_trans_final$fitted.values,uswages_reg_trans_final$residuals)
```
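For reference, the adjusted R-squared values quoted above can be pulled directly from both model summaries (a quick sketch):
```{r}
#comparing adjusted R-squared before and after the log transformation
c(untransformed = summary(uswages_reg_final)$adj.r.squared,
  transformed = summary(uswages_reg_trans_final)$adj.r.squared)
```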
Note that we could check whether the outliers are still present; if so, we could fit a robust regression model, which is less sensitive to outliers (methods to consider include Huber and bisquare regression, as sketched below).
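A minimal sketch of such a robust fit using rlm from MASS (already loaded above), with the Huber and bisquare weight functions supplied through the psi argument:
```{r}
#robust regression on the same predictors: Huber and bisquare M-estimation
uswages_huber = rlm(log(wage) ~ educ + exper + pt + smsa + race,
                    data = uswages_reduced, psi = psi.huber)
uswages_bisq = rlm(log(wage) ~ educ + exper + pt + smsa + race,
                   data = uswages_reduced, psi = psi.bisquare)
summary(uswages_huber)
```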
Finally, from our regression function, we can make predictions. For instance, what would be the expected wage of an individual with educ = 15, exper = 30, pt = 0, smsa = 1, and race = 1?
From our regression function:
log(yhat) = 4.725711 + 0.086566(educ) + 0.016037(exper) - 1.098583(pt) + 0.174543(smsa) - 0.211327(race)
```{r}
yhat = exp(4.725711 + (0.086566*15) + (0.016037*30) - (1.098583*0) + (0.174543*1) - (0.211327*1))
print(yhat)
```
Our model predicts a wage of 644.5336.
Let's verify this with the predict function in R.
```{r}
#create a dataframe with the new observation
data = data.frame(educ=15,exper=30,pt=0,smsa=1,race=1)
#predict confidence interval
yh = predict(uswages_reg_trans_final,data,se.fit=TRUE, interval = "confidence",
level = 0.95)
#taking the exp of the fit
fit_yh = exp(c(yh$fit[,1]))
#obtaining the lower limits
lower <- exp(c(yh$fit[,2]))
#obtaining the upper limits
upper <- exp(c(yh$fit[,3]))
print(fit_yh)
print(lower)
print(upper)
```
Our result is essentially the same: wage is predicted to be 644.5374 (the small difference from the hand calculation is due to rounding of the coefficients), with a 95% confidence interval of (584.5296, 710.7056). The interval does not include zero, so the predicted wage is significantly different from zero.
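Note that this is a confidence interval for the mean log(wage) at these predictor values; for a single individual's wage, a (wider) prediction interval could be obtained the same way (a sketch):
```{r}
#prediction interval for an individual observation, back-transformed to the wage scale
yh_pred = predict(uswages_reg_trans_final, data, interval = "prediction", level = 0.95)
exp(yh_pred)
```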