---
title: "Machine Learning in R"
author: "Mike Jadoo"
date: "2024-01-07"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Machine Learning
Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way that humans learn.
Why is this important? Because these methods learn from data, their accuracy can gradually improve.
Load packages:

- dplyr - for data wrangling
- ggplot2, corrgram, gridExtra, and ggpubr - for visualization
- caTools - versatile tool for moving-window statistics; also provides sample.split for train/test splits
- caret - powerful train function that allows you to fit over 230 different models using one syntax
- car - Companion to Applied Regression; provides regression diagnostics such as qqPlot and vif
- lmtest - tests for linear models, including the Breusch-Pagan test (bptest)
- reshape - reshapes data between wide and long formats (melt)
```{r load_pk}
# for data wrangling
library(dplyr)
# for data visualization
library(ggplot2)
library(corrgram)
library(gridExtra)
library(ggpubr)
# for machine learning and model diagnostics
library(caTools)
library(caret)
library(car)
library(lmtest)
library(reshape)
options(max.print = 999999)
```
## Import data
Read in the Fish dataset and inspect its structure.
```{r data, echo=TRUE}
df <- read.csv('F:/YOUR DIRECTORY/Fish.csv')
head(df)
# structure of the data
str(df)
```
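The drive-letter path above is a placeholder. If Fish.csv sits in the same folder as this .Rmd, a relative path avoids hard-coding a directory (a sketch; the file location is an assumption, so this chunk is not evaluated):
```{r data_alt, eval=FALSE}
# alternative: read from the working directory (assumes Fish.csv is saved there)
df <- read.csv('Fish.csv')
```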
## Exploratory Data Analysis - view the data summary statistics
```{r stats, echo=TRUE}
summary(df)
```
## Check for missing observations
```{r miss, echo=TRUE}
# count total missing values
print("Count of total missing values")
sum(is.na(df))
print("Which columns have missing values")
colSums(is.na(df))
```
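If any values had been missing, one minimal option is to drop the affected rows (shown only as a sketch, so the chunk is not evaluated):
```{r miss_handle, eval=FALSE}
# drop rows containing any NA (only needed if missing values were found above)
df <- na.omit(df)
```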
```{r plot, echo=TRUE}
# drop the categorical Species column so only numeric variables remain
dfm1 <- df %>% select(-Species)
set.seed(100)
boxplot(dfm1)
# facet_wrap to avoid code duplication
facet_plot_vars <- function(df, title_text) {
  df_melted <- melt(df, id.vars = "Weight")
  ggplot(data = df_melted, aes(value, Weight)) +
    geom_point() +
    facet_wrap(~variable, scales = "free") +
    labs(x = "value", title = title_text)
}
facet_plot_vars(dfm1, "Weight vs. all variables")
```
## Cross Validation
Training the model on different splits of the data lets us validate it: we check that the performance on each split falls within a reasonable range, which gives confidence that the model performs well.
Types of validation:

- Random split
- Time-based split
- K-fold cross-validation

We are using a random split (similar to bootstrapping) to randomly sample a percentage of the data into training and testing sets. The advantage of this method is that there is a good chance the original population is well represented in both sets; the hope is that random splitting prevents a biased sampling of the data.
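For comparison, k-fold cross-validation could be run through caret's train function (a minimal sketch, not used in the rest of this analysis; the linear model and five-fold setup are my choices, so the chunk is not evaluated):
```{r kfold_sketch, eval=FALSE}
# minimal k-fold sketch with caret (assumes dfm1 from the plotting chunk above)
ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation
cv_model <- train(Weight ~ ., data = dfm1, method = "lm", trControl = ctrl)
cv_model$results  # RMSE, R-squared, and MAE averaged across the folds
```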
```{r cross_val, echo=TRUE}
# create the training and test data with an 80/20 split
sampleSplit <- sample.split(Y = dfm1$Weight, SplitRatio = 0.8)
trainSet <- subset(x = dfm1, sampleSplit == TRUE)
testSet <- subset(x = dfm1, sampleSplit == FALSE)
# train the model
model1 <- lm(formula = Weight ~ ., data = trainSet)
# view results
summary(model1)
# residuals-vs-fitted plot and a Q-Q plot of the residuals
plot(model1, which = 1, col = c("blue"))
qqPlot(model1$residuals)
```
```{r train1, echo=TRUE}
# use R2 on the training and test sets to check for overfitting
R2_training1 <- cor(model1$fitted.values, trainSet$Weight)^2
# evaluate the trained model on the held-out data rather than refitting it
pred_test1 <- predict(model1, newdata = testSet)
R2_test1 <- cor(pred_test1, testSet$Weight)^2
message('Checking for Overfitting', '\n', 'Training data R2: ', format(R2_training1, digits = 3), ' Test data R2: ', format(R2_test1, digits = 3))
```
## Model evaluation measures
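For reference, the measures computed below are

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
R^2 = \operatorname{cor}\left(\hat{y}, y\right)^2,
$$

where $y_i$ are the observed weights and $\hat{y}_i$ the fitted values; $R^2$ is written as the squared correlation to match how the code computes it.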
```{r evaluation, echo=TRUE}
# Mean Squared Error (MSE): the average squared difference between the
# predicted and actual values of the target variable
res <- model1$residuals
mse <- mean(res^2)
# Root Mean Squared Error (RMSE): the square root of the mean squared error
rmse <- sqrt(mean(res^2))
# R2 score
R2_ <- cor(model1$fitted.values, trainSet$Weight)^2
message('Model 1 evaluation measures', '\n', 'R2: ', format(R2_, digits = 3), ' RMSE: ', format(rmse, digits = 3), ' MSE: ', format(mse, digits = 6))
```
```{r chk_assmp, echo=TRUE}
# check the linear model assumptions
# normality test (null hypothesis: residuals are normal, so we do not want to reject)
shapiro.test(model1$residuals)
# homoscedasticity test - assumption of equal or similar variances
# (null hypothesis: homoscedasticity, so we do not want to reject)
bptest(model1)
cat("Multicollinearity test\n")
# multicollinearity test (checks whether the predictor variables have strong relationships with each other)
vif(model1)
```
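For reference, the variance inflation factor reported by vif() for predictor $j$ is

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
$$

where $R_j^2$ is the $R^2$ from regressing predictor $j$ on the remaining predictors; values above roughly 5 to 10 are commonly read as a sign of problematic multicollinearity.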
## Research tip
Our model has variables that are not significant, and some of our linear model assumptions failed. Why?
What do Length1 and Length2 represent? Let's look at the slides.
## Tuning the model
## New model
```{r nm, echo=TRUE}
# keep only the significant variables
dfm2 <- df %>% select(Height, Width, Length1, Weight)
set.seed(100)
# facet_wrap to avoid code duplication
facet_plot_vars(dfm2, "Weight vs. new model's variables")
```
```{r train2, echo=TRUE}
# create the training and test data with an 80/20 split
sampleSplitm2 <- sample.split(Y = dfm2$Weight, SplitRatio = 0.8)
trainSetm2 <- subset(x = dfm2, sampleSplitm2 == TRUE)
testSetm2 <- subset(x = dfm2, sampleSplitm2 == FALSE)
# train the model
model2 <- lm(formula = Weight ~ ., data = trainSetm2)
summary(model2)
```
```{r resid2, echo=TRUE}
# residuals-vs-fitted plot and a Q-Q plot of the residuals
plot(model2, which = 1, col = c("blue"))
qqPlot(model2$residuals)
```
## Check for overfitting
Overfitting happens for several reasons, such as:

- The training data size is too small and does not contain enough samples to accurately represent all possible input data values.
- The training data contains large amounts of irrelevant information, called noisy data.
- The model trains for too long on a single sample set of data.
- The model complexity is high, so it learns the noise within the training data (see the toy sketch after this list).
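As a toy illustration of the complexity point, an over-flexible model (a degree-8 polynomial, my choice, not part of the original analysis, so the chunk is not evaluated) typically fits the training split far better than the test split:
```{r overfit_demo, eval=FALSE}
# deliberately over-complex model: degree-8 polynomial in Length1
over_fit <- lm(Weight ~ poly(Length1, 8), data = trainSetm2)
rmse_train <- sqrt(mean(residuals(over_fit)^2))
rmse_test <- sqrt(mean((testSetm2$Weight - predict(over_fit, newdata = testSetm2))^2))
message('Toy overfit demo - train RMSE: ', format(rmse_train, digits = 3), ' test RMSE: ', format(rmse_test, digits = 3))
```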
```{r overfit2, echo=TRUE}
# use R2 on the training and test sets to check for overfitting
R2_training <- cor(model2$fitted.values, trainSetm2$Weight)^2
# evaluate the trained model on the held-out data rather than refitting it
pred_test2 <- predict(model2, newdata = testSetm2)
R2_test2 <- cor(pred_test2, testSetm2$Weight)^2
message('Checking for Overfitting for second model', '\n', 'Training data R2: ', format(R2_training, digits = 3), ' Test data R2: ', format(R2_test2, digits = 3))
```
If the test-set R2 is comparable to the training R2, the model isn't overfitted.
## Model evaluation measures
```{r evaluation2, echo=TRUE}
# calculate evaluation measures on the test set, using the predictions from above
# Mean Squared Error (MSE): the average squared difference between the
# predicted and actual values of the target variable
res <- testSetm2$Weight - pred_test2
mse2 <- mean(res^2)
# Root Mean Squared Error (RMSE): the square root of the mean squared error
rmse2 <- sqrt(mean(res^2))
# R2 score
R2_test2 <- cor(pred_test2, testSetm2$Weight)^2
message('Model 2 evaluation measures', '\n', 'R2: ', format(R2_test2, digits = 3), ' RMSE: ', format(rmse2, digits = 3), ' MSE: ', format(mse2, digits = 6))
```
```{r ckassmp, echo=TRUE}
# check the linear model assumptions for the fitted second model
# normality test (null hypothesis: residuals are normal, so we do not want to reject)
shapiro.test(model2$residuals)
# homoscedasticity test - assumption of equal or similar variances
# (null hypothesis: homoscedasticity, so we do not want to reject)
bptest(model2)
cat("Multicollinearity test\n")
# multicollinearity test (checks whether the predictor variables have strong relationships with each other)
vif(model2)
```