-
Notifications
You must be signed in to change notification settings - Fork 0
/
redwine.Rmd
616 lines (340 loc) · 27.8 KB
/
redwine.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
Red wine analysis by sujay
========================================================
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
#install.packages('gridExtra')
#install.packages("ggplot2", dependencies = T)
#install.packages("knitr", dependencies = T)
#install.packages("dplyr", dependencies = T)
#install.packages('GGally')
#install.packages('RColorBrewer', dependencies = TRUE)
#install.packages('scales')
#install.packages('memisc')
library(scales)
library(memisc)
library(RColorBrewer)
library(gridExtra)
library(GGally)
library(knitr)
library(ggplot2)
```
```{r echo=FALSE, Load_the_Data}
pf <- read.csv('wineQualityReds.csv')
```
# introduction
This project will apply exploratory data analysis techniques using R to a dataset about chemical properties of Red Wines in order to answer the following question: "Which chemical properties influence the quality of red wines?" The dataset to be explored contains information about 1,599 red wines with 11 variables on the chemical properties of the wine. The quality of each wine is rated between 3(bad) and 8(good) by at least 3 wine experts.
# Univariate Plots
## summary of data
```{r echo=FALSE, Univariate_Plots}
dim(pf)
```
This data consist of 1599 rows and 13 columns.
```{r echo=FALSE, Univariate_Plots1}
names(pf)
```
Names of 13 columns.
```{r echo=FALSE, Univariate_Plots2}
str(pf)
```
```{r echo=FALSE, Univariate_Plots3}
summary(pf)
```
summary of all the variables
## Observations from the Summary
1.There is a big range for sulfur.dioxide (both Free and Total) across the samples.
2.The sample consists of 1599 Red Wine .
3.The alcohol content varies from 8.00 to 14.90 for the samples in dataset.
4.The quality of the samples range from 3 to 8 with 6 being the median.
5.The range for fixed acidity is quite high with minimum being 4.60 and maximum being 15.9.
6.pH value varies from 2.740 to 4.010 with a median being 3.310.
## Distribution across quality
```{r echo=FALSE, Univariate_Plots10,message=FALSE, warning=FALSE}
qplot(x = quality, data =pf)
```
## Normal Distribution of Frequency
The following variables have a normal or close-to-normal distribution: fixed.acidity, volatile.acidity, density, pH and alcohol. Distribution of the variable citric.acid frequency is not normal.
```{r echo=FALSE, Univariate_Plots4,message=FALSE, warning=FALSE}
create_qplot <- function(variable,p) {
return(qplot(x = variable, data = pf)+ xlab(p))
}
n1<-create_qplot(pf$fixed.acidity,"Fixed acidity")
n2<-create_qplot(pf$volatile.acidity,"Volatile acidity")
n3<-create_qplot(pf$citric.acid,"Citric.acid")
n4<-create_qplot(pf$density,"Density")
n5<-create_qplot(pf$pH,"pH")
n6<-create_qplot(pf$alcohol,"Alcohol")
grid.arrange(n1,n2,n3,n4,n5,n6,ncol = 2)
```
## Transforming Data
The following list of variables is not a normal or close-to-normal distribution: residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates. The hitograms of all these variables are right-skewed a lot and need some transformation.
```{r echo=FALSE, Univariate_Plots5,message=FALSE, warning=FALSE}
l1<-create_qplot(pf$residual.sugar,"residual sugar")
l2<-create_qplot(pf$chlorides,"Chlorides")
l3<-create_qplot(pf$free.sulfur.dioxide,"Sulphur dioxide")
l4<-create_qplot(pf$total.sulfur.dioxide,"Total sulfur dioxide")
l5<-create_qplot(pf$sulphates,"sulphates")
grid.arrange(l1, l2, l3, l4 ,l5 , ncol = 2)
```
After tranforming data using the log10-transformation to make the data look more like normal distribution.
```{r echo=FALSE, Univariate_Plots6,message=FALSE, warning=FALSE}
l1a<-create_qplot(log10(pf$residual.sugar),"Residual sugar ")
l2a<-create_qplot(log10(pf$chlorides),"Chorides")
l3a<-create_qplot(log10(pf$free.sulfur.dioxide),"free sulfur dioxide ")
l4a<-create_qplot(log10(pf$total.sulfur.dioxide),"total sulfur dioxide")
l5a<-create_qplot(log10(pf$sulphates),"sulphates")
grid.arrange(l1, l2a, l3a, l4a ,l5a , ncol = 2)
```
## Adding new variables
### 1.sd_prop
I decided to create a new variable, sd_prop, the proportion of free sulfur dioxide to total sulfur dioxide. I found it interesting that there were a few values that were more common, mostly notably 0.5. Overall, the new variable looks fairly normally distributed.
```{r echo=FALSE, Univariate_Plots7,message=FALSE, warning=FALSE}
pf$sd_prop<-(pf$free.sulfur.dioxide/pf$total.sulfur.dioxide)
qplot(x = sd_prop, data = pf)+
geom_histogram(stat ="count")
```
### 2.quality.cat
The variable quality is of numeric type int, which is not convenient for the analysis. So first of all, I change the type of the quality variable to factor and add it to the dataframe as a new variable quality.factor. In addition, I create three categories of quality - good (>= 7), bad (<=4), and medium (5 and 6).For that I created new variable called "quality.cat" in data, In this variable good means quality 7 and 8 wines ,medium consist of quality 5 and 6 wines and bad wines are of quality 3 and 4.
```{r echo=FALSE, Univariate_Plots8,message=FALSE, warning=FALSE}
# Creating quality.cat variable in data
pf$quality.factor <- factor(pf$quality)
pf$quality.cat <- NA
pf$quality.cat <- ifelse(pf$quality>=7, 'good', NA)
pf$quality.cat <- ifelse(pf$quality<=4, 'bad', pf$quality.cat)
pf$quality.cat <- ifelse(pf$quality==5, 'medium', pf$quality.cat)
pf$quality.cat <- ifelse(pf$quality==6, 'medium', pf$quality.cat)
pf$quality.cat <- factor(pf$quality.cat, levels = c("bad", "medium", "good"))
```
```{r echo=FALSE, Univariate_Plots12,message=FALSE, warning=FALSE}
ggplot(aes(x = quality.cat), data =pf)+
geom_histogram(stat="count")
```
# Univariate Analysis
### What is the structure of your dataset?
I chose to work with the red wine dataset suggested by Udacity. It contains 1,599 observations of 12 variables. All of the variables are numeric, except quality, which is an integer. Other than the quality rating, all of the variables are
different chemical properties of wine.
### What is/are the main feature(s) of interest in your dataset?
My main interest in the dataset is to explore and better understand which chemical properties influence the quality rating of red wine. Based on the dataset documentation, I think that volatile acidity and citric acid might have the greatest impact on quality.
### What other features in the dataset do you think will help support your \investigation into your feature(s) of interest?
I think that it is possible that any of the other chemical variables could impact the quality rating, but I think that the relationships may be hard to quantify because they may not be linear. I think that having so many wines rated a 5 or 6 will make it challenging to put together a strong model.
### Did you create any new variables from existing variables in the dataset?
I created a new variable called sulfur dioxide proportion, which is the proportion of free sulfur dioxide to total sulfur dioxide. Since these two variables naturally have a relationship, I thought that combining them might be beneficial to the analysis.I also created new variable called quality factor for ease of calculation and quality.cat which divides quality in good (>= 7), bad (<=4), and medium (5 and 6)
### Of the features you investigated, were there any unusual distributions? \Did you perform any operations on the data to tidy, adjust, or change the form \of the data? If so, why did you do this?
Some of the variables had a slight right skew (residual sugar, sulphates) but overall were still fairly normal. Free sulfur dioxide and total sulfur dioxide had a more pronounced right skew, but when I combined them into a new variable, it was normally distributed.
Chlorides had a more pronounced right skew, so I transformed it using a log transformation. I did this to see if it would become more normally distributed, so that it could potentially be more useful in a model, or in a graph. Citric acid also had a noticeable right skew, but the log transformation simply shifted it in the opposite direction.Due to this pattern, I'm not sure this transformed variable will be useful.
There were a few variables with noticeable outliers (volatile acidity, residual sugar, sulphates) that may need to be removed to create useful plots or for use in a model.
# Bivariate Plots Section
In this section, I analyse relationships between wine characteristics and its perceived quality, as well as possible correlations between different characteristics.
Lets look at correlation between different variables.
```{r echo=FALSE, Bivariate_Plots0,message=FALSE, warning=FALSE}
pf$X <- NULL #X variable is removed since it is just ID number and hence not required in analysis
cor(pf[sapply(pf, is.numeric)])
```
I removed X variable in the data which is not used in any further analysis.
```{r echo=FALSE, Bivariate_Plots,message=FALSE, warning=FALSE}
ggcorr(pf)
```
```{r echo=FALSE, Bivariate_Plots1,message=FALSE, warning=FALSE}
cor(x=pf[sapply(pf, is.numeric)], y=pf$quality)
```
From above we see that the following variables are correlated with quality:
Alcohol (+++)
Volatile acidity (-)
Citric acid (++)
Fixed acidity (+)
Sulphates (+)
Total sulphur dioxide (-)
Density (-)
Chlorides (-)
In order to compare statistical data of each variable visually, I use boxplots.
## Increasing Quality of Wine
The following set of boxplots show all the cases when peceived wine quality increases together with increasing values of a characteristic's variable.
```{r echo=FALSE, Bivariate_Plots2,message=FALSE, warning=FALSE}
# I have created function to produce qplots and then store them in the variable for later use in grid.arrange()
create_qplot1 <- function(variable,p)
return(qplot(x = quality.cat,y = variable, data = pf)+
geom_boxplot(alpha = .8,color = 'blue') +
ylab(p) +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)+
geom_jitter(alpha = .1))
p1up<-create_qplot1(pf$alcohol,"Alcohol")
p2up<-create_qplot1(pf$sulphates,"Sulphates")
p3up<-create_qplot1(pf$fixed.acidity,"fixed acidity")
p4up<-create_qplot1(pf$citric.acid,"citric acid")
grid.arrange(p1up, p2up, p3up, p4up, ncol = 2)
```
## Decreasing Quality of Wine
The next set of boxplots, on the contrary, show all the cases when peceived wine quality decreases while the values of variables increase.
```{r echo=FALSE, Bivariate_Plots3,message=FALSE, warning=FALSE}
p1d<-create_qplot1(pf$volatile.acidity,"volatile acidity")
p2d<-create_qplot1(pf$pH,"pH")
p3d<-create_qplot1(pf$density,"Density")
grid.arrange(p1d, p2d, p3d , ncol = 2)
```
We already saw a tendency in the boxplots. But let's use a scatter plot here, including a linear regression line.
```{r echo=FALSE, Bivariate_Plots4,message=FALSE, warning=FALSE}
# I have created function to produce ggplots
create_ggplot1 <- function(variable,n)
return(ggplot(aes(x = pf$quality, y = variable), data = pf)+
geom_jitter(alpha = 1/3) +
geom_smooth(method = "lm", aes(group = 1))+
xlab("Wine Quality")+
ylab(n))
create_ggplot1(pf$alcohol,"Alcohol")
```
```{r echo=FALSE, Bivariate_Plots5,message=FALSE, warning=FALSE}
create_ggplot1(pf$citric.acid,"Citric Acid")
```
We can see a tendency. Good wines tend to have higher citric acid levels.
```{r echo=FALSE, Bivariate_Plots6,message=FALSE, warning=FALSE}
create_ggplot1(pf$volatile.acidity,"Volatile Acidity")
```
We can see the negative influence of volatile acidity in a wine's quality score.
## Relating different variables
The ggpairs output uses groups histograms for qualitative/qualitative variables and scatterplots for quantitative/quantitative variables in the lower triangle of the plot. In the upper triangle, it provides boxplots for the qualitative/quantitative pairs of variables, and correlation coefficients for quantitative/quantitative pairs.
```{r echo=FALSE, Bivariate_Plots7,message=FALSE, warning=FALSE}
set.seed(1234) # created random simple by using seed function
ggpairs(pf[sample.int(nrow(pf),1000),])+
theme(axis.text = element_blank())
```
By focusing on the pH column, I see that there could be a relationship between density and pH, as well as between pH and citric.acid. There is also a relationship between pH and fixed.acidity. The correlation between pH and these three variables is similar and always negative.
```{r echo=FALSE, Bivariate_Plots8,message=FALSE, warning=FALSE}
# I have created function to produce ggplots and then store them in the variable for later use in grid.arrange()
create_ggplot2 <- function(variable,p)
return(ggplot(aes(x = pH, y = variable), data = pf)+
geom_point(alpha = 1/5, position = position_jitter(h = 0)) +
coord_trans(x = "log10") +
geom_smooth(method = "lm", color = "red")+
ylab(p))
dens<-create_ggplot2(pf$density,"density")
citr.ac<-create_ggplot2(pf$citric.acid,"citric acid")
fix.ac<-create_ggplot2(pf$fixed.acidity,"fixed acidity")
grid.arrange(dens, citr.ac, fix.ac , ncol = 2)
```
The highest positive correlation, according to a ggpairs matrix, is between density and fixed.acidity, as well as between fixed.acidity and citric.acid.
```{r echo=FALSE, Bivariate_Plots9,message=FALSE, warning=FALSE}
# I have created function to produce ggplots and then store them in the variable for later use in grid.arrange()
create_ggplot3 <- function(variable,p)
return(ggplot(aes(x = fixed.acidity, y = variable), data = pf)+
geom_smooth(method = "lm", color = "yellow")+
geom_jitter(alpha = 1/5) +
ylab(p))
pos1<-create_ggplot3(pf$density,"density")
pos2<-create_ggplot3(pf$citric.acid,"citric acid")
grid.arrange(pos1, pos2, ncol = 2)
```
# Bivariate Analysis
### Talk about some of the relationships you observed in this part of the \investigation. How did the feature(s) of interest vary with other features in \the dataset?
Although the wine dataset features many different chemical attribute variables, most of them didn't show much of a relationship with the variable of interest, wine quality. In the initial correlation matrix, it appears that volatile acidity, sulphates, and alcohol have the greatest association with wine quality.
In the more detailed examination of each chemical attribute variable and how it varies with wine quality, some interesting patterns did emerge.
### Did you observe any interesting relationships between the other features /(not the main feature(s) of interest)?
Some of the strongest associations were between chemical attribute variables, and not the main feature of interest (wine quality). There was a strong positive relationship between citric acid and fixed acidity, and negative relationships between citric acid and volatile acidity, and citric acid and pH, density and pH. There were also relationships between fixed acidity and pH and fixed acidity and density.
### What was the strongest relationship you found?
The strongest relationship between any chemical attribute variable and the variable of interest was between alcohol and quality (r-squared value of 0.476). The strongest relationship between any two variables was between pH and fixed acidity (r-squared value of -0.68).
# Multivariate Plots Section
In the previous boxplots we have seen that increasing levels of both sulphates and alcohol increase a perceived quality of red wine. Now, I create a scatterplot to see wheather a combination of these two may help distinguish between different quality levels.
```{r echo=FALSE, multivariate_Plots,message=FALSE, warning=FALSE}
# I have created function to produce scatterplot ,limiting quantile from 0.01 to 0.99 and using scale_color_brewer with differentiating on quality.cat by colour
create_ggplot4 <- function(variable1,variable2,a,b,C)
return(ggplot(aes(x = variable1 , y = variable2 , colour = quality.cat), data = pf)+
geom_point(alpha = 0.8, size = 1) +
scale_color_brewer(palette = C) +
scale_x_continuous(lim=c(quantile((pf$variable1), 0.01),
quantile((pf$variable1), 0.99)))+
scale_y_continuous(lim=c(quantile(pf$variable2, 0.10),
quantile(pf$variable2, 0.90)))+
geom_smooth(aes(colour = quality.cat), se = FALSE)+
xlab(a)+
ylab(b))
create_ggplot4(pf$fixed.acidity, pf$citric.acid,"Fixed.acidity","Citric.acid","Greens")
```
citric.acid and fixed.acidity show strong relation with each other and we can observe most of the good quality wines on right top of plot that means for higher quantity of both the variables.
```{r echo=FALSE, Multivariate_Plots,message=FALSE, warning=FALSE}
create_ggplot4(pf$citric.acid, pf$volatile.acidity , "Citric.acid", "Volatile.acidity","Greys" )
```
As we can see from the plot majority of good quality wines lies in lower part of plot.
```{r echo=FALSE, Multivariate_Plots1,message=FALSE, warning=FALSE}
ggplot(aes(x = log10(sulphates), y = alcohol, colour = quality.factor),
data = pf) +
geom_point(alpha = 0.8, size = 1) +
geom_smooth(method = "lm", se = FALSE,size=1)+
scale_color_brewer(type='seq',
guide=guide_legend(title='Quality')) +
scale_x_continuous(lim=c(quantile(log10(pf$sulphates), 0.01),
quantile(log10(pf$sulphates), 0.99)))+
scale_y_continuous(lim=c(quantile(pf$alcohol, 0.01),
quantile(pf$alcohol, 0.99)))+
ggtitle('Alcohol and sulphates relation for different quality wines')+
xlab("potassium sulphate in g / dm3")+
ylab("Alcohol % by volume")
```
The plot reveals a clear pattern, showing most of darker dots (high-quality wine) in the place where both alcohol and sulphates level are high. There is also a visible range of 6 quality dots in the middle of the plot, and the zone of 4-5 quality dots in the bottom-left corner. This implies that such a combination of variables lets distinguish between different levels of medium-quality wines (5 and 6).
The previous plots show there is a positive corelation between the variables of density and fixed.acidity, so I create a dcatterplot to see wheather these two variables explain the quality changes well.
```{r echo=FALSE, Multivariate_Plots2,message=FALSE, warning=FALSE}
create_ggplot4(pf$fixed.acidity, pf$density,"Fixed.acidity","Density","Oranges")
```
Although the plot is not very clear, it reveals some patterns in presented data. It is visible here that the majority of the medium & bad quality wines are in upper part of plot and good quality wines are in lower part. Thus, this combination of variables may be useful to distinguish medium quality wine from the good quality.
Other way of distinguishing between quality of wine is by using alcohol and volatile.acidity variables
```{r echo=FALSE, Multivariate_plot3,message=FALSE, warning=FALSE}
create_ggplot4(pf$alcohol, pf$volatile.acidity,"Alcohol","Volatile.acidity","Purples")
```
## Regression Model
I mostly use combinations of two and more variables for the multiple regression model predicting the quality of red wine. First combination consists of all the variables that increase the quality with their increasing levels. Next combination is density and fixed.acidity as its visual representation implied its value for predicting the quality. Next goes volatile.acidity, as this variable has the highest negative correlation coefficient with the quality variable. And the last combination consists of pH, total.sulfur.dioxide and free.sulfur.dioxide, based on the last step of the previous EDA.
```{r echo=FALSE, regression_model}
m1 <- lm(quality ~ alcohol*sulphates*citric.acid*fixed.acidity, data = pf)
m2 <- update(m1, ~ . + density*fixed.acidity)
m3 <- update(m2, ~ . + volatile.acidity)
m4 <- update(m3, ~ . + pH*total.sulfur.dioxide*free.sulfur.dioxide)
mtable(m1, m2, m3, m4)
```
The given model explains 40% of cases in the given dataset. The highest R-squared = 0.333 is provided by the first combination of parameters (alcohol, sulphates, citric.acid, fixed.acidity). Next three sets of features add 0.01-0.03 to the previous R-squared value.
This model has limitations. It is based on the limited data that does not provide very high (more than 8) and very low (less than 3) quality scores. Collecting the data with more cases of extreme scores, as well as additional data with existing low-quality scores (3 and 4), could significantly improve the model's predictive power.
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the \investigation. Were there features that strengthened each other in terms of \looking at your feature(s) of interest?
I think that citric acid and fixed acidity strengthened each other when looking at wine quality, because both appear to be positively associated with wine quality. In the plot of citric acid vs. fixed acidity with the points colored by wine quality category, most of the higher-quality wines are found in the upper right corner of the plot.Similar pattern is observed in sulphates vs alcohol plot.
### Were there any interesting or surprising interactions between features?
I thought that the most interesting interaction was between alcohol and volatile acidity. Although both alcohol and volatile acidity are associated with wine quality, they don't have a very strong relationship with each other. However, when you plot them together you can see how wines of different qualities separate into different sides of the plot.
### OPTIONAL: Did you create any models with your dataset? Discuss the \strengths and limitations of your model.
I created a linear model that used all the variables except residual.sugar and chlorides as the predictor variables and wine quality as the outcome variable. The overall r-squared value for the model wasn't very high (0.4) but I think that it is still a useful model because it shows the relative importance of each of the predictor variables, as the r-squared value increases slightly with each addition. The most important predictor in the model is alcohol. I think that one of the strengths of the model is that it is clear and easy to interpret. The model is limited by the lack of variation in the dataset among the outcome variable, wine quality.
------
# Final Plots and Summary
### Plot One
```{r echo=FALSE, final_Plots,message=FALSE, warning=FALSE}
grid.arrange(p1up + ylab("Alcohol % by volume"), p4up + ylab("citric acid in g / dm^3"), p1d + ylab("volatile acidity (acetic acid - g / dm^3)"), p2d, ncol = 2)
```
### Description One
Alcohol and citric acid are two characteristics that increase a perceived quality of wine the most. pH and volatile acidity, on the contrary, reduce a perceived quality the most.
### Plot Two
```{r echo=FALSE, Plot_Two,message=FALSE, warning=FALSE}
ggplot(aes(x = log10(sulphates), y = alcohol, colour = quality.factor),
data = pf) +
geom_point(alpha = 0.8, size = 1) +
geom_smooth(method = "lm", se = FALSE,size=1)+
scale_color_brewer(type='seq',
guide=guide_legend(title='Quality')) +
scale_x_continuous(lim=c(quantile(log10(pf$sulphates), 0.01),
quantile(log10(pf$sulphates), 0.99)))+
scale_y_continuous(lim=c(quantile(pf$alcohol, 0.01),
quantile(pf$alcohol, 0.99)))+
ggtitle('Alcohol and sulphates relation for different quality wines')+
xlab("potassium sulphate in g / dm3")+
ylab("Alcohol % by volume")
```
### Description Two
The plot reveals a clear pattern, showing most of darker dots (high-quality wine) in the place where both alcohol and sulphates level are high. There is also a visible range of 6 quality dots in the middle of the plot, and the zone of 4-5 quality dots in the bottom-left corner. This implies that such a combination of variables lets distinguish between different levels of medium-quality wines (5 and 6).
### Plot Three
```{r echo=FALSE, Plot_Three,message=FALSE, warning=FALSE}
create_ggplot4(pf$alcohol, pf$volatile.acidity,"Alcohol % by volume","acetic acid in g / dm^3(volatile.acidity)","Purples")+ggtitle('Alcohol and volatile.acidity relation for different quality wines')
```
### Description Three
Wines with good quality has lesser volatile.acidity in them. white dots are in the upper part of plot which means bad quality wines has more volatile.acidity in them.
### summary
citric acid and fixed acidity strengthened each other when looking at wine quality, because both appear to be positively associated with wine quality.
The strongest relationship between any chemical attribute variable and the variable of interest was between alcohol and quality (r-squared value of 0.476). The strongest relationship between any two variables was between pH and fixed acidity (r-squared value of -0.68).Multiple regression model is able to explain up to 40% of existing cases in the dataset. Additional dataset with more data of extreme quality cases (both high and low-quality) should help improve the results of this model.
# Reflection
This dataset contained 1,599 observations of red wines. Each wine received a quality rating and had information on 11 different chemical attributes. My main interest in exploring this dataset was to try to learn about how the chemical attributes of a wine might be associated with its quality rating. Through a combination of graphical and statistical analysis, I was able to assess the different relationships between the predictor and outcome variables.
Although there were many different chemical attribute variables included in the dataset, few seemed to have a noticeable association with the wine quality rating. Of the variables included in the dataset, only three stood out: alcohol (r-squared value of 0.476), volatile acidity (r-squared value of -0.391), and sulphates (r-squared value of 0.251). All the variables except chlorides and residual.sugar were chosen for inclusion in a linear model (r-squared value of 0.400). The overall model had a lower r-squared value than the individual components due to the interactions between the variables.
One of the major challenges in this analysis was the limitations of the dataset. The variable of interest, wine quality, was an integer value measured on a scale of 0 to 10. However, the vast majority of the wines (1,319 out of 1,599) received a score of 5 or 6. Only 63 wines received a score of 3 or 4, and 217 wines received a score of 7 or 8. No wines received scores of 0, 1, 2, 9, or 10. Since the wine quality variable had such limited variability, it was difficult to assess the relationship between quality and the chemical attribute variables. Additional dataset with more data of extreme quality cases (both high and low-quality) should help improve the results of this model. Moreover, more sophisticated prediction models should be able to provide more accurate predictions for the quality of wine based on its chemical characteristics.
Aside from the issues with the wine quality variable, I really enjoyed working with this dataset. Having such complete data on so many different chemical attributes provided many opportunities to assess the interactions between different predictor variables as well as the variable of interest. It would be very interesting to combine this dataset with the white wine dataset to see if there are any associations that are similar or different between the two types of wine. It would be fascinating to see how the chemical attributes compare between red and white wines and if the same predictor variables are associated with wine quality.