---
title: "Wine Quality Prediction"
author: "Tanmay Ambegaokar, Bhumika Mallikarjun, Kundana Chowdary Cherukuri, Jagadeeshwar Kalyanapu"
date: "`r Sys.Date()`"
# date: ""
output:
  rmdformats::readthedown:
    # code_folding: hide
    # number_sections: false
    # df_print: "kable"
    # toc: yes
    # toc_depth: 3
    # toc_float: yes
    toc_float: true
    toc_depth: 3
    number_sections: false
    code_folding: hide
    df_print: "kable"
  # pdf_document:
  #   toc: yes
  #   toc_depth: '3'
---
```{r init, include=F}
library(ezids)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(knitr)
```
```{r setup, include=FALSE}
# some of common options (and the defaults) are:
# include=T, eval=T, echo=T, results='hide'/'asis'/'markup',..., collapse=F, warning=T, message=T, error=T, cache=T, fig.width=6, fig.height=4, fig.dim=c(6,4) #inches, fig.align='left'/'center','right',
# knitr::opts_chunk$set(warning = F, results = "markup", message = F)
knitr::opts_chunk$set(warning = F, message = F)
options(scientific=T, digits = 3)
# options(scipen=9, digits = 3)
```
## Introduction
### Abstract
Wine making is an intricate process, with many chemical and physical variables influencing the quality and taste of the final product. According to Grand View Research, the United States wine market was valued at an estimated $63.69 billion in 2021. As wine consumers ourselves, our team wanted to better understand the key drivers of wine quality using data-driven techniques.
Prior studies have examined the chemical properties of wines, but few have conducted extensive statistical analysis to relate these properties to overall quality ratings from wine experts. Our goal was to bridge this gap using exploratory data analysis and statistical modeling.
This analysis sheds light on the factors that contribute to the quality of both red and white wines. It examines the chemical components that make up each type of wine and how each one relates to the character of the resulting wine. Understanding these relationships is valuable for wine drinkers who prefer certain styles of wine.
### Why did we choose this topic?
We believe that deciphering the relationship between wine quality and chemical makeup will benefit wine producers and distributors. With a better understanding of how properties such as density and chlorides affect wine quality, winemakers can adjust their production methods to consistently meet or surpass consumer expectations.
Consumers can also benefit from this analysis. They can learn which chemical properties are associated with better quality, look for wines that have them, and judge whether the price they are paying is commensurate with the quality.
----------------------------------------------------------------------------------------------
## Data Description
The dataset includes two tables, one for red wine and one for white wine. Both tables have the same variables, so a single data description covers both. The number of columns and rows for each table (type of wine) is given below:

1. Wine quality-red: 12 columns x 1,599 rows
2. Wine quality-white: 12 columns x 4,898 rows
```{r head, echo=FALSE}
wine <- read.csv("winequalityN.csv")
xkabledplyhead(wine)
```
The different columns present are:
**Categorical Variable**
1. Type - Colour of wine (Red or White)
**Numerical Variables**
2. Fixed Acidity - Concentration of non-volatile tartaric acid in the wine
3. Volatile Acidity - Concentration of volatile acetic acid in the wine.
4. Citric Acid - Concentration of citric acid in the wine.
5. Residual Sugar - Concentration of sugar remaining after the fermentation in the wine.
6. Chlorides - Concentration of sodium chloride in the wine.
7. Free Sulfur Dioxide - Concentration of free, gaseous sulfur dioxide in the wine.
8. Total Sulfur Dioxide - Total concentration of sulfur dioxide in the wine.
9. Density - Density of the wine.
10. pH - Acidity of the wine.
11. Sulphates - Concentration of potassium sulfate in the wine.
12. Alcohol - Alcohol content of the wine.
**Target Variable**
13. Quality - Wine quality score as assessed by experts.
We use quality as our target and have built our analysis around it. Quality scores in the data range from 3 to 9, where *3 - Bad Quality* and *9 - Good Quality*; everything in between is treated as *Average Quality* wine.
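
The three quality labels described above can be expressed as a small helper. This is only a sketch: `label_quality` is a hypothetical function that is not part of our analysis code, and the scores below are toy values rather than rows from the dataset.

```r
# Hypothetical helper: map integer quality scores to the three labels
# used in the text (3 = Bad, 9 = Good, everything in between = Average).
label_quality <- function(quality) {
  factor(
    ifelse(quality <= 3, "Bad", ifelse(quality >= 9, "Good", "Average")),
    levels = c("Bad", "Average", "Good")
  )
}

scores <- c(3, 5, 6, 7, 9)   # toy scores, not taken from the dataset
labels <- label_quality(scores)
print(table(labels))
```

On the real data, the same helper could be applied to `wine$quality` to group the plots by category instead of by raw score.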
----------------------------------------------------------------------------------------------
## Research Questions - SMART QUESTIONS!
1. Which chemical attribute has the most **significant** influence on wine quality?
2. How does the distribution of sugar, sulfate, chloride, pH, and alcohol content change over **different quality categories**?
3. How do these relationships **differ between red and white wine?**
The target variable for our research project is : **QUALITY**
----------------------------------------------------------------------------------------------
## Evolution of Questions
Before we began our analysis as data scientists, we asked ourselves, based on our previous experience of tasting wine, what in general would affect the overall quality. We guessed that the amount of alcohol, the sweetness of the wine, or perhaps the acidity could be influential factors. Keeping these thoughts in mind, we wanted to see whether our initial assumptions were true or whether the analysis would reveal something more interesting, and to determine to what extent these factors affect the quality.
----------------------------------------------------------------------------------------------
## Data Cleaning
The data we have right now is unclean: it contains NA values and duplicate rows. The first step of any data science project is to clean the data.
Right now, we have **6497 rows and 13 variables.**
Steps to clean the data:
1. Remove NA values
```{r cleanna,echo=FALSE}
wine <- na.omit(wine)
```
2. Remove duplicate values.
```{r clean, echo=FALSE}
wine <- unique(wine)
```
The clean data has **5295 rows and 13 variables**. It is free from NA values and duplicates and now can be used for further exploratory data analysis.
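
The two cleaning steps can be traced on a small example that reports the row count after each step. This is a minimal sketch on made-up rows; the data frame `toy` is an assumption for illustration and stands in for the `wine` table.

```r
# Toy data standing in for the wine table (values are made up for illustration).
toy <- data.frame(
  type    = c("red", "red", "white", "white", "white"),
  alcohol = c(9.4, 9.4, NA, 10.1, 11.2),
  quality = c(5, 5, 6, 6, 7)
)

n_raw     <- nrow(toy)
no_na     <- na.omit(toy)     # step 1: drop rows containing any NA
n_no_na   <- nrow(no_na)
cleaned   <- unique(no_na)    # step 2: drop exact duplicate rows
n_cleaned <- nrow(cleaned)

cat("raw:", n_raw, " after na.omit:", n_no_na, " after unique:", n_cleaned, "\n")
```

Printing the counts this way makes it easy to see how many rows each step removes.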
----------------------------------------------------------------------------------------------
## Summary Statistics
Now that we have a clean dataset, let's look at its summary statistics.
```{r summary, echo=FALSE}
xkablesummary(wine)
```
A quick look at the summary table tells us that:
1. Type is the only character column. Quality is stored as an integer score, although we treat it as a category in the plots that follow.
2. The remaining eleven columns are numeric chemical measurements.
3. For most of our columns, the **Mean > Median**, which suggests that the data has a positive skew.
4. Most characteristics have some significant outliers, as the maximum value is much larger than the third quartile.
Our next step is to visualize these using plots.
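
The two readings of the summary table (mean above median, maximum far above the third quartile) can be checked numerically. The sketch below uses a made-up right-skewed sample, not a column of the wine data, and flags outliers with the usual 1.5 × IQR rule that boxplots also use.

```r
# Toy right-skewed sample (made-up values, not from the wine data).
x <- c(1.2, 1.5, 1.8, 2.0, 2.1, 2.6, 3.0, 4.5, 9.0, 20.0)

mean_x   <- mean(x)
median_x <- median(x)

# Tukey's rule: flag values more than 1.5 * IQR above the third quartile.
q3          <- unname(quantile(x, 0.75))
upper_fence <- q3 + 1.5 * IQR(x)
outliers    <- x[x > upper_fence]

cat("mean:", mean_x, " median:", median_x, " upper fence:", upper_fence, "\n")
print(outliers)
```

A mean well above the median, together with values beyond the upper fence, is the pattern we describe for most of the wine variables.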
----------------------------------------------------------------------------------------------
## Exploratory Data Analysis (EDA)
## Univariate Analysis
### **1. Quality**
```{r quality , echo=FALSE }
ggplot(data = wine, aes(x = quality)) +
geom_bar(width = 0.8, color = 'black', fill = I('yellow')) +
labs(
title = "Overall Wine Quality",
x = "Quality",
y = "Data - Red & white wine"
)
```
**Observations**
1. Wine quality shows a rather **symmetrical distribution.**
2. Most wines have a **quality score of 6.**
3. No wine achieved the highest score of 10 and the worst wines got a rating of 3.
### **2. Acidity**
```{r univariate2, echo=FALSE}
p7 <- ggplot(data = wine, aes(x = fixed.acidity)) +
geom_bar( fill = I('blue')) +
labs(
title = "Fixed Acidity",
x = "TaOH Concentration [g/L]",
y = "Data"
)
p8 <- ggplot(data = wine, aes(x = volatile.acidity)) +
geom_bar( fill = I('blue')) +
labs(
title = "Volatile Acidity",
x = "AcOH Concentration [g/L]",
y = "Data"
)
p1 <- ggplot(data = wine, aes(x = citric.acid)) +
geom_bar(fill = I('blue')) +
labs(
title = "Citric Acidity",
x = "Concentration [g/L]",
y = "Data"
)
p2 <- ggplot(data = wine, aes(x = pH)) +
geom_bar( fill = I('blue')) +
labs(
title = "pH",
x = "pH",
y = "Data"
)
grid.arrange(p7,p8, p1,p2, nrow = 2)
```
```{r boxplotsacid, echo=FALSE}
p1 <- ggplot(data = wine, aes(x = "", y = fixed.acidity )) +
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Fixed Acidity",
y = "TaOH Concentration [g/L]"
)
p2 <- ggplot(data = wine, aes(x = "", y = volatile.acidity)) +
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Volatile Acidity",
y = "AcOH Concentration [g/L]"
)
p3 <- ggplot(data = wine, aes(x = "", y = citric.acid)) +
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Citric Acidity",
y = "Concentration [g/L]"
)
p4 <- ggplot(data = wine, aes(x = "", y = pH))+
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "pH",
y = "pH"
)
grid.arrange(p1,p2, p3,p4, nrow = 1)
```
**Observations**
1. The boxplots of the acidity parameters give a similar picture to the bar charts. One can see the long positive tails of the fixed and volatile acid concentrations.
2. Citric acid has a **bimodal distribution.**
3. The pH level of the wines is approximately normally distributed, with a median of about 3.2.
### **3. Sulphates**
```{r sulphates, echo=FALSE}
p8 <- ggplot(wine, aes(x = sulphates)) +
geom_histogram(binwidth = 0.01, fill = I('blue')) +
labs(
title = "Sulphates",
x = "Concentration (g/L)",
y = "Data"
)
p9 <- ggplot(wine, aes(x = free.sulfur.dioxide)) +
geom_histogram(binwidth = 2, fill = I('blue')) +
labs(
title = "Free Sulfur Dioxide Concentration",
x = "Concentration (mg/L)",
y = "Data"
)
p10 <- ggplot(wine, aes(x = total.sulfur.dioxide)) +
geom_histogram(binwidth = 10, fill = "blue") +
labs(
title = "Total Sulfur Dioxide Concentration",
x = "Concentration (mg/L)",
y = "Data"
)
grid.arrange(p8, p9,p10, nrow = 2)
```
```{r boxplotsulphates, echo=FALSE}
p1 <- ggplot(data = wine, aes(x = "", y = free.sulfur.dioxide ))+
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Free Sulfur Dioxide",
y = "Concentration [mg/L]"
)
p2 <- ggplot(data = wine, aes(x = "", y = total.sulfur.dioxide)) +
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Total Sulfur dioxide",
y = "Concentration [mg/L]"
)
p3 <- ggplot(data = wine, aes(x = "", y = sulphates)) +
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Sulphates",
y = "Concentration [g/L]"
)
grid.arrange(p1, p2,p3, nrow = 2)
```
**Observations**
1. Free sulfur dioxide concentration is narrowly centered around 30 mg/L.
2. Total sulfur dioxide concentration shows signs of bimodality with peaks around 20 and 120 mg/L.
3. Most wines have a sulphate concentration around 0.5 g/L. Two small outlier groups around 1.6 and 1.9 g/L can be seen in the boxplot.
### **4. Sugar, Alcohol, Density, Chlorides Plots**
```{r leftover, echo=FALSE}
p3 <- ggplot(data = wine, aes(x = residual.sugar)) +
geom_histogram(binwidth = 1, fill = I('blue')) +
labs(
title = "Residual Sugar",
x = "Residual Sugar (g/L)",
y = "Data"
)
p4 <- ggplot(data = wine, aes(x = density)) +
geom_histogram(binwidth = 0.002, fill = I('blue')) +
labs(
title = "Density",
x = "Density",
y = "Data"
)
p5 <- ggplot(data = wine, aes(x = chlorides)) +
geom_histogram(binwidth = 0.005, fill = I('blue')) +
labs(
title = "Chlorides",
x = "Chloride Content (g/L)",
y = "Data"
)
p6 <- ggplot(data = wine, aes(x = alcohol)) +
geom_histogram(binwidth = 1, fill = I('blue')) +
labs(
title = "Alcohol Content",
x = "Alcohol Content (% by volume)",
y = "Data"
)
grid.arrange(p3, p4,p5,p6, nrow = 2)
```
```{r boxplotleftover, echo=FALSE}
p1 <- ggplot(data = wine, aes(x = "", y = residual.sugar))+
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Residual Sugar",
y = "Concentration [g/L]"
)
p2 <- ggplot(data = wine, aes(x = "", y = density)) +
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Density",
y = "Density [g/cm^3]"
)
p3 <- ggplot(data = wine, aes(x = "", y = alcohol)) +
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Alcohol",
y = "vol%"
)
p4 <- ggplot(data = wine, aes(x = "", y = chlorides)) +
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Chloride",
y = "Concentration [g/L]"
)
grid.arrange(p1, p2,p3,p4, nrow = 2)
```
**Observations**
1. Generally, the wines in the data set have low residual sugar concentrations. The positive skew pulls the mean (5.4) above the median (3.0). An extreme outlier can be found around 65 g/L residual sugar.
2. The density parameter shows a very narrow distribution with low variation. One can see a few outliers around 1.01 and 1.04 g/cm³, but most wines have a density between 0.99 and 1.00 g/cm³.
3. The histogram of chloride concentration has two distinct main peaks. The most frequent chloride concentrations are around 0.04 g/L, and the second peak appears at about 0.08 g/L. The distribution has a very long tail in the positive direction, with outliers up to 0.6 g/L.
4. The alcohol content of the wines in the data set ranges between 8 and 15 vol%. The median lies around 10 vol%. The distribution is rather wide and shows a positive skew.
----------------------------------------------------------------------------------------------
## Correlation Matrix
Before we get into more detailed analysis, we want to see which variables are most strongly related to the quality of the wines. For this, we computed a correlation matrix.
```{r cor, echo=FALSE}
# Subset the dataset into red and white wines
red <- wine[wine$type == "red", ]
white <- wine[wine$type == "white", ]
numeric_data <- subset(wine, select = -c(type))
loadPkg("corrplot")
cor_matrix <-cor(numeric_data)
corrplot(cor_matrix, method="circle",type="upper")
```
Our target variable is **Quality** and hence, we will explore the variables that are most correlated with Quality.
From the correlation matrix we can see that the 4 most correlated variables are:
1. Alcohol
2. Density
3. Chlorides
4. Citric Acid
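
Reading the strongest correlations off a plot can be cross-checked by sorting the correlations with the target directly. The sketch below is an assumption-laden illustration: `rank_correlations` is a hypothetical helper that is not part of our analysis code, and `toy` contains made-up values rather than the wine data.

```r
# Rank numeric columns by absolute Pearson correlation with a target column.
# `rank_correlations` is a hypothetical helper, not part of the original analysis.
rank_correlations <- function(df, target = "quality") {
  num <- df[sapply(df, is.numeric)]        # keep numeric columns only
  r   <- cor(num)[, target]                # correlations with the target
  r   <- r[names(r) != target]             # drop the target's self-correlation
  r[order(abs(r), decreasing = TRUE)]      # strongest relationships first
}

# Toy data with a known structure (made-up values).
toy <- data.frame(
  type    = c("red", "white", "red", "white", "red", "white"),
  alcohol = c(9, 10, 11, 12, 13, 14),
  density = c(1.000, 0.999, 0.997, 0.998, 0.995, 0.994),
  noise   = c(0.3, 0.1, 0.4, 0.1, 0.5, 0.2),
  quality = c(4, 5, 5, 6, 7, 8)
)

ranked <- rank_correlations(toy)
print(round(ranked, 2))
```

Calling `rank_correlations(wine)` on the cleaned data would list every chemical attribute in order of its absolute correlation with quality, keeping the sign so that negative relationships such as density remain visible.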
----------------------------------------------------------------------------------------------
## Bivariate Analysis
The correlation plot suggests that alcohol content, density, citric acid and chlorides are the variables most strongly related to quality. Let us see how, and also compare red and white wine individually.
### **1. Alcohol vs Quality**
```{r alcqual, echo=FALSE}
library(ggplot2)
ggplot(wine, aes(x = as.factor(quality), y = alcohol)) +
geom_boxplot(fill = "brown", color = "darkblue") +
labs(x = "Wine Quality", y = "Alcohol") +
ggtitle("Alcohol vs Quality")
```
```{r colalc,echo=FALSE}
library(ggplot2)
red <- subset(wine, type == "red")
white <- subset(wine, type == "white")
p1 <-ggplot() +
geom_boxplot(data = red, aes(x = as.factor(quality), y = alcohol, fill = "red"), width = 0.4) +
labs(x = "Wine Quality", y = "Alcohol", fill = "Wine Color") +
ggtitle("Box Plot of Alcohol vs Quality for Red Wines") +
scale_fill_manual(values = c("red" = "red"))
p2<-ggplot() +
geom_boxplot(data = white, aes(x = as.factor(quality), y = alcohol, fill = "white"), width = 0.4) +
labs(x = "Wine Quality", y = "Alcohol", fill = "Wine Color") +
ggtitle("Box Plot of Alcohol vs Quality for White Wines") +
scale_fill_manual(values = c("white" = "white"))
grid.arrange(p1,p2,nrow=2)
```
**Observations**
1. White Wines have **higher** alcohol content.
2. Alcohol has a **strong positive correlation** with quality.
3. The boxplot shows that wines with **higher quality** seem to have **higher alcohol content.**
### **2. Density vs Quality**
```{r density, echo=FALSE}
library(ggplot2)
# Density Distribution by Wine Quality
ggplot(wine, aes(x = factor(quality), y = density)) +
geom_boxplot(fill = "lightblue") +
labs(x = "Quality", y = "Density") +
ggtitle("Density Distribution by Wine Quality")
```
```{r denqual, echo=FALSE}
# Box plot of density vs quality for red wines
p1 <- ggplot() +
geom_boxplot(data = red, aes(x = as.factor(quality), y = density, fill = "red"), width = 0.4) +
labs(x = "Wine Quality", y = "Density", fill = "Wine Color") +
ggtitle("Box Plot of Density vs Quality for Red Wines") +
scale_fill_manual(values = c("red" = "lightcoral"))
# Box plot of density vs quality for white wines
p2 <- ggplot() +
geom_boxplot(data = white, aes(x = as.factor(quality), y = density, fill = "white"), width = 0.4) +
labs(x = "Wine Quality", y = "Density", fill = "Wine Color") +
ggtitle("Box Plot of Density vs Quality for White Wines") +
scale_fill_manual(values = c("white" = "white"))
grid.arrange(p1, p2, nrow = 2)
```
**Observations:**
1. Red Wines are **more dense** than white wines.
2. Density has a **negative correlation** with quality.
3. The boxplot shows that wines with **higher quality** seem to be **less dense.**
### **3. Chlorides vs Quality**
```{r chlorides, echo=FALSE}
ggplot(wine, aes(x = factor(quality), y = chlorides)) +
geom_boxplot(fill = "lightcoral") +
labs(x = "Quality", y = "Chlorides") +
ggtitle("Chloride Distribution by Wine Quality") +
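# Zoom with coord_cartesian(): ylim() would drop values outside the limits before the boxplot statistics are computed
coord_cartesian(ylim = c(0, 0.2))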
```
```{r chloralcohol, echo=FALSE}
p1 <-ggplot(red, aes(x = factor(quality), y = chlorides)) +
geom_boxplot(fill = "yellow") +
labs(x = "Quality", y = "Chlorides") +
ggtitle("Chloride Distribution by Wine Quality (Red Wine)") +
coord_cartesian(ylim = c(0, 0.3))
p2<-ggplot(white, aes(x = factor(quality), y = chlorides)) +
geom_boxplot(fill = "coral") +
labs(x = "Quality", y = "Chlorides") +
ggtitle("Chloride Distribution by White Wine Quality") +
coord_cartesian(ylim = c(0, 0.2))
grid.arrange(p1,p2,nrow=2)
```
**Observations:**
1. Red wines have a higher chloride concentration than white wines.
2. Chloride concentration has a **slight negative correlation** with quality.
3. The boxplot shows that wines with **higher quality** seem to have **fewer chlorides.**
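
When a boxplot's y-axis is restricted, as in the chloride plots above, the way the limit is applied matters. In ggplot2, `ylim()` sets scale limits, so values outside the range are removed before the boxplot statistics are computed, whereas `coord_cartesian()` only zooms the view. The sketch below illustrates the difference on made-up values; `toy` is an assumption for illustration, not the wine data.

```r
library(ggplot2)

# Toy values (made up): half of them lie above the 0.2 cut-off.
toy <- data.frame(
  quality   = factor(rep(5, 8)),
  chlorides = c(0.02, 0.04, 0.06, 0.08, 0.30, 0.40, 0.50, 0.60)
)

base <- ggplot(toy, aes(x = quality, y = chlorides)) + geom_boxplot()

# ylim() sets scale limits: values outside [0, 0.2] become NA and are
# dropped before the boxplot statistics are computed.
p_clip <- base + ylim(0, 0.2)

# coord_cartesian() only zooms the view; the statistics use every observation.
p_zoom <- base + coord_cartesian(ylim = c(0, 0.2))

# layer_data() returns the computed boxplot statistics for the first layer.
median_clip <- suppressWarnings(layer_data(p_clip)$middle)
median_zoom <- layer_data(p_zoom)$middle

cat("median with ylim():", median_clip, "\n")
cat("median with coord_cartesian():", median_zoom, "\n")
```

The two medians differ because the first plot summarises only the four values that survive the limit, so conclusions drawn from a `ylim()`-restricted boxplot describe a truncated sample.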
### **4. Citric Acid vs Quality**
```{r citric, echo=FALSE}
ggplot(wine, aes(x = factor(quality), y = citric.acid)) +
geom_boxplot(fill = "lightpink") +
labs(x = "Quality", y = "Citric Acid") +
ggtitle("Citric Acid Distribution by Wine Quality")+
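# Zoom with coord_cartesian(): ylim() would drop values outside the limits before the boxplot statistics are computed
coord_cartesian(ylim = c(0, 1))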
```
```{r citricqual, echo=FALSE}
p1 <-ggplot(red, aes(x = factor(quality), y = citric.acid)) +
geom_boxplot(fill = "green") +
labs(x = "Quality", y = "Citric Acid") +
ggtitle("Citric Acid Distribution by Red Wine Quality")+
coord_cartesian(ylim = c(0, 1))
p2<- ggplot(white, aes(x = factor(quality), y = citric.acid)) +
geom_boxplot(fill = "maroon") +
labs(x = "Quality", y = "Citric Acid") +
ggtitle("Citric Acid Distribution by White Wine Quality")+
coord_cartesian(ylim = c(0, 1))
grid.arrange(p1,p2,nrow=2)
```
**Observations:**
1. White wines have a higher citric acid concentration than red wines.
2. Citric acid concentration has a **slight positive correlation** with quality.
3. There isn't much difference in citric acid concentration in white wines across the quality ratings.
4. The boxplot shows that wines with **higher quality** seem to have **higher citric acid content.**
----------------------------------------------------------------------------------------------
## Multivariate Analysis
For the last part of our EDA, we draw some multivariate plots to see how some of the other, less important features are distributed in red and white wine.
```{r multi, echo=FALSE}
library(ggplot2)
p1<-ggplot(wine, aes(x = residual.sugar, y = density, color = factor(type))) +
geom_point() +ggtitle("Scatter plot for Density vs Sugar ") +
labs(x = "Sugar", y = "Density", color = "Wine Type")
p2<-ggplot(wine, aes(x = alcohol, y = density, color = factor(type))) +
geom_point() + ggtitle("Scatter plot for Density vs Alcohol ") +
labs(x = "Alcohol", y = "Density", color = "Wine Type")
p3<- ggplot(wine, aes(x = alcohol, y = chlorides, color = factor(type))) +
geom_point() + ggtitle("Scatter plot for Chlorides vs Alcohol ") +
labs(x = "Alcohol", y = "Chlorides", color = "Wine Type")
p4<-ggplot(wine, aes(x = sulphates, y = residual.sugar, color = factor(type))) +
geom_point() +ggtitle("Scatter plot for Sulphates vs Sugar ") +
labs(x = "Sulphates", y = "Residual Sugar", color = "Wine Type")
grid.arrange(p1,p2,p3,p4,nrow=2)
```
**Observations:**
1. White wines have a higher sugar concentration than red wines. This might explain why white wines are usually sweeter.
2. Red wines contain higher chloride and sulphate concentrations.
----------------------------------------------------------------------------------------------
## Statistical Tests
We have performed a few statistical tests to support our analysis.
### **1. Alcohol vs Quality**
**WELCH TWO SAMPLE T-TEST:**
NULL HYPOTHESIS (H0): There is no significant difference in mean alcohol content between red and white wine.
ALTERNATE HYPOTHESIS (H1): There is a significant difference in mean alcohol content between the two wines.
```{r alct, echo=FALSE}
# conf.level = 0.95 corresponds to a 5% significance level
t_test_alcohol <- t.test(red$alcohol, white$alcohol, conf.level = 0.95)
p <- t_test_alcohol$p.value
```
We found that the **p-value: `r p`** is less than 0.05, so we **reject the null hypothesis** and conclude that there is a significant difference in mean alcohol concentration between red and white wine.
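
The decision rule used in these t-tests can be written out explicitly. The sketch below runs a Welch two-sample t-test on made-up samples; `red_alcohol` and `white_alcohol` are assumptions standing in for `red$alcohol` and `white$alcohol`. Note that the confidence level in `t.test()` is set through the `conf.level` argument.

```r
# Toy samples (made-up values) standing in for red$alcohol and white$alcohol.
red_alcohol   <- c(9.4, 9.8, 10.0, 9.5, 10.5, 9.9, 10.2, 9.7)
white_alcohol <- c(10.8, 11.2, 10.5, 11.9, 12.1, 10.9, 11.4, 11.0)

# var.equal = FALSE (the default) gives the Welch test;
# conf.level = 0.95 corresponds to a 5% significance level.
res <- t.test(red_alcohol, white_alcohol, var.equal = FALSE, conf.level = 0.95)

alpha <- 0.05
cat("method:", res$method, "\n")
cat("p-value:", format.pval(res$p.value, digits = 3), "\n")
cat("reject H0 at alpha = 0.05:", res$p.value < alpha, "\n")
```

The same pattern applies to the density, chloride and citric acid comparisons that follow.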
**ANOVA TEST FOR RED AND WHITE WINE:**
NULL HYPOTHESIS (H0): There is no significant difference in mean alcohol content across quality categories in red wine/white wine.
ALTERNATE HYPOTHESIS (H1): There is a significant difference in mean alcohol content across quality categories in red wine/white wine.
We carried out a separate test for each wine. The p-values for both red and white wine are less than 0.05, so we **reject the null hypothesis** and conclude that mean alcohol content differs across quality categories for both wines.
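
A one-way ANOVA of this kind can be sketched as follows. The data frame `toy` holds made-up values for three quality categories and is an assumption for illustration; on the real data the model would be fitted separately on the `red` and `white` subsets.

```r
# Toy data (made-up values): alcohol content for three quality categories.
toy <- data.frame(
  quality = factor(rep(c(5, 6, 7), each = 5)),
  alcohol = c(9.4, 9.6, 9.5, 9.8, 9.7,
              10.4, 10.6, 10.2, 10.8, 10.5,
              11.6, 11.9, 11.4, 12.0, 11.7)
)

# One-way ANOVA: does mean alcohol differ across quality categories?
fit   <- aov(alcohol ~ quality, data = toy)
p_val <- summary(fit)[[1]][["Pr(>F)"]][1]

cat("ANOVA p-value:", format.pval(p_val, digits = 3), "\n")

# A significant F-test only says that at least one category mean differs;
# Tukey's HSD shows which pairs of categories differ.
print(TukeyHSD(fit))
```

Quality must be treated as a factor here; leaving it as an integer would fit a linear trend rather than compare category means.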
### **2. Density vs Quality**
**WELCH TWO SAMPLE T-TEST:**
NULL HYPOTHESIS (H0): There is no significant difference in mean density between red and white wine.
ALTERNATE HYPOTHESIS (H1): There is a significant difference in mean density between the two wines.
```{r densityt,echo=FALSE}
# conf.level = 0.95 corresponds to a 5% significance level
t_test_density <- t.test(red$density, white$density, conf.level = 0.95)
p <- t_test_density$p.value
```
We found that the **p-value: `r p`** is less than 0.05, so we **reject the null hypothesis** and conclude that there is a significant difference in mean density between red and white wine.
**RED AND WHITE WINE ANOVA TEST:**
NULL HYPOTHESIS (H0): There is no significant difference in mean density across quality categories in red wine/white wine.
ALTERNATE HYPOTHESIS (H1): There is a significant difference in mean density across quality categories in red wine/white wine.
We carried out separate ANOVA tests for each wine. The p-values for both red and white wine are less than 0.05, so we **reject the null hypothesis** and conclude that mean density differs across quality categories for both wines.
### **3. Chloride vs Quality**
**WELCH TWO SAMPLE T-TEST:**
NULL HYPOTHESIS (H0): There is no significant mean difference in chloride concentration between red and white wine.
ALTERNATE HYPOTHESIS (H1): There is a significant mean difference between the two wines.
```{r chloridet,echo=FALSE}
# conf.level = 0.95 corresponds to a 5% significance level
t_test_chloride <- t.test(red$chlorides, white$chlorides, conf.level = 0.95)
p <- t_test_chloride$p.value
```
We found that the **p-value: `r p`** is less than 0.05, so we **reject the null hypothesis** and conclude that there is a significant difference in mean chloride concentration between red and white wine.
**ANOVA TEST FOR RED AND WHITE WINE:**
NULL HYPOTHESIS (H0): There is no significant difference in mean chloride concentration across quality categories in red wine/white wine.
ALTERNATE HYPOTHESIS (H1): There is a significant difference in mean chloride concentration across quality categories in red wine/white wine.
We carried out separate ANOVA tests for each wine. The p-values for both red and white wine are less than 0.05, so we **reject the null hypothesis** and conclude that mean chloride concentration differs across quality categories for both wines.
### **4. Citric acid vs Quality**
**WELCH TWO SAMPLE T-TEST:**
NULL HYPOTHESIS (H0): There is no significant mean difference in citric acid content between red and white wine.
ALTERNATE HYPOTHESIS (H1): There is a significant mean difference between the two wines.
```{r citrict, echo=FALSE}
# conf.level = 0.95 corresponds to a 5% significance level
t_test_citric <- t.test(red$citric.acid, white$citric.acid, conf.level = 0.95)
p <- t_test_citric$p.value
```
We found that the **p-value: `r p`** is less than 0.05, so we **reject the null hypothesis** and conclude that there is a significant difference in mean citric acid concentration between red and white wine.
**ANOVA TEST FOR RED AND WHITE WINE:**
NULL HYPOTHESIS (H0): There is no significant difference in mean citric acid content across quality categories in red wine/white wine.
ALTERNATE HYPOTHESIS (H1): There is a significant difference in mean citric acid content across quality categories in red wine/white wine.
We carried out separate tests for each wine. The p-value for red wine is less than 0.05, whereas the p-value for white wine is greater than 0.05. Hence we **reject the null hypothesis** for red wine only and conclude that mean citric acid content differs across quality categories in red wine, while we find no such evidence for white wine.
----------------------------------------------------------------------------------------------
## Results
1. Wines with **higher alcohol content, increased citric acid levels, lower density, and less chlorides** tend to exhibit **higher quality**.
2. It appears that **white wines**, in general, tend to be **sweeter** and have **higher alcohol content** when compared to their red counterparts.
3. **Red wines** have a *higher* concentration of **sulphates and chlorides**.
----------------------------------------------------------------------------------------------
## Limitations of the Dataset
Although we are confident in our analysis, we have to mention a few imbalances in the data that may have influenced the results.
1. There are considerably more observations for white wine than for red wine.
2. Most of the observations fall in the average quality range, i.e. scores from 4 to 8.
For a better analysis, we need more balanced data.
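
Both imbalances can be quantified with a simple cross-tabulation. The sketch below uses a handful of made-up rows; `toy` is an assumption for illustration and stands in for the cleaned `wine` table.

```r
# Toy rows (made up) standing in for the cleaned wine table.
toy <- data.frame(
  type    = c("red", "white", "white", "white", "red", "white", "white", "white"),
  quality = c(5, 5, 6, 6, 6, 7, 6, 3)
)

# Counts per wine type and per quality score.
counts <- table(toy$type, toy$quality)
print(counts)

# Share of each wine type in the data.
type_share <- prop.table(table(toy$type))
print(round(type_share, 2))
```

Running the same two calls on `wine$type` and `wine$quality` would show how heavily the data leans towards white wines and mid-range scores.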
----------------------------------------------------------------------------------------------
## Conclusions
Through our exploration of wine quality, we came across findings that challenge some common winemaking beliefs. While **alcohol content, sugar levels and acidity** are well-known factors, we found that attributes like **density and chlorides** also play a role in shaping the quality of wine. These elements contribute in ways that captivate our senses and define what truly makes a bottle of wine.
Moreover, we examined the contrast between red and white wines, going beyond mere color differences to uncover deeper variations in chemical composition and sensory profiles. By understanding these distinctions, wine enthusiasts and consumers can make choices based on their personal preferences and specific occasions. Our findings are supported by statistical tests: the **t-test** to check whether there is a *significant difference in each factor between red and white wine*, and **ANOVA tests** to find how these factors *vary across quality ratings within red and white wine*. Together they provide a framework for understanding the interplay of variables that influence wine quality.
Equipped with this knowledge, let us raise our glasses to the fusion of data science and winemaking artistry. We also hope that readers use this knowledge to buy great wines.
----------------------------------------------------------------------------------------------
## Additional Insights
Although our present investigation has been fascinating, we understand that predicting wine quality is an ongoing endeavour. We are working to add a wider range of attributes that affect wine quality to our dataset. This is a deliberate step towards a more comprehensive understanding of the winemaking process, not just an increase in data. By incorporating a wider variety of variables, we want to build a predictive model that captures more of the detail that contributes to wine quality.
We hope to improve our predictive model in this ongoing investigation to deliver even more thorough and accurate results.
----------------------------------------------------------------------------------------------
## References
The data set and its original analysis are cited below:
1. Cortez, P., Teixeira, J., Cerdeira, A., Almeida, F., Matos, T., Reis, J. (2009). Using Data Mining for Wine Quality Assessment. In: Gama, J., Costa, V.S., Jorge, A.M., Brazdil, P.B. (eds) Discovery Science. DS 2009. Lecture Notes in Computer Science, vol 5808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04747-3_8
2. https://www.kaggle.com/datasets/yasserh/wine-quality-dataset/data
----------------------------------------------------------------------------------------------