We first consider the relevance of the dataset and the project. Our initial goal is to focus on the given features and look for relationships among them; we will then go deeper to identify the factors that matter most for business purposes. In many countries, soft and hard drinks are still a major product, or even the only one, sustaining the country's GDP. Writers and physicians have often argued that wine has a place in our daily life. Here is one such remark:
“Wine makes daily living easier, less hurried, with fewer tensions and more tolerance.”
--- Benjamin Franklin
The red wine industry has also grown rapidly in recent years as social drinking is on the rise, so industry players are using product quality certifications to promote their products. Certification, however, is time-consuming and relies on assessments by human experts, which makes the process expensive and complex. We will also see that, being a real-world dataset, this one differs from many other datasets. Industries set the price of a product according to demand and customer appreciation. For red wine this is particularly sensitive because it depends entirely on customer preference: the price rests on a rather abstract notion of wine appreciation by wine tasters, whose opinions can vary widely. Another vital factor in red wine certification and quality assessment is physicochemical testing, which is laboratory-based and considers factors such as acidity, pH level, sugar, and other chemical properties. It would be of interest to the red wine market if the quality perceived in human tasting could be related to a wine's chemical properties, so that certification, quality assessment, and quality assurance become more controlled. This project aims to determine which features are the best indicators of red wine quality and to generate insights into how each of these factors contributes to quality in our model.
Now, a brief overview of the Red Wine Quality Dataset.
Our Red Wine Quality Data Set is available on Kaggle, where it is published from the UCI Machine Learning Repository. The dataset contains a total of 12 variables, which were recorded for 1,599 observations. This data will allow us to create different regression models to determine how the independent variables help predict our dependent variable, quality. Knowing how each variable affects red wine quality will help producers, distributors, and businesses in the red wine industry better assess their production, distribution, and pricing strategies.
The main aim of the red wine quality dataset is to predict which physicochemical features make a good wine. It provides 11 input variables and 1 output variable (quality), and the problem is clearly explained in the Kaggle repository. Let us examine the role of each of these features:
Fixed Acidity: non-volatile acids that do not evaporate readily
Volatile Acidity: the amount of acetic acid in the wine; high levels lead to an unpleasant vinegar taste
Citric Acid: acts as a preservative and increases acidity. In small quantities, it adds freshness and flavor to wines
Residual Sugar: the amount of sugar remaining after fermentation stops. The key is a good balance between sweetness and sourness. Note that wines with more than 45 g/L are considered sweet
Chlorides: the amount of salt in the wine
Free Sulfur Dioxide: prevents microbial growth and the oxidation of wine
Total Sulfur Dioxide: the amount of free plus bound forms of SO2
Density: sweeter wines have a higher density
pH: describes the level of acidity on a scale of 0–14. Most wines fall between 3 and 4 on the pH scale
Alcohol: the alcohol content of the wine; in small quantities it makes drinkers more sociable
Sulphates: a wine additive that contributes to SO2 levels and acts as an antimicrobial and antioxidant
Quality: the output variable (the target to be predicted), on a scale from 0 to 10; in the dataset, the wines have qualities ranging from 3 to 8
Now we have a basic knowledge of the various factors that influence the quality of a wine.
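As a first look at how these features relate to quality, one option is to rank them by their correlation with the target. The sketch below assumes pandas and uses a tiny hand-made table; the numbers are invented purely for illustration and are not taken from the dataset. The helper name `rank_by_correlation` is hypothetical.

```python
import pandas as pd

def rank_by_correlation(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Return Pearson correlations with `target`, strongest (by absolute value) first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy values, made up for illustration only (not real measurements).
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "volatile acidity": [0.70, 0.66, 0.60, 0.45, 0.40, 0.30],
    "pH": [3.3, 3.5, 3.2, 3.4, 3.3, 3.4],
    "quality": [4, 5, 5, 6, 7, 7],
})

ranked = rank_by_correlation(toy)
print(ranked)
```

On the real data, the same call would be applied to the full table after loading it. Correlation only captures linear association, so it is a starting point rather than a final ranking of features.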
Our first step is to clean and prepare the data for analysis, which we do in several stages. First, we check the data types, distinguishing numerical from categorical columns, to simplify computing and visualizing correlations. Second, we identify any missing values in the dataset. Last, we examine each column's statistical summary to detect problems such as outliers and abnormal distributions.
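The three cleaning checks described above can be sketched with pandas as follows. This is a minimal illustration, not the exact procedure used in the project: the helper name `cleaning_report`, the 1.5 × IQR outlier rule, and the toy values are all assumptions made for the example.

```python
import pandas as pd

def cleaning_report(df: pd.DataFrame) -> dict:
    """Collect data types, missing-value counts, summary statistics, and outlier counts."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Assumed rule: flag values more than 1.5 * IQR outside the quartiles.
    too_low = numeric.lt(q1 - 1.5 * iqr, axis="columns")
    too_high = numeric.gt(q3 + 1.5 * iqr, axis="columns")
    return {
        "dtypes": df.dtypes,                # step 1: data types
        "missing": df.isna().sum(),         # step 2: missing values per column
        "summary": numeric.describe(),      # step 3: statistical summary
        "skew": numeric.skew(),             # step 3: asymmetry of each distribution
        "outliers": (too_low | too_high).sum(),
    }

# Toy values, made up for illustration only (not real measurements).
toy = pd.DataFrame({
    "pH": [3.2, 3.3, 3.4, 3.3, 3.5, None],
    "alcohol": [9.4, 9.8, 10.0, 9.5, 10.2, 14.9],
    "quality": [5, 5, 6, 5, 6, 7],
})

report = cleaning_report(toy)
print(report["missing"])
print(report["outliers"])
```

In this toy table the report flags one missing pH value and one unusually high alcohol value, which is the kind of issue the summary step is meant to surface.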
Data Exploration and Visualization: this helps in effectively interpreting each feature in the wine data
Train the algorithm: use Multivariable Regression and Random Forest Classification to identify patterns and relationships between the target and the features
Evaluate the models (Regression and Classification) using a few metrics:
a. Skew: a skewness close to zero indicates a roughly symmetric distribution, as a normal distribution has
b. MSE (Mean Squared Error): an absolute measure of fit; an MSE of 0 indicates a perfect fit
c. RMSE (Root Mean Squared Error): a measure of how accurately the model predicts the target, expressed in the target's own units
d. R-Squared: a relative measure of fit
e. Confusion Matrix (Accuracy, Precision, Recall)
BIC (Bayesian Information Criterion) is also used for model selection because it penalizes model complexity; the model with the lower BIC value is preferred.
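To make the evaluation metrics concrete, here is a small NumPy sketch that computes them from scratch. It is an illustration under stated assumptions: the function names are hypothetical, the BIC uses the common Gaussian-error form n·ln(RSS/n) + k·ln(n) (defined up to an additive constant), and the classification part assumes a binary "good" versus "not good" label. All example values are made up.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_params):
    """MSE, RMSE, R-squared, and a Gaussian-error BIC for a fitted regression."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    rss = float(np.sum((y_true - y_pred) ** 2))         # residual sum of squares
    tss = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    mse = rss / n
    return {
        "mse": mse,
        "rmse": mse ** 0.5,
        "r2": 1.0 - rss / tss,
        # Assumed BIC form; undefined when rss == 0 (a perfect fit).
        "bic": float(n * np.log(rss / n) + n_params * np.log(n)),
    }

def binary_classification_metrics(y_true, y_pred):
    """Confusion matrix counts plus accuracy, precision, and recall."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    return {
        "confusion": [[tn, fp], [fn, tp]],  # rows: actual, columns: predicted
        "accuracy": (tp + tn) / y_true.size,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up quality scores and predictions, for illustration only.
reg = regression_metrics([5, 6, 5, 7, 6, 5], [5.2, 5.8, 5.4, 6.5, 6.1, 4.9], n_params=2)
clf = binary_classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(reg)
print(clf)
```

When comparing two regression models fitted to the same observations, the one with the lower `bic` would be preferred, in line with the rule stated above.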
- Note: Here, we only do exploratory data analysis and produce some visuals from which important information can easily be inferred.