This project aims to analyze the chemical properties and sensory quality assessments of white wine varieties produced in a specific region of Portugal. The objective is to explore the relationship between these properties and to identify clusters of similar wines using partitioning clustering techniques. This analysis will help in understanding how chemical properties influence wine quality and can contribute to more objective wine certification and quality assurance processes.
The dataset used in this project (whitewine_v6.xls) consists of 2700 white wine samples. Each sample has been tested for 12 attributes, including 11 physicochemical properties and 1 sensory quality rating. The physicochemical properties are continuous variables, while the quality rating is an ordinal variable ranging from 1 (worst) to 10 (best).
- fixed acidity: Non-volatile acids in wine.
- volatile acidity: Acetic acid content; high levels lead to a vinegar taste.
- citric acid: Adds freshness and flavor to wines.
- residual sugar: Sugar remaining after fermentation.
- chlorides: Salt content in the wine.
- free sulfur dioxide: Prevents microbial growth and oxidation.
- total sulfur dioxide: Total SO2 content.
- density: Wine density, influenced by alcohol and sugar content.
- pH: Acidity/basicity scale (0-14).
- sulphates: Contributes to SO2 levels.
- alcohol: Alcohol content percentage.
- quality: Sensory quality score (1-10).
The project is divided into two main subtasks:
- Pre-processing:
  - Scaling the data.
  - Outlier detection and removal.
- Determine the Number of Clusters:
  - Using four automated tools: NbClust, elbow, gap statistic, and silhouette methods.
- K-means Clustering:
  - Perform k-means analysis with the chosen number of clusters.
  - Evaluate clustering using the BSS/TSS ratio, BSS, and WSS indices.
- Silhouette Analysis:
  - Provide the silhouette plot and average silhouette width score.
- Principal Component Analysis (PCA):
  - Reduce the dimensionality of the dataset.
  - Select principal components with cumulative variance > 85%.
- Determine the Number of Clusters for PCA Data:
  - Using the same four automated tools.
- K-means Clustering on PCA Data:
  - Perform k-means analysis with the chosen number of clusters.
  - Evaluate clustering using the BSS/TSS ratio, BSS, and WSS indices.
- Silhouette Analysis for PCA Data:
  - Provide the silhouette plot and average silhouette width score.
- Calinski-Harabasz Index:
  - Evaluate clustering quality using this index.
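The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the project's actual script: it runs on synthetic stand-in data rather than the wine workbook, uses the 1.5 * IQR rule for outliers (the outlier rule is not specified here), and replaces the automated tools with a hand-rolled elbow and silhouette scan so that it needs only base R and the `cluster` package. The variable names are illustrative.

```r
library(cluster)  # silhouette()

set.seed(42)
# Synthetic stand-in for the 11 physicochemical columns: two shifted groups.
x <- rbind(matrix(rnorm(150 * 11, mean = 0), ncol = 11),
           matrix(rnorm(150 * 11, mean = 3), ncol = 11))
colnames(x) <- paste0("v", 1:11)

# Pre-processing: scale each column, then drop rows holding any value
# beyond the boxplot whiskers (1.5 * IQR rule, an assumption).
x_scaled <- scale(x)
is_outlier <- function(v) v %in% boxplot.stats(v)$out
keep <- !apply(apply(x_scaled, 2, is_outlier), 1, any)
x_clean <- x_scaled[keep, ]

# Number of clusters: total WSS (elbow) and average silhouette width per k.
ks <- 2:6
d <- dist(x_clean)
wss_by_k <- sapply(ks, function(k) {
  kmeans(x_clean, centers = k, nstart = 25)$tot.withinss
})
sil_by_k <- sapply(ks, function(k) {
  cl <- kmeans(x_clean, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
k <- ks[which.max(sil_by_k)]  # pick the k with the widest average silhouette

# K-means with the chosen k, plus the BSS/TSS ratio and WSS.
km <- kmeans(x_clean, centers = k, nstart = 25)
bss_tss <- km$betweenss / km$totss
wss <- km$tot.withinss

# Silhouette analysis; plot(sil) would draw the silhouette plot.
sil <- silhouette(km$cluster, d)
avg_sil <- mean(sil[, "sil_width"])

# PCA: keep the leading components whose cumulative variance exceeds 85%.
pca <- prcomp(x_clean)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pc <- which(cum_var > 0.85)[1]
x_pca <- pca$x[, 1:n_pc, drop = FALSE]

# K-means on the reduced data (same k reused here for brevity; the
# workflow above re-runs the cluster-number tools on the PCA data).
km_pca <- kmeans(x_pca, centers = k, nstart = 25)

# Calinski-Harabasz index from its definition:
# (BSS / (k - 1)) / (WSS / (n - k)).
ch <- (km_pca$betweenss / (k - 1)) /
  (km_pca$tot.withinss / (nrow(x_pca) - k))
```

On the real data, `x` would be the 11 physicochemical columns of the workbook. The `fpc` package provides `calinhara()` as an alternative to computing the index by hand.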
Ensure you have the following R packages installed:
install.packages(c("cluster", "factoextra", "NbClust", "readxl", "fpc"))
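Once the packages are in place, checking the installation, loading the workbook, and running the four cluster-number tools might look like the sketch below. It assumes that whitewine_v6.xls sits in the working directory and that its first 11 columns are the physicochemical properties, with quality in the 12th; neither is confirmed here. The data-dependent part is skipped if the file or a package is missing.

```r
pkgs <- c("cluster", "factoextra", "NbClust", "readxl", "fpc")

# Report which packages are available without attaching them.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (!all(installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

path <- "whitewine_v6.xls"  # assumed location: the working directory
if (installed[["readxl"]] && file.exists(path)) {
  wine <- readxl::read_excel(path)
  # Assumption: columns 1-11 are the properties, column 12 is quality.
  x <- scale(as.matrix(wine[, 1:11]))

  if (installed[["factoextra"]]) {
    # Elbow, silhouette, and gap statistic plots for k-means.
    print(factoextra::fviz_nbclust(x, kmeans, method = "wss"))
    print(factoextra::fviz_nbclust(x, kmeans, method = "silhouette"))
    print(factoextra::fviz_nbclust(x, kmeans, method = "gap_stat"))
  }
  if (installed[["NbClust"]]) {
    # Majority vote across NbClust's indices for k between 2 and 10.
    nb <- NbClust::NbClust(x, distance = "euclidean",
                           min.nc = 2, max.nc = 10, method = "kmeans")
  }
}
```

Note that the package name is case-sensitive: it is `NbClust`, not `NBclust`.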