
🍷 Wine Quality Prediction Model

🎯 Predict whether red wines are Good 🟢 or Bad 🔴 using Machine Learning!

A machine learning project to predict wine quality (good/bad) based on physicochemical properties using multiple classification algorithms.


📋 Project Overview

This project implements a binary classification system to predict whether red wines are of good or bad quality. By analyzing 11 physicochemical features, we train and compare four different machine learning algorithms to identify the best performer.

🔑 Key Information

| Item | Details |
| --- | --- |
| 📊 Dataset | Portuguese Red Wine Quality Dataset |
| 🎯 Target | Binary classification (Bad ≤ 6.5 / Good > 6.5) |
| 🔍 Features | 11 physicochemical properties |
| 🤖 Models | 4 classification algorithms compared |
| 📈 Approach | EDA → Feature Engineering → Model Comparison |

📌 Quick Stats

  • 🍇 1,599 wine samples
  • 🔬 11 features analyzed
  • ✅ 0 missing values (clean data!)
  • 🎲 4 algorithms tested

📁 Project Structure

🍷 Wine Quality prediction/
├── 📓 Wine.ipynb                 # Main Jupyter notebook (complete analysis)
├── 📊 winequality.csv            # Dataset file (1,599 samples)
└── 📄 README.md                  # This interactive guide

🎬 Project Workflow

🔍 Step 1: Exploratory Data Analysis (EDA)

Understand the data before modeling!

  • 📥 Load and inspect the dataset
  • 🔄 Transform the quality rating into a binary label (see the sketch after this list)
  • 📊 Visualize distributions and relationships
  • 🔗 Analyze correlations and multicollinearity
  • ❓ Check for missing values and data quality
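
💻 For example, the 0–10 quality score becomes a binary label with one line of pandas. A minimal sketch, assuming the CSV loads into a DataFrame named `wine` with a `quality` column (the notebook's actual variable names may differ):

```python
import pandas as pd

# Load the dataset (assumes winequality.csv sits next to the notebook)
wine = pd.read_csv("winequality.csv")

# Binarize the 0-10 quality score: Bad (quality <= 6.5) -> 0, Good (> 6.5) -> 1
wine["good"] = (wine["quality"] > 6.5).astype(int)

print(wine["good"].value_counts())  # quick look at class balance
```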

📈 Key Visualizations:

  • 📊 Quality Distribution → Count plot showing balance
  • 🔀 Pairwise Relationships → Pair plot for feature interactions
  • 📉 Feature Distributions → Histograms for each feature
  • 🔥 Correlation Heatmap → Feature-to-feature relationships
  • 📍 Target Correlation → Which features matter most
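
As an example of these plots, here is a minimal sketch of the correlation heatmap, reusing the `wine` DataFrame from the loading step:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Feature-to-feature correlation heatmap (one of the EDA plots listed above)
plt.figure(figsize=(10, 8))
sns.heatmap(wine.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()
```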

🛠️ Step 2: Feature Engineering & Preparation

Prepare data for machine learning!

| Step | Action | Details | Purpose |
| --- | --- | --- | --- |
| 1️⃣ | 📋 Data partitioning | Separate X (features) and y (target) | Clean data structure |
| 2️⃣ | 🎲 Train-test split | 80% training / 20% testing | Unbiased evaluation |
| 3️⃣ | ⚖️ Feature scaling | StandardScaler (μ=0, σ=1) | Fair algorithm comparison |
| 4️⃣ | 📉 Dimensionality reduction | PCA (11D → 4D) | Faster training, less noise |
| 5️⃣ | 🌲 Importance baseline | Random Forest | Identify key features |

💡 Data Leakage Prevention: Scaler fit on training data ONLY, then applied to test data!
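
Put together, Steps 1–4 look roughly like the sketch below (Step 5's Random Forest baseline appears in the model section). Variable names are illustrative, but the leakage-safe order — split first, then fit the scaler and PCA on the training split only — matches the project's approach:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 1. Partition features (X) and target (y)
X = wine.drop(columns=["quality", "good"])
y = wine["good"]

# 2. 80/20 train-test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

# 3. Scale: fit on TRAINING data only, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 4. PCA: 11 features -> 4 components, again fit on training data only
pca = PCA(n_components=4).fit(X_train_s)
X_train_p = pca.transform(X_train_s)
X_test_p = pca.transform(X_test_s)
```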

🤖 Step 3: Model Building & Evaluation

Train and compare 4 algorithms!

1️⃣ Logistic Regression 📈

  • 🚀 Fast and interpretable
  • ✅ Perfect for baseline comparison
  • 📊 Linear decision boundaries
  • ⚡ Best when: Data is linearly separable

2️⃣ SVM (Linear Kernel) 📐

  • 💪 Margin-based approach
  • 🎯 Effective in high dimensions
  • 📏 Linear decision surfaces
  • ⚡ Best when: Clear linear patterns exist

3️⃣ SVM (RBF Kernel) 🌀

  • 🔮 Captures non-linear patterns
  • 🎨 Most versatile kernel
  • 🧠 Complex decision boundaries
  • ⚡ Best when: Patterns are non-linear (likely winner!)

4️⃣ Random Forest 🌳🌲🌳

  • 🌲 Ensemble of 100 trees
  • 🛡️ Robust to outliers
  • 📊 Feature importance included
  • ⚡ Best when: Balance & robustness needed
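
In code, training the four contenders can be as simple as the sketch below (hyperparameter settings are illustrative; it reuses the PCA-reduced splits from Step 2):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# The four classifiers compared in this project (settings are illustrative)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (Linear)": SVC(kernel="linear", random_state=5),
    "SVM (RBF)": SVC(kernel="rbf", random_state=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=5),
}

for name, model in models.items():
    model.fit(X_train_p, y_train)  # PCA-reduced training data from Step 2
```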

📊 Evaluation Metrics

Compare models using 4 key metrics! 📈

| Metric | Formula | What It Means | Use When |
| --- | --- | --- | --- |
| ✅ Accuracy | (TP+TN)/(TP+TN+FP+FN) | Overall correctness | Classes are balanced |
| 🎯 Precision | TP/(TP+FP) | Positive prediction accuracy | False positives are costly |
| 🔍 Recall | TP/(TP+FN) | True positive catch rate | False negatives are costly |
| ⚖️ F1 Score | 2(P×R)/(P+R) | Balanced performance | You want both precision & recall |

📋 Legend

  • 🟢 TP = True Positive (correctly predicted good wine)
  • 🔴 TN = True Negative (correctly predicted bad wine)
  • ⚠️ FP = False Positive (predicted good, actually bad)
  • ❌ FN = False Negative (predicted bad, actually good)
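
To make the formulas concrete, here is a tiny worked example using hypothetical confusion-matrix counts (not results from this project):

```python
# Hypothetical counts: 40 true positives, 250 true negatives,
# 10 false positives, 20 false negatives
TP, TN, FP, FN = 40, 250, 10, 20

accuracy  = (TP + TN) / (TP + TN + FP + FN)          # 290/320 ≈ 0.906
precision = TP / (TP + FP)                           # 40/50  = 0.800
recall    = TP / (TP + FN)                           # 40/60  ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.727
```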

🎚️ Performance Scale

🌟 EXCELLENT    > 0.85 across all metrics
⭐ VERY GOOD    0.75 - 0.85
✅ GOOD         0.65 - 0.75
⚠️  FAIR         0.55 - 0.65
❌ POOR         < 0.55

🚀 How to Use

📋 Prerequisites

✓ Python 3.7 or higher
✓ Jupyter Notebook or JupyterLab
✓ All required libraries (see below)

📦 Installation

Step 1️⃣: Navigate to Project Directory

cd "Wine Quality prediction"

Step 2️⃣: Install Required Packages

```bash
pip install numpy pandas matplotlib seaborn scikit-learn jupyter
```

What Each Package Does:

  • 🔢 NumPy → Fast numerical computing
  • 📊 Pandas → Data manipulation
  • 📈 Matplotlib → Plotting and visualization
  • 🎨 Seaborn → Beautiful statistical graphics
  • 🤖 Scikit-learn → Machine learning algorithms
  • 📓 Jupyter → Interactive notebook environment

Step 3️⃣: Launch Jupyter Notebook

```bash
jupyter notebook Wine.ipynb
```

▶️ Execution Steps

  1. 🎬 Run All Cells → Cell → Run All (or Kernel → Restart & Run All)
  2. 📖 Read Explanations → Detailed markdown in each section
  3. 👀 Examine Outputs → Graphs, tables, metrics
  4. 🏁 View Results → Model comparison summary

📈 What You'll Learn

Data Science Workflow - Full ML pipeline from raw data to models
EDA Techniques - Visualize and understand data
Feature Engineering - Scale, transform, and reduce dimensions
Algorithm Comparison - Pros/cons of each model
Model Evaluation - Interpret metrics correctly
Best Practices - Prevent data leakage, validate properly


🔬 Key Findings

📊 Data Characteristics

🍇 Total Samples:        1,599 wine records
🔬 Features:             11 physicochemical properties
✅ Missing Values:        NONE (excellent data quality!)
⚖️  Class Distribution:   Analyzed for balance

⭐ Feature Importance Insights

  • 🥇 Most Important: Alcohol content (often highest predictor)
  • 🥈 Very Important: Volatile acidity, sulfates
  • 📊 Moderate: pH, density, citric acid
  • 📉 Less Important: Some features can be removed

🎯 Model Performance Summary

  • 🏆 Best Model: Identified from comparison table
  • 📊 Accuracy Range: Varies by algorithm
  • ⚖️ Trade-offs: Different precision/recall balances
  • 💡 Recommendation: Depends on deployment needs

🛠️ Technical Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| 🔤 Language | Python 3.7+ | Code implementation |
| 📓 Environment | Jupyter Notebook | Interactive development |
| 🔢 Numerics | NumPy | Fast numerical computing |
| 📊 Data Wrangling | Pandas | Data manipulation |
| 📈 Visualization | Matplotlib + Seaborn | Plotting & statistical graphics |
| 🤖 Machine Learning | Scikit-learn | Algorithms & preprocessing |

📑 Notebook Deep Dive

🧭 Navigation Guide

| Cell Group | What Happens | ⏱️ Time |
| --- | --- | --- |
| 🔧 Cells 1–2 | Import libraries, set seeds | 🚀 Fast |
| 📥 Cell 3 | Load dataset from CSV | 🚀 Fast |
| 🔄 Cells 4–8 | Transform target, analyze | ⚡ Quick |
| 📊 Cells 9–17 | EDA visualizations | 🐢 Slow (wait for plots) |
| 🔍 Cells 18–22 | Feature extraction | ⚡ Quick |
| 📐 Cells 23–30 | Preprocessing pipeline | ⚡ Quick |
| 🤖 Cells 31–38 | Train all 4 models | ⚡ Quick |
| 📈 Cell 39 | View results & summary | 🚀 Fast |

📋 Complete Cell Breakdown

📥 Data Loading Section

Cell 1-3: Load libraries & data
├─ Import NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn
├─ Load winequality.csv
└─ Display first rows

🔄 Data Transformation Section

Cell 4-8: Transform and explore
├─ Convert quality (0-10) → binary (0/1)
├─ Visualize class distribution
├─ Check for missing values
└─ Display available features

📊 Exploratory Analysis Section

Cell 9-17: Deep dive into data
├─ Statistical summaries
├─ Distribution histograms
├─ Pairwise relationships
├─ Correlation analysis
└─ Heatmap visualization

🛠️ Feature Engineering Section

Cell 18-30: Prepare for ML
├─ Split X (features) and y (target)
├─ Feature importance baseline
├─ Train-test split (80/20)
├─ Feature scaling (StandardScaler)
└─ PCA dimensionality reduction

🤖 Model Training Section

Cell 31-38: Train 4 algorithms
├─ Logistic Regression
├─ SVM (Linear Kernel)
├─ SVM (RBF Kernel)
└─ Random Forest (100 trees)

📈 Results & Summary Section

Cell 39: Compare all models
├─ Performance metrics table
├─ Interpretation guide
├─ Recommendations
└─ Next steps for improvement


🎓 Understanding the Results

🎯 Reading the Model Comparison Table

Final Output Table:
┌─────────────────────┬──────────┬───────────┬────────┬───────────┐
│ Model               │ Accuracy │ Precision │ Recall │ F1 Score  │
├─────────────────────┼──────────┼───────────┼────────┼───────────┤
│ Logistic Regression │   0.XX   │   0.XX    │  0.XX  │   0.XX    │
│ SVM (Linear)        │   0.XX   │   0.XX    │  0.XX  │   0.XX    │
│ SVM (RBF)           │   0.XX   │   0.XX    │  0.XX  │   0.XX    │ ⭐ Usually wins!
│ Random Forest       │   0.XX   │   0.XX    │  0.XX  │   0.XX    │
└─────────────────────┴──────────┴───────────┴────────┴───────────┘
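
A table like this can be assembled in a few lines of scikit-learn. A sketch, reusing the fitted `models` dictionary and the PCA-reduced test split from earlier (names are illustrative):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rows = []
for name, model in models.items():
    y_pred = model.predict(X_test_p)  # predictions on the unseen test split
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
    })

results = pd.DataFrame(rows).set_index("Model").round(2)
print(results)
```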

🔍 Interpretation Guide

🌟 Perfect Model (Hypothetical)

Accuracy:  0.95 ✓ Highly accurate
Precision: 0.95 ✓ Few false positives
Recall:    0.95 ✓ Few false negatives
F1 Score:  0.95 ✓ Excellent balance
→ Deploy immediately! 🚀

✅ Good Model (Realistic Goal)

Accuracy:  0.75-0.85 ✓ Most predictions correct
Precision: 0.75-0.85 ✓ Trustworthy positive predictions
Recall:    0.75-0.85 ✓ Catches most positives
F1 Score:  0.75-0.85 ✓ Solid all-around
→ Ready for production! 🎯

⚠️ Model Trade-offs

High Precision, Lower Recall:
• ✓ Conservative predictions (few mistakes)
• ✗ Misses some positives
• Use when: False positives are costly

Lower Precision, High Recall:
• ✓ Catches most positives
• ✗ More false alarms
• Use when: False negatives are costly

🔐 Data Quality Practices

✅ Implemented Best Practices

✓ PROPER TRAIN-TEST SPLIT
  • Test set created BEFORE any transformations
  • Model never sees test data
  • Unbiased performance evaluation

✓ NO DATA LEAKAGE
  • StandardScaler fit on training data only
  • Scaler parameters NOT from test set
  • Each model sees only "new" test data

✓ REPRODUCIBILITY
  • Fixed random_state (seed = 5)
  • Same results every run
  • Easy to debug and verify

✓ PROPER VALIDATION
  • Metrics calculated on unseen test set
  • Honest performance assessment
  • No overfitting tricks

💡 Why This Approach Works

🎲 Why Train-Test Split?

❌ Testing on training data → inflated, overfit scores!
✓ Testing on unseen data → honest generalization estimate!

📊 Why Feature Scaling?

Before scaling:      After scaling:
Alcohol: 5-15       Alcohol: -1.5 to +2.0
pH: 3-4             pH: -1.0 to +1.5

Benefits:
• ✓ Fair algorithm comparison
• ✓ Faster convergence
• ✓ Better numerical stability

📉 Why PCA Dimensionality Reduction?

11 Features → 4 Principal Components
• ✓ 64% fewer features
• ✓ 69% variance retained
• ✓ 4-5x faster training
• ✓ Removes multicollinearity
• ✗ Trade-off: 31% information loss
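
To check the variance figure on your own run, inspect the fitted PCA object from Step 2:

```python
# How much variance do the 4 components retain? (the README cites ~69%)
print(pca.explained_variance_ratio_)        # per-component share
print(pca.explained_variance_ratio_.sum())  # cumulative total, e.g. ~0.69
```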

🚀 Next Steps for Improvement

🎯 Quick Wins (Try These First!) ⭐

1. 🔍 HYPERPARAMETER TUNING (see the sketch after this list)
   └─ GridSearchCV on the best model
   └─ Find optimal parameters
   └─ Often worth a few extra points of accuracy

2. 🔄 CROSS-VALIDATION
   └─ K-fold validation (k=5)
   └─ More robust performance estimates
   └─ Verify results aren't fluky

3. 📉 FEATURE SELECTION
   └─ Remove redundant features
   └─ Faster training (less is more!)
   └─ Simpler, more interpretable models
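
Quick wins 1 and 2 combine naturally: GridSearchCV tunes hyperparameters with k-fold cross-validation built in. A hedged sketch on the RBF SVM (the grid values are illustrative, not tuned for this dataset):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid; widen or refine it based on early results
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.01, 0.1, 1],
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="f1")
search.fit(X_train_p, y_train)  # 5-fold CV runs inside the training split only

print(search.best_params_)
print(search.best_score_)  # mean F1 across the 5 folds
```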

📈 Medium-Term Enhancements (Advanced)

4. 🚀 ADVANCED ALGORITHMS
   └─ XGBoost (often wins competitions)
   └─ Gradient Boosting (very powerful)
   └─ Neural Networks (deep learning)

5. 🏗️ ENSEMBLE STACKING
   └─ Combine multiple models
   └─ Each model learns from others
   └─ Often beats single models

6. 🔧 FEATURE ENGINEERING
   └─ Create new features from existing ones
   └─ Polynomial features
   └─ Domain-specific features

🌍 Production-Ready (Long-term)

7. 🚀 DEPLOYMENT (see the sketch after this list)
   └─ Convert to REST API
   └─ Package with Flask/FastAPI
   └─ Cloud deployment (AWS/Google Cloud)

8. 📊 MONITORING
   └─ Track performance over time
   └─ Detect model drift
   └─ Alert on degradation

9. 🔄 RETRAINING PIPELINE
   └─ Automatic model updates
   └─ A/B testing new versions
   └─ Continuous improvement
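
For item 7, a hypothetical minimal Flask sketch shows the shape of such a service. Everything here (file name, route, payload format) is illustrative, and a real deployment would also persist and apply the scaler and PCA before predicting:

```python
# Hypothetical sketch: serve the trained model as a tiny REST API.
# Assumes the model was saved first, e.g. joblib.dump(best_model, "wine_model.joblib")
import joblib
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("wine_model.joblib")  # illustrative filename

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [<11 physicochemical values>]}
    features = np.array(request.json["features"]).reshape(1, -1)
    # NOTE: a real service would apply the saved scaler + PCA here first
    prediction = int(model.predict(features)[0])
    return jsonify({"good_wine": bool(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```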


🤝 Contributing

To improve this project:

  1. Test with new algorithms
  2. Optimize hyperparameters
  3. Enhance visualizations
  4. Add cross-validation
  5. Document findings


📚 Learning Resources

🎓 Scikit-learn Documentation

🧠 Machine Learning Concepts to Master

  • ✅ Binary Classification
  • ✅ Train-Test Split
  • ✅ Cross-Validation
  • ✅ Dimensionality Reduction
  • ✅ Feature Scaling/Normalization
  • ✅ Evaluation Metrics
  • ✅ Overfitting & Underfitting
  • ✅ Hyperparameter Tuning

❓ Troubleshooting

📦 "ModuleNotFoundError" when running cells?

```bash
pip install numpy pandas matplotlib seaborn scikit-learn
```

🐢 Cells running very slowly?

• The pair plot takes time on large datasets
• Reduce the dataset size for testing
• Use sampling: `wine.sample(500)`

📊 Visualizations not showing?

• Make sure `%matplotlib inline` is in the first code cell
• Restart the kernel and run all cells

🤖 Model performance surprisingly low?

• Check if data has been scaled properly
• Verify train-test split ratio (80/20)
• Review PCA variance retention

🏆 Best Practices Checklist

Before deploying:

  • ✅ All cells run without errors
  • ✅ No data leakage detected
  • ✅ Train-test split properly applied
  • ✅ Feature scaling/preprocessing correct
  • ✅ Model metrics recorded and interpreted
  • ✅ Visualizations clear and informative
  • ✅ Results reproducible (seed set)

🎯 Quick Reference: Command Cheat Sheet

```bash
# Install dependencies
pip install numpy pandas matplotlib seaborn scikit-learn jupyter

# Start Jupyter
jupyter notebook

# In Jupyter: Cell → Run All (or Ctrl+Shift+P → "Run All Cells" in JupyterLab)
```

📜 License & Attribution

This project is for educational purposes.


✨ Final Thoughts

"The best part of machine learning is seeing your model learn from data!" 🚀

💡 Key Takeaways

  • 🎯 Start simple, then go complex
  • 📊 Always visualize your data first
  • 🔍 Understand metrics, not just accuracy
  • 🛡️ Guard against data leakage religiously
  • 📈 Monitor performance constantly
  • 🤝 Share what you learn!

🎉 Congratulations!

You now have a complete, end-to-end ML pipeline!

Next Step: Deploy this model and change the world! 🌍✨


Last Updated: December 2025 📅
Status: ✅ Complete & Ready for Analysis
Version: 2.0 (Interactive Edition) 🎉
Emojis: 100+ 🌈
Enthusiasm Level: Maximum! 🚀
