This repository demonstrates how Principal Component Analysis (PCA) can be applied to pixel-based handwritten digit images.
The project focuses on dimensionality reduction, visual interpretation of components, and understanding how PCA transforms high-dimensional image data into a meaningful latent space.
It uses the classic handwritten digits dataset, and the workflow focuses on:
- Extracting pixel data from an 8×8 grayscale image grid
- Visualizing sample digits as heatmaps and images
- Reducing 64-dimensional pixel data to 2D and 3D PCA space
- Understanding variance explained by principal components
- Visualizing digit clusters based on PCA-compressed features
PCA is especially powerful in high-dimensional image tasks where each pixel is a feature.
- Name: Digits Dataset (Scikit-Learn)
- Source: Data set
- Records: 1,797 handwritten digits (0–9)
- Features:
  - 64 pixel intensity values (8×8 image)
  - number_label (target digit)
Each image is stored as a flattened array of 64 grayscale values.
Data is loaded from CSV containing pixel values and the digit label.
- Display dataset info
- Verify 64 pixel columns
- Confirm no missing values
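The loading and checking steps above can be sketched as follows. The CSV file itself is not shown here, so this sketch assumes an equivalent table rebuilt from scikit-learn's bundled digits data, with the label column named number_label as in this project.

```python
from sklearn.datasets import load_digits

# Assumption: stand-in for the project's CSV, rebuilt from scikit-learn's
# bundled digits data. With the real file, pandas.read_csv would be used.
digits = load_digits(as_frame=True)
df = digits.data.copy()
df["number_label"] = digits.target

# Display dataset info (column dtypes, non-null counts, memory usage)
df.info()

# Verify that exactly 64 pixel columns remain once the label is set aside
pixel_cols = df.columns.drop("number_label")
print("pixel columns:", len(pixel_cols))

# Confirm there are no missing values anywhere in the table
n_missing = int(df.isna().sum().sum())
print("missing values:", n_missing)
```

If the real CSV uses different column names, only the label column name needs to match for the rest of the workflow.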
Created a new DataFrame:
pixels = df.drop("number_label", axis=1)
Extracted a single digit (the first row), converted it to a NumPy array, and reshaped it into an 8×8 grid.
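A minimal sketch of that reshape step, assuming the pixel columns come from scikit-learn's bundled digits data rather than the project's CSV:

```python
from sklearn.datasets import load_digits

# Assumption: pixel columns taken from scikit-learn's bundled digits data,
# standing in for df.drop("number_label", axis=1).
pixels = load_digits(as_frame=True).data

# First row -> NumPy array of 64 values -> 8x8 grid (NumPy's default row-major order)
first_digit = pixels.iloc[0].to_numpy().reshape(8, 8)
print(first_digit.shape)
```

The reshape only makes sense if the 64 columns are stored row by row, which is how scikit-learn flattens these images.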
Three visualizations:
- matplotlib.pyplot.imshow (default colormap)
- matplotlib.pyplot.imshow (cmap='gray')
- seaborn.heatmap with pixel intensities
These help interpret the pixel intensities and confirm data structure.
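The three views could be produced along these lines. This is a sketch, not the project's exact plotting code: the sample digit comes from scikit-learn's bundled data, and the figure layout and titles are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits

# Assumption: first digit from scikit-learn's bundled data as the sample
grid = load_digits().images[0]  # 8x8 array of pixel intensities

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].imshow(grid)                # default colormap
axes[0].set_title("imshow (default)")

axes[1].imshow(grid, cmap="gray")   # grayscale rendering
axes[1].set_title("imshow (gray)")

# Heatmap with each pixel's intensity written in its cell
sns.heatmap(grid, annot=True, fmt=".0f", cmap="gray", cbar=False, ax=axes[2])
axes[2].set_title("seaborn heatmap")

fig.tight_layout()
```

Call plt.show() or fig.savefig with a file name of your choice to view the result.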
Used:
StandardScaler()
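Scaling matters because PCA is driven by variance, and the raw intensity spread differs across pixel positions: border pixels are almost always blank, while central pixels vary widely. Without scaling, the high-variance pixels would dominate the components.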
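A short sketch of the scaling step and its effect on per-pixel spread, assuming the pixel data comes from scikit-learn's bundled digits data:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

pixels = load_digits().data  # assumption: stand-in for the pixel DataFrame

# Standardize each of the 64 pixel columns to zero mean and unit variance
scaler = StandardScaler()
pixels_scaled = scaler.fit_transform(pixels)

# Compare the spread of the raw columns with the scaled ones
raw_std = pixels.std(axis=0)
scaled_std = pixels_scaled.std(axis=0)
print("raw std range:", raw_std.min(), "to", raw_std.max())
print("scaled std values:", np.unique(scaled_std.round(6)))
```

Pixel positions that are constant across the whole dataset cannot be rescaled to unit variance; StandardScaler leaves them at zero, so they contribute nothing to the components.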
Performed dimensionality reduction:
PCA(n_components=2)
- Projected all digits to 2D PCA space
- Visualized using a color-coded scatter plot (hue = digit label)
Many digits form distinct clusters even with only 2 components, although some classes overlap.
Explained variance of PC1 + PC2: ~21.59%
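The 2D projection and its explained variance can be sketched as below. The data source (scikit-learn's bundled digits), the variable names, and the plot styling are assumptions; on standardized pixels the printed share should land close to the figure reported above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: bundled digits data standing in for the project's CSV
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Project the 64 standardized pixel features onto the first two components
pca_2d = PCA(n_components=2)
coords_2d = pca_2d.fit_transform(X_scaled)

# Share of total variance captured by PC1 and PC2 together
ratio_2d = pca_2d.explained_variance_ratio_
print(f"PC1 + PC2 explained variance: {ratio_2d.sum():.2%}")

# Scatter plot colored by digit label
fig, ax = plt.subplots(figsize=(7, 6))
sns.scatterplot(x=coords_2d[:, 0], y=coords_2d[:, 1], hue=y,
                palette="tab10", legend="full", s=15, ax=ax)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
```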
Extended the model:
PCA(n_components=3)
- 3D scatter plot using Matplotlib's 3D axis
- Color-coded by digit label
This provides an even clearer separation for some digit classes.
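A sketch of the 3D extension, again assuming scikit-learn's bundled digits data and illustrative styling. A colorbar is used here to map colors to digit labels; a legend would work equally well.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: bundled digits data standing in for the project's CSV
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep the first three principal components
pca_3d = PCA(n_components=3)
coords_3d = pca_3d.fit_transform(X_scaled)

# 3D scatter plot on Matplotlib's 3D axis, colored by digit label
fig = plt.figure(figsize=(8, 7))
ax = fig.add_subplot(projection="3d")
points = ax.scatter(coords_3d[:, 0], coords_3d[:, 1], coords_3d[:, 2],
                    c=y, cmap="tab10", s=10)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
fig.colorbar(points, ax=ax, label="digit label")
```

With an interactive backend, the axis can be rotated to inspect how individual digit classes separate along the third component.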
- numpy
- pandas
- seaborn
- matplotlib
- scikit-learn (PCA, StandardScaler)
pip install -r requirements.txt
or directly:
pip install numpy pandas seaborn matplotlib scikit-learn
Run the script to generate all visualizations.
- The 2D projection captures ~21.6% of the variance.
- Despite low variance percentage, digits form recognizable clusters.
- Demonstrates PCA’s ability to compress images while retaining structure.
- The 3D projection gives better separation of digits.
- Useful for interactive visualization and cluster analysis.
Each digit's raw pixel row can be reshaped back into an 8×8 grid to visually confirm the sample.
- PCA projections reveal clusters of handwritten digits even with limited components
- 64-dimensional pixel data can be projected into 2D and 3D while keeping visible class structure
- Variance captured by early components contains meaningful structure
- PCA is suitable for visualization, preprocessing, and noise reduction
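One way to probe the compression and noise-reduction point is to project the digits down and map them back with inverse_transform, then measure what was lost. This is a sketch under assumptions: reconstruction_error is a hypothetical helper, the component counts are arbitrary, and scaling is skipped so the error stays in raw pixel units.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # assumption: raw 64-pixel rows from the bundled data

def reconstruction_error(n_components):
    """Mean squared error after compressing to n_components and mapping back."""
    pca = PCA(n_components=n_components)
    X_restored = pca.inverse_transform(pca.fit_transform(X))
    return float(np.mean((X - X_restored) ** 2))

# Compare a very aggressive compression with a milder one
err_2 = reconstruction_error(2)
err_20 = reconstruction_error(20)
print("MSE with 2 components:", err_2)
print("MSE with 20 components:", err_20)
```

Keeping more components lowers the reconstruction error, while discarding the trailing components removes low-variance detail, which is the basis for using PCA as a simple denoiser.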
This project demonstrates how PCA transforms high-dimensional pixel data into a low-dimensional latent space:
- Meaningful digit clusters emerge even with 2 or 3 components
- Useful for visualization, feature extraction, and preprocessing for ML models
- Highlights the power of dimensionality reduction on image datasets
PCA remains a foundational tool in exploratory data analysis for image-based machine learning tasks.
Author: Ali
Field: Data Science & Machine Learning Student
Email: ali.hz87980@gmail.com
GitHub: ali-119