A deep learning project that enables real-time facial emotion recognition and responds with matching emoji reactions. Built using CNNs, attention mechanisms (SE blocks), and Vision Transformers (ViTs), the project demonstrates the strengths of modern AI for human-computer interaction through facial expressions.
Facial emotion recognition is critical in applications like surveillance, healthcare, driver safety, and entertainment. This project implements and compares three architectures:
- A baseline CNN
- An SE-augmented attention CNN
- A hybrid CNN+Vision Transformer (ViT)
These models were trained and evaluated on FER2013 and a subset of AffectNet using techniques such as focal loss, data augmentation, and class weighting. Real-time inference uses OpenCV to overlay the detected emotion on live webcam input.
- 35,887 grayscale images (48x48 px)
- 7 emotions: Angry, Disgust, Fear, Happy, Neutral, Sad, Surprise
⚠️ Important:
Download the FER2013 dataset from this Kaggle link
Once downloaded, extract and place it inside your working directory like so:

    emojify/
    ┗ data/
      ┣ train/
      ┗ test/
- 12,815 RGB images
- Same 7 emotions (excluding “contempt”)
⚠️ Important:
Download the AffectNet dataset (subset) from this Kaggle link
Once downloaded, extract and place it inside your working directory like below, then delete the `contempt` folder from both the `train/` and `test/` subfolders:

    emojify/
    ┗ affdata/
      ┣ train/
      ┗ test/
All images were resized to 48x48, normalized, and augmented to improve training efficiency and model generalization.
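The normalization step can be sketched in NumPy (a minimal illustration, assuming the face crop has already been resized to 48x48 with OpenCV; `preprocess` is a hypothetical helper, not a function from this repo):

```python
import numpy as np

def preprocess(face: np.ndarray) -> np.ndarray:
    """Turn a 48x48 uint8 grayscale crop into a model-ready batch."""
    x = face.astype("float32") / 255.0   # scale pixel values to [0, 1]
    return x.reshape(1, 48, 48, 1)       # add batch and channel dimensions

# Dummy grayscale image standing in for a detected face crop
face = np.random.randint(0, 256, size=(48, 48), dtype=np.uint8)
batch = preprocess(face)
print(batch.shape)  # (1, 48, 48, 1)
```

Augmentation (flips, shifts, rotations) is applied on top of this during training.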
    pip install -r requirements.txt
This section outlines the functionality of each Python script in the project and the dataset it is based on.
- `gui_base_cnn.py` → Real-time facial emotion detection using the Base CNN model trained on FER2013.
- `gui_attn_cnn.py` → Real-time detection using the Attention-enhanced CNN (SE blocks) model trained on FER2013.
- `gui_cnn_vit.py` → Real-time detection using the CNN + Vision Transformer hybrid model trained on FER2013.
- `train_base_cnn.py` → Trains the Base CNN model on the FER2013 dataset.
- `train_attn_cnn.py` → Trains a CNN with Squeeze-and-Excitation attention on FER2013.
- `train_cnn_vit.py` → Trains the CNN + Vision Transformer hybrid model on FER2013.
- `train2_base_cnn.py` → Trains the Base CNN model on the AffectNet dataset.
- `train2_cnn_attn.py` → Trains a CNN with attention layers (multi-head attention) on AffectNet.
- `train2_cnn_vit.py` → Trains the CNN + Vision Transformer hybrid model on AffectNet.
- 3 convolutional layers + max pooling
- Dense layer (1024 units) + Softmax
- ~5M parameters
- Adds Squeeze-and-Excitation (SE) blocks
- Emphasizes important facial features
- ~6.2M parameters
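The mechanism behind an SE block can be illustrated with a small NumPy sketch (illustrative only; the actual model uses trainable Keras layers, and the dense weights below are random stand-ins):

```python
import numpy as np

def se_block(feature_map: np.ndarray, reduction: int = 16) -> np.ndarray:
    """Squeeze-and-Excitation: reweight channels by global context.

    feature_map: (H, W, C) activations from a conv layer.
    """
    h, w, c = feature_map.shape
    rng = np.random.default_rng(0)
    # Random stand-ins for the two learned dense layers
    w1 = rng.standard_normal((c, c // reduction))
    w2 = rng.standard_normal((c // reduction, c))

    squeeze = feature_map.mean(axis=(0, 1))        # global average pool -> (C,)
    hidden = np.maximum(squeeze @ w1, 0.0)         # ReLU bottleneck
    scale = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # sigmoid weights in (0, 1)
    return feature_map * scale                     # rescale each channel

fmap = np.random.default_rng(1).standard_normal((12, 12, 64))
out = se_block(fmap)
print(out.shape)  # (12, 12, 64)
```

Channels that the excitation step scores low are suppressed, which is what lets the network emphasize informative facial regions.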
- CNN extracts local features
- Transformer captures global context
- ~9M parameters
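How the hybrid hands CNN features to the Transformer can be sketched as follows: the feature map is cut into flattened patch tokens before self-attention (the patch size and feature-map shape here are illustrative assumptions, not the repo's actual values):

```python
import numpy as np

def to_patch_tokens(feature_map: np.ndarray, patch: int = 2) -> np.ndarray:
    """Flatten a (H, W, C) CNN feature map into a (num_patches, patch*patch*C)
    token sequence for a Transformer encoder."""
    h, w, c = feature_map.shape
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append(feature_map[i:i + patch, j:j + patch].reshape(-1))
    return np.stack(tokens)

fmap = np.zeros((12, 12, 64))     # e.g. output of the CNN backbone
tokens = to_patch_tokens(fmap)
print(tokens.shape)               # (36, 256): 6x6 patches of 2*2*64 features
```

Self-attention over these tokens is what gives the hybrid its global receptive field on top of the CNN's local features.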
- Optimizer: Adam (`lr=0.0001`, `decay=1e-6`)
- Epochs: 75
- Batch Size: 64
- Class weights: Based on inverse class frequencies
- Loss Function: Focal loss (to handle class imbalance)
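Focal loss down-weights well-classified examples so training focuses on hard, under-represented classes. A NumPy sketch of the standard formulation (`gamma=2` is a common default; the project's exact hyperparameters may differ):

```python
import numpy as np

def focal_loss(y_true: np.ndarray, y_pred: np.ndarray, gamma: float = 2.0) -> float:
    """Mean focal loss: FL = -(1 - p_t)^gamma * log(p_t),
    where p_t is the softmax probability of the true class."""
    eps = 1e-7
    p_t = np.clip((y_true * y_pred).sum(axis=1), eps, 1.0)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

# Two samples over the 7 emotion classes: one easy, one hard
y_true = np.eye(7)[[3, 0]]                   # true classes: Happy, Angry
y_pred = np.full((2, 7), 0.05)
y_pred[0, 3], y_pred[1, 0] = 0.9, 0.4        # confident vs. uncertain prediction
y_pred /= y_pred.sum(axis=1, keepdims=True)  # make rows valid distributions
print(focal_loss(y_true, y_pred))            # the hard sample dominates the mean
```

The `(1 - p_t)^gamma` factor shrinks the contribution of confident predictions, which is why it helps with FER2013's class imbalance.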
- Face Detection: OpenCV Haar Cascade / DNN
- Inference Pipeline:
  1. Capture a webcam frame
  2. Detect the face
  3. Preprocess (resize to 48x48 grayscale)
  4. Predict the emotion
  5. Overlay the corresponding emoji on the frame
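The overlay step can be sketched in NumPy as an alpha-blend of an RGBA emoji sprite into a frame region (a simplified stand-in for the drawing done in the GUI scripts; the array shapes and `overlay_emoji` helper are illustrative assumptions):

```python
import numpy as np

def overlay_emoji(frame: np.ndarray, emoji: np.ndarray, x: int, y: int) -> np.ndarray:
    """Alpha-blend an RGBA emoji sprite onto a BGR frame at (x, y)."""
    h, w = emoji.shape[:2]
    alpha = emoji[:, :, 3:4].astype("float32") / 255.0   # (h, w, 1) opacity
    roi = frame[y:y + h, x:x + w].astype("float32")
    blended = alpha * emoji[:, :, :3] + (1.0 - alpha) * roi
    frame[y:y + h, x:x + w] = blended.astype("uint8")
    return frame

frame = np.zeros((480, 640, 3), dtype=np.uint8)    # dummy webcam frame
emoji = np.full((64, 64, 4), 255, dtype=np.uint8)  # fully opaque white sprite
out = overlay_emoji(frame, emoji, x=100, y=50)
print(out[50, 100])  # [255 255 255]
```

In the real pipeline, `(x, y)` comes from the face bounding box returned by the detector.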
- Performance:
  - The Baseline CNN and SE-CNN run smoothly in real time.
  - The CNN+ViT performs well but with slight lag.
| Dataset | Model | Accuracy | Validation Accuracy |
|---|---|---|---|
| FER2013 | Base CNN | 52.66% | 58.87% |
| FER2013 | Attention CNN | 51.35% | 57.97% |
| FER2013 | CNN+ViT | 52.41% | 55.60% |
| AffectNet | Base CNN | 44.99% | 48.29% |
| AffectNet | Attention CNN | 34.67% | 39.83% |
| AffectNet | CNN+ViT | 40.22% | 38.85% |
- The Baseline CNN provided the best trade-off between accuracy and efficiency for real-time deployment.
- The SE-CNN added interpretability by focusing on key facial regions.
- The CNN+ViT hybrid showed robustness but was computationally more intensive.
- Explore lightweight models like MobileNet and EfficientNet for edge deployment.
- Implement multimodal emotion recognition combining facial expressions with voice or body language.
- Expand dataset diversity to improve cross-population generalization.
- Nithish Gowda H N - Btech(Hons.) CSE, AI & ML Major
- Prajna - Btech(Hons.) CSE, Cloud & Full Stack Major
- Pratham Rajesh Vernekar - Btech(Hons.) CSE, Cloud & Full Stack Major
- Nandan Kumar - Btech(Hons.) CSE, Cloud & Full Stack Major
Real-time inference showing the model identifying facial expressions
