Speech Emotion Recognition

Problem Statement

The goal of this pipeline is to test the effectiveness of combining audio features for Speech Emotion Recognition (SER).

Utilizing the RAVDESS dataset, I will examine a neural network's ability to classify the emotion of a spoken sentence based on the feature matrix extracted by this test pipeline.

The model's performance will be evaluated by its F1 Score.
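For reference, here is a hedged sketch of how that evaluation could be computed with scikit-learn; the macro averaging and the `evaluate` helper name are illustrative assumptions, not code taken from this repository.

```python
# Minimal evaluation sketch (assumed names, not repository code):
# macro F1 weights each emotion class equally.
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """Return accuracy and macro-averaged F1 for predicted emotion labels."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }

# Example usage with string labels:
print(evaluate(["happy", "sad", "angry"], ["happy", "sad", "sad"]))
```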

Background

Speech Emotion Recognition is a task of speech processing and computational paralinguistics that aims to recognize and categorize the emotions expressed in spoken language. The goal is to determine the emotional state of a speaker, such as happiness, anger, sadness, or frustration, from their speech patterns, such as prosody, pitch, and rhythm.

The RAVDESS dataset contains recordings from 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent.

Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions.

Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
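For orientation, the sketch below shows one way an emotion label can be read from a RAVDESS file name, following the identifier scheme documented on the dataset's Zenodo page (resource 3 below); the helper name and path handling are illustrative, not code from this repository.

```python
# RAVDESS file names encode metadata as seven two-digit fields,
# e.g. "03-01-06-01-02-01-12.wav"; the third field is the emotion code.
from pathlib import Path

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(path: str) -> str:
    """Map a RAVDESS .wav file name to its emotion label."""
    fields = Path(path).stem.split("-")
    return EMOTIONS[fields[2]]

print(emotion_from_filename("03-01-06-01-02-01-12.wav"))  # -> "fearful"
```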

Frameworks to Consider

| Feature / Library | torchaudio | librosa | pyDub | scipy.signal | speechbrain |
|---|---|---|---|---|---|
| Primary Use Case | Deep learning with audio (PyTorch) | Music/audio feature extraction | Audio editing/conversion | Signal processing | Speech processing with pretrained models |
| Framework Integration | Native PyTorch support | Independent | Independent | Independent | Built on top of PyTorch |
| Feature Extraction | MFCC, Mel Spectrogram, etc. | MFCC, chroma, tonnetz, tempo | Limited (only raw audio handling) | Limited (manual DSP) | Yes (MFCC, Mel, etc.) |
| I/O Formats | WAV, MP3, FLAC, etc. | WAV (via soundfile), MP3 (via ffmpeg) | MP3, WAV, etc. (via ffmpeg) | WAV (via wavfile) | WAV, MP3, FLAC (via torchaudio) |
| Transform Pipeline | Torch-based transforms | NumPy-based | Not designed for pipelines | Manual chaining of operations | Prebuilt and custom modules |
| Real-Time Audio | Basic support | No | No | No | Limited (depends on custom setup) |
| Pretrained Models | Yes (e.g., Wav2Vec2, HuBERT) | No | No | No | Yes (ASR, speaker ID, etc.) |
| Visualization Tools | Limited | Extensive (waveforms, spectrograms) | Limited | Basic (plotting with matplotlib) | Minimal (integration with matplotlib) |
| Ease of Use | Medium (needs PyTorch knowledge) | Easy | Very easy | Medium | Medium |
| License | BSD-style | ISC | MIT | BSD | Apache 2.0 |
| GPU Acceleration | Yes (via PyTorch) | No | No | No | Yes (via PyTorch) |

Notes:

  • torchaudio is ideal for deep learning workflows, especially with PyTorch.
  • librosa is excellent for traditional audio analysis and music research.
  • pyDub is great for basic audio file manipulation (concatenation, conversion).
  • scipy.signal is suitable for manual signal processing operations.
  • speechbrain is focused on building and using pretrained speech models (e.g., ASR, speaker diarization).

A Mel Spectrogram is an easy-to-understand heatmap of pitch, intensity, and time, scaled to reflect the human ear's sensitivity to these sounds.

  • The left to right axis is time — like watching a video frame by frame.

  • The up and down axis is pitch — low sounds are at the bottom, high sounds are at the top.

  • The color or brightness shows how strong each sound is — brighter or darker spots mean louder or softer sounds at that pitch and moment.

[Figure: Mel spectrogram]
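As a concrete illustration, here is a minimal librosa sketch for producing such a heatmap; the sample rate and n_mels values are common defaults chosen for illustration, not settings confirmed by this pipeline.

```python
import librosa
import numpy as np

def mel_spectrogram_db(path: str, sr: int = 22050) -> np.ndarray:
    """Load audio and return its Mel spectrogram in decibels (n_mels x time frames)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    return librosa.power_to_db(mel, ref=np.max)  # dB scale matches how we perceive loudness
```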

Mel Frequency Cepstral Coefficients (MFCCs) are numbers that essentially form the fingerprint of this Mel Spectrogram heatmap.

  • You start with a spectrogram and compress it down to just the key identifying characteristics.

  • If the Mel Spectrogram were a painting, then the MFCCs would be the brief summary of that painting.

  • They’re often used together — first you make the Mel Spectrogram, then extract the MFCCs to feed into AI or voice recognition tools.

[Figure: MFCCs]
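A minimal sketch of that compression step with librosa follows; keeping 13 coefficients is a common convention and an assumption here, not a value taken from the repository.

```python
import librosa
import numpy as np

def mfcc_features(y: np.ndarray, sr: int, n_mfcc: int = 13) -> np.ndarray:
    """Summarize each frame of the Mel spectrogram into n_mfcc coefficients."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, time frames)
```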

The chromagram represents the rhythm and melody of speech.

Emotions change pitch patterns in speech.

  • A happy or excited person might use a wider range of notes, and jump between them more.

  • A sad person might stay within a narrow, low-pitched range.

  • An angry person might use sharp, spiky changes in pitch.

The chromagram captures those changes — kind of like a musical fingerprint of how someone’s voice moves.

[Figure: Chromagram]
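For illustration, a chromagram can be computed from the short-time Fourier transform as in the sketch below; librosa also offers a CQT-based variant, so treat this particular choice as an assumption.

```python
import librosa
import numpy as np

def chroma_features(y: np.ndarray, sr: int) -> np.ndarray:
    """Energy of the 12 pitch classes per frame -- a 'musical fingerprint' of the voice."""
    return librosa.feature.chroma_stft(y=y, sr=sr)  # shape: (12, time frames)
```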

Spectral contrast represents the texture of the sound.

In emotion detection, it measures how smooth or rough the sound is, which helps tell whether speech is calm or intense.

Spectral contrast measures the difference between:

  • The loudest parts of a sound (called the peaks)

  • And the quietest parts (called the valleys)

measured across different pitch ranges (low, mid, high).

[Figure: Spectral contrast]
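A minimal librosa sketch of this measurement is shown below, using the library's default band settings (an assumption, not a setting confirmed by this pipeline).

```python
import librosa
import numpy as np

def contrast_features(y: np.ndarray, sr: int) -> np.ndarray:
    """Peak-to-valley energy difference per frequency band and frame."""
    return librosa.feature.spectral_contrast(y=y, sr=sr)  # shape: (n_bands + 1, time frames)
```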

Think of Tonnetz as a map of how emotional tone shifts in speech:

  • Angry speech might jump between harsh, dissonant pitches

  • Sad speech might linger around soft, close, mellow tones

  • Happy or excited speech might bounce through more energetic, “harmonically rich” patterns

Tonnetz doesn’t care about the exact notes — it cares about how those notes relate to each other. That relationship can help AI detect the emotion in someone’s voice.

[Figure: Tonnetz]
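As a sketch, librosa's tonnetz feature captures these tonal relationships; computing it on the harmonic component of the signal first is common practice and an assumption here, not a step confirmed by the repository.

```python
import librosa
import numpy as np

def tonnetz_features(y: np.ndarray, sr: int) -> np.ndarray:
    """Six tonal-centroid dimensions per frame, describing how pitches relate to each other."""
    harmonic = librosa.effects.harmonic(y)  # isolate the harmonic component of the signal
    return librosa.feature.tonnetz(y=harmonic, sr=sr)  # shape: (6, time frames)
```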

These features were all chosen based on extensive research into audio analysis and speech recognition techniques. The next task is to combine them into a feature matrix that will be fed into a deep learning model.
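One plausible way to combine the five features is to stack them frame-wise into a single matrix, as sketched below; the stacking order, parameter values, and trimming step are illustrative assumptions, since the repository's exact combination code is not shown here.

```python
import librosa
import numpy as np

def feature_matrix(path: str, sr: int = 22050) -> np.ndarray:
    """Stack Mel spectrogram (dB), MFCCs, chroma, spectral contrast, and tonnetz per frame."""
    y, sr = librosa.load(path, sr=sr)
    feats = [
        librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128), ref=np.max),
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr),
    ]
    n_frames = min(f.shape[1] for f in feats)              # guard against off-by-one frame counts
    return np.vstack([f[:, :n_frames] for f in feats]).T   # shape: (frames, feature dims)
```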

For testing deep learning models, Google Colab was used for its GPU acceleration to speed up training.

  • After testing a variety of models, the Bidirectional LSTM network performed the best (a minimal sketch of such a network follows this list).

  • However, results were still lackluster, as overfitting was a serious problem.

  • Final accuracy (63) and F1 score (62) did not meet expectations; I would not recommend this model for production purposes.
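For reference, here is a minimal PyTorch sketch of a bidirectional LSTM classifier over frame-wise feature sequences; the layer sizes, mean pooling over time, and the use of PyTorch itself are illustrative assumptions, not the exact architecture trained in Colab.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over (batch, frames, feature dims) sequences -> emotion logits."""

    def __init__(self, n_features: int, n_classes: int = 8, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True,
                            bidirectional=True, dropout=0.3)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)      # (batch, frames, 2 * hidden)
        pooled = out.mean(dim=1)   # average over time frames
        return self.head(pooled)   # (batch, n_classes)
```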

Better noise cancellation needs to be considered, as well as more data from other sources, which may better capture natural speech.

Network topology needs to be reworked to better learn these features; a different recurrent (RNN) architecture might be a better choice in this case.

Resources

  1. https://github.com/AnkushMalaker/speech-emotion-recognition

  2. https://www.researchgate.net/publication/315638464_Speech_Emotion_Recognition_from_Spectrograms_with_Deep_Convolutional_Neural_Network

  3. https://zenodo.org/records/1188976

  4. https://dl.acm.org/doi/10.1145/3605778
