The goal of this pipeline is to test the effectiveness of combining audio features for Speech Emotion Recognition (SER).
Using the RAVDESS dataset, I will examine a neural network's ability to classify the emotion of a sentence based on the feature matrix extracted from this test pipeline.
The model's performance will be evaluated by its F1 Score.
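For reference, this is a minimal sketch of how a macro-averaged F1 score could be computed with scikit-learn; the labels and predictions below are made up for illustration and are not results from the pipeline.

```python
# Hypothetical example: macro-averaged F1 for a multi-class emotion classifier.
from sklearn.metrics import f1_score

y_true = ["happy", "sad", "angry", "happy", "calm"]
y_pred = ["happy", "sad", "happy", "happy", "calm"]

# "macro" averages the per-emotion F1 scores equally,
# so rare emotions count as much as common ones.
print(f1_score(y_true, y_pred, average="macro"))
```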
Speech Emotion Recognition is a task of speech processing and computational paralinguistics that aims to recognize and categorize the emotions expressed in spoken language. The goal is to determine the emotional state of a speaker, such as happiness, anger, sadness, or frustration, from their speech patterns, such as prosody, pitch, and rhythm.
The RAVDESS dataset contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent.
Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions.
Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
| Feature / Library | torchaudio | librosa | pyDub | scipy.signal | speechbrain |
|---|---|---|---|---|---|
| Primary Use Case | Deep learning with audio (PyTorch) | Music/audio feature extraction | Audio editing/conversion | Signal processing | Speech processing with pretrained models |
| Framework Integration | Native PyTorch support | Independent | Independent | Independent | Built on top of PyTorch |
| Feature Extraction | MFCC, Mel Spectrogram, etc. | MFCC, chroma, tonnetz, tempo | Limited (only raw audio handling) | Limited (manual DSP) | Yes (MFCC, Mel, etc.) |
| I/O Formats | WAV, MP3, FLAC, etc. | WAV (via soundfile), MP3 (ffmpeg) | MP3, WAV, etc. (via ffmpeg) | WAV (via wavfile) | WAV, MP3, FLAC (via torchaudio) |
| Transform Pipeline | Torch-based transforms | NumPy-based | Not designed for pipelines | Manual chaining of operations | Prebuilt and custom modules |
| Real-Time Audio | Basic support | No | No | No | Limited (depends on custom setup) |
| Pretrained Models | Yes (e.g., Wav2Vec2, HuBERT) | No | No | No | Yes (ASR, speaker ID, etc.) |
| Visualization Tools | Limited | Extensive (waveforms, spectrograms) | Limited | Basic (plotting with matplotlib) | Minimal (integration with matplotlib) |
| Ease of Use | Medium (needs PyTorch knowledge) | Easy | Very Easy | Medium | Medium |
| License | BSD-Style | ISC | MIT | BSD | Apache 2.0 |
| GPU Acceleration | Yes (via PyTorch) | No | No | No | Yes (via PyTorch) |
- torchaudio is ideal for deep learning workflows, especially with PyTorch.
- librosa is excellent for traditional audio analysis and music research.
- pyDub is great for basic audio file manipulation (concatenation, conversion).
- scipy.signal is suitable for manual signal processing operations.
- speechbrain is focused on building and using pretrained speech models (e.g., ASR, speaker diarization).
A Mel Spectrogram is an easy-to-understand heatmap of pitch, intensity, and time, scaled to reflect the human ear's sensitivity to these sounds.
- The left-to-right axis is time, like watching a video frame by frame.
- The up-and-down axis is pitch: low sounds are at the bottom, high sounds are at the top.
- The color or brightness shows how strong each sound is: brighter or darker spots mean louder or softer sounds at that pitch and moment.
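A minimal sketch of computing a Mel Spectrogram, assuming librosa (one of the libraries compared above); the file path and parameter values are placeholders, not the pipeline's exact settings.

```python
# Minimal Mel Spectrogram sketch with librosa; path and parameters are illustrative.
import librosa
import numpy as np

y, sr = librosa.load("ravdess_sample.wav", sr=22050)          # waveform and sample rate
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # (n_mels, frames) power spectrogram
mel_db = librosa.power_to_db(mel, ref=np.max)                 # convert power to decibels for the "heatmap" view
print(mel_db.shape)
```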
Mel Frequency Cepstral Coefficients (MFCCs) are numbers which essentially represent the fingerprint of this Mel Spectrogram heatmap.
- You start with a spectrogram and compress it down to just the key identifying characteristics.
- If the `Mel Spectrogram` were a painting, then the `MFCCs` would be a brief summary of that painting.
- They are often used together: first you make the `Mel Spectrogram`, then extract the `MFCCs` to feed into AI or voice recognition tools.
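A minimal sketch of the "spectrogram first, then MFCCs" flow described above, again assuming librosa; the file path is a placeholder and `n_mfcc=13` is a common choice, not necessarily this pipeline's setting.

```python
# Minimal MFCC sketch: build the Mel Spectrogram, then compress it into MFCCs.
import librosa
import numpy as np

y, sr = librosa.load("ravdess_sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), sr=sr, n_mfcc=13)  # (13, frames)
mfcc_mean = np.mean(mfcc, axis=1)  # a compact per-clip "fingerprint"
print(mfcc_mean.shape)
```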
Emotions change pitch patterns in speech.
- A happy or excited person might use a wider range of notes and jump between them more.
- A sad person might stay within a narrow, low-pitched range.
- An angry person might use sharp, spiky changes in pitch.

The chromagram captures those changes, like a musical fingerprint of how someone's voice moves.
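A minimal sketch of extracting a chromagram, assuming librosa; the file path is a placeholder.

```python
# Minimal chromagram sketch with librosa.
import librosa
import numpy as np

y, sr = librosa.load("ravdess_sample.wav", sr=22050)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)  # (12, frames): energy in each of the 12 pitch classes
chroma_mean = np.mean(chroma, axis=1)             # summarizes how the voice moves across pitch classes
print(chroma_mean.shape)
```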
Spectral contrast measures the difference, in different pitch ranges (low, mid, high), between:
- The loudest parts of a sound (called the peaks)
- The quietest parts of a sound (called the valleys)

In emotion detection, it is used to measure how smooth or rough the sound is, which helps tell whether something is calm or intense.
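A minimal sketch of spectral contrast, assuming librosa; the file path is a placeholder.

```python
# Minimal spectral contrast sketch with librosa.
import librosa
import numpy as np

y, sr = librosa.load("ravdess_sample.wav", sr=22050)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # (n_bands + 1, frames): peak-to-valley difference per band
contrast_mean = np.mean(contrast, axis=1)
print(contrast_mean.shape)  # 7 values with the default 6 bands
```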
Tonnetz (tonal centroid features) captures the harmonic relationships between the pitches in speech:
- Angry speech might jump between harsh, dissonant pitches.
- Sad speech might linger around soft, close, mellow tones.
- Happy or excited speech might bounce through more energetic, “harmonically rich” patterns.

Tonnetz doesn't care about the exact notes; it cares about how those notes relate to each other. That relationship can help AI detect the emotion in someone's voice.
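A minimal sketch of tonnetz extraction, assuming librosa; the file path is a placeholder.

```python
# Minimal tonnetz sketch with librosa.
# Tonnetz is computed on the harmonic part of the signal, so the percussive component is stripped first.
import librosa
import numpy as np

y, sr = librosa.load("ravdess_sample.wav", sr=22050)
y_harmonic = librosa.effects.harmonic(y)                # keep the tonal/harmonic content
tonnetz = librosa.feature.tonnetz(y=y_harmonic, sr=sr)  # (6, frames): tonal centroid coordinates
tonnetz_mean = np.mean(tonnetz, axis=1)
print(tonnetz_mean.shape)
```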
These features have all been chosen based on extensive research into audio analysis and speech recognition techniques. The next task is to combine them into a feature matrix which will be fed into a deep learning model.
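One possible way to stack these per-clip summaries into a single feature vector, assuming the librosa calls sketched above; the helper name, the mean-over-time summary, and the feature ordering are illustrative assumptions, not the pipeline's exact implementation.

```python
# Hypothetical helper combining the five features into one vector per clip.
import librosa
import numpy as np

def extract_feature_vector(path: str, sr: int = 22050) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), sr=sr, n_mfcc=13)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)
    # Average each feature over time, then concatenate into one flat vector.
    return np.concatenate([
        np.mean(mel, axis=1),       # 128
        np.mean(mfcc, axis=1),      # 13
        np.mean(chroma, axis=1),    # 12
        np.mean(contrast, axis=1),  # 7
        np.mean(tonnetz, axis=1),   # 6
    ])

# Stacking the vectors for many files gives the feature matrix fed to the model:
# matrix = np.stack([extract_feature_vector(p) for p in wav_paths])
```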
3. Modelling
For testing deep learning models, Google Colab was used for GPU acceleration to speed up training times.
- After testing a variety of models, the `Bidirectional LSTM` network (sketched below) performed the best.
- However, results were still lackluster, as overfitting was a serious problem.
- The final `accuracy` (63) and `F1 score` (62) did not meet expectations; I would not recommend this model for production purposes.
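The write-up does not specify the framework or exact architecture, so the following is a minimal PyTorch sketch of a bidirectional LSTM classifier of the kind described above; the layer sizes, input shape, and the eight RAVDESS emotion classes are assumptions for illustration.

```python
# Minimal, assumed sketch of a bidirectional LSTM emotion classifier in PyTorch.
# Input: a batch of feature sequences shaped (batch, time_steps, n_features); sizes are illustrative.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, n_features: int = 166, hidden: int = 128, n_classes: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, bidirectional=True, dropout=0.3)
        self.fc = nn.Linear(hidden * 2, n_classes)  # *2: forward and backward states are concatenated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)          # (batch, time, hidden * 2)
        return self.fc(out[:, -1, :])  # classify from the last time step

model = BiLSTMClassifier()
logits = model(torch.randn(4, 100, 166))  # dummy batch of 4 sequences
print(logits.shape)                       # torch.Size([4, 8])
```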
Better noise cancellation needs to be considered, as well as more data from other sources, which may be a better source of natural speech.
Network topology needs to be reworked to better learn these features. RNNs might be the better choice in this case.




