The goal of this pipeline is to test the effectiveness of combining audio features for Speech Emotion Recognition (SER).
Using the RAVDESS dataset, I will examine a neural network's ability to classify the emotion of a sentence based on the feature matrix extracted from this test pipeline.
The model's performance will be evaluated by its F1 Score.
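For reference, this is a minimal sketch of how a macro-averaged F1 score could be computed with scikit-learn; the labels and predictions below are made up for illustration and are not results from the pipeline.

```python
# Hypothetical example: macro-averaged F1 for a multi-class emotion classifier.
from sklearn.metrics import f1_score

y_true = ["happy", "sad", "angry", "happy", "calm"]
y_pred = ["happy", "sad", "happy", "happy", "calm"]

# "macro" averages the per-emotion F1 scores equally,
# so rare emotions count as much as common ones.
print(f1_score(y_true, y_pred, average="macro"))
```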
Speech Emotion Recognition is a task of speech processing and computational paralinguistics that aims to recognize and categorize the emotions expressed in spoken language. The goal is to determine the emotional state of a speaker, such as happiness, anger, sadness, or frustration, from their speech patterns, such as prosody, pitch, and rhythm.
The RAVDESS dataset contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent.
Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions.
Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
| Feature / Library | torchaudio | librosa | pyDub | scipy.signal | speechbrain |
|---|---|---|---|---|---|
| Primary Use Case | Deep learning with audio (PyTorch) | Music/audio feature extraction | Audio editing/conversion | Signal processing | Speech processing with pretrained models |
| Framework Integration | Native PyTorch support | Independent | Independent | Independent | Built on top of PyTorch |
| Feature Extraction | MFCC, Mel Spectrogram, etc. | MFCC, chroma, tonnetz, tempo | Limited (only raw audio handling) | Limited (manual DSP) | Yes (MFCC, Mel, etc.) |
| I/O Formats | WAV, MP3, FLAC, etc. | WAV (via soundfile), MP3 (ffmpeg) | MP3, WAV, etc. (via ffmpeg) | WAV (via wavfile) | WAV, MP3, FLAC (via torchaudio) |
| Transform Pipeline | Torch-based transforms | NumPy-based | Not designed for pipelines | Manual chaining of operations | Prebuilt and custom modules |
| Real-Time Audio | Basic support | No | No | No | Limited (depends on custom setup) |
| Pretrained Models | Yes (e.g., Wav2Vec2, HuBERT) | No | No | No | Yes (ASR, speaker ID, etc.) |
| Visualization Tools | Limited | Extensive (waveforms, spectrograms) | Limited | Basic (plotting with matplotlib) | Minimal (integration with matplotlib) |
| Ease of Use | Medium (needs PyTorch knowledge) | Easy | Very Easy | Medium | Medium |
| License | BSD-Style | ISC | MIT | BSD | Apache 2.0 |
| GPU Acceleration | Yes (via PyTorch) | No | No | No | Yes (via PyTorch) |
- torchaudio is ideal for deep learning workflows, especially with PyTorch.
- librosa is excellent for traditional audio analysis and music research.
- pyDub is great for basic audio file manipulation (concatenation, conversion).
- scipy.signal is suitable for manual signal processing operations.
- speechbrain is focused on building and using pretrained speech models (e.g., ASR, speaker diarization).
A Mel Spectrogram is an easy-to-understand heatmap of pitch, intensity, and time, scaled to reflect the human ear's sensitivity to these sounds.
- The left-to-right axis is time, like watching a video frame by frame.
- The up-and-down axis is pitch: low sounds are at the bottom, high sounds are at the top.
- The color or brightness shows how strong each sound is: brighter or darker spots mean louder or softer sounds at that pitch and moment.
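A minimal sketch of computing a Mel Spectrogram, assuming librosa (one of the libraries compared above); the file path and parameter values are placeholders, not the pipeline's exact settings.

```python
# Minimal Mel Spectrogram sketch with librosa; path and parameters are illustrative.
import librosa
import numpy as np

y, sr = librosa.load("ravdess_sample.wav", sr=22050)          # waveform and sample rate
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # (n_mels, frames) power spectrogram
mel_db = librosa.power_to_db(mel, ref=np.max)                 # convert power to decibels for the "heatmap" view
print(mel_db.shape)
```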
Mel Frequency Cepstral Coefficients (MFCCs) are numbers which essentially represent the fingerprint of this Mel Spectrogram heatmap.
- You start with a spectrogram and compress it down to just the key identifying characteristics.
- If the `Mel Spectrogram` were a painting, then the `MFCCs` would be a brief summary of that painting.
- They are often used together: first you make the `Mel Spectrogram`, then extract the `MFCCs` to feed into AI or voice recognition tools.
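A minimal sketch of the "spectrogram first, then MFCCs" flow described above, again assuming librosa; the file path is a placeholder and `n_mfcc=13` is a common choice, not necessarily this pipeline's setting.

```python
# Minimal MFCC sketch: build the Mel Spectrogram, then compress it into MFCCs.
import librosa
import numpy as np

y, sr = librosa.load("ravdess_sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), sr=sr, n_mfcc=13)  # (13, frames)
mfcc_mean = np.mean(mfcc, axis=1)  # a compact per-clip "fingerprint"
print(mfcc_mean.shape)
```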
Emotions change pitch patterns in speech.
- A happy or excited person might use a wider range of notes and jump between them more.
- A sad person might stay within a narrow, low-pitched range.
- An angry person might use sharp, spiky changes in pitch.

The chromagram captures those changes, like a musical fingerprint of how someone's voice moves.
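A minimal sketch of extracting a chromagram, assuming librosa; the file path is a placeholder.

```python
# Minimal chromagram sketch with librosa.
import librosa
import numpy as np

y, sr = librosa.load("ravdess_sample.wav", sr=22050)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)  # (12, frames): energy in each of the 12 pitch classes
chroma_mean = np.mean(chroma, axis=1)             # summarizes how the voice moves across pitch classes
print(chroma_mean.shape)
```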
Spectral contrast measures the difference, in different pitch ranges (low, mid, high), between:
- The loudest parts of a sound (called the peaks)
- The quietest parts of a sound (called the valleys)

In emotion detection, it is used to measure how smooth or rough the sound is, which helps tell whether something is calm or intense.
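A minimal sketch of spectral contrast, assuming librosa; the file path is a placeholder.

```python
# Minimal spectral contrast sketch with librosa.
import librosa
import numpy as np

y, sr = librosa.load("ravdess_sample.wav", sr=22050)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # (n_bands + 1, frames): peak-to-valley difference per band
contrast_mean = np.mean(contrast, axis=1)
print(contrast_mean.shape)  # 7 values with the default 6 bands
```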
Tonnetz (tonal centroid features) captures the harmonic relationships between the pitches in speech:
- Angry speech might jump between harsh, dissonant pitches.
- Sad speech might linger around soft, close, mellow tones.
- Happy or excited speech might bounce through more energetic, “harmonically rich” patterns.

Tonnetz doesn't care about the exact notes; it cares about how those notes relate to each other. That relationship can help AI detect the emotion in someone's voice.
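A minimal sketch of tonnetz extraction, assuming librosa; the file path is a placeholder.

```python
# Minimal tonnetz sketch with librosa.
# Tonnetz is computed on the harmonic part of the signal, so the percussive component is stripped first.
import librosa
import numpy as np

y, sr = librosa.load("ravdess_sample.wav", sr=22050)
y_harmonic = librosa.effects.harmonic(y)                # keep the tonal/harmonic content
tonnetz = librosa.feature.tonnetz(y=y_harmonic, sr=sr)  # (6, frames): tonal centroid coordinates
tonnetz_mean = np.mean(tonnetz, axis=1)
print(tonnetz_mean.shape)
```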
These features have all been chosen based on extensive research into audio analysis and speech recognition techniques. The next task is to combine them into a feature matrix which will be fed into a deep learning model.
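One possible way to stack these per-clip summaries into a single feature vector, assuming the librosa calls sketched above; the helper name, the mean-over-time summary, and the feature ordering are illustrative assumptions, not the pipeline's exact implementation.

```python
# Hypothetical helper combining the five features into one vector per clip.
import librosa
import numpy as np

def extract_feature_vector(path: str, sr: int = 22050) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), sr=sr, n_mfcc=13)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)
    # Average each feature over time, then concatenate into one flat vector.
    return np.concatenate([
        np.mean(mel, axis=1),       # 128
        np.mean(mfcc, axis=1),      # 13
        np.mean(chroma, axis=1),    # 12
        np.mean(contrast, axis=1),  # 7
        np.mean(tonnetz, axis=1),   # 6
    ])

# Stacking the vectors for many files gives the feature matrix fed to the model:
# matrix = np.stack([extract_feature_vector(p) for p in wav_paths])
```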
3. Modelling
For testing deep learning models, Google Colab was used for GPU acceleration to speed up training times.
- After testing a variety of models, the `Bidirectional LSTM` network (sketched below) performed the best.
- However, results were still lackluster, as overfitting was a serious problem.
- The final `accuracy` (63) and `F1 score` (62) did not meet expectations; I would not recommend this model for production purposes.
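The write-up does not specify the framework or exact architecture, so the following is a minimal PyTorch sketch of a bidirectional LSTM classifier of the kind described above; the layer sizes, input shape, and the eight RAVDESS emotion classes are assumptions for illustration.

```python
# Minimal, assumed sketch of a bidirectional LSTM emotion classifier in PyTorch.
# Input: a batch of feature sequences shaped (batch, time_steps, n_features); sizes are illustrative.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, n_features: int = 166, hidden: int = 128, n_classes: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, bidirectional=True, dropout=0.3)
        self.fc = nn.Linear(hidden * 2, n_classes)  # *2: forward and backward states are concatenated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)          # (batch, time, hidden * 2)
        return self.fc(out[:, -1, :])  # classify from the last time step

model = BiLSTMClassifier()
logits = model(torch.randn(4, 100, 166))  # dummy batch of 4 sequences
print(logits.shape)                       # torch.Size([4, 8])
```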
Better noise cancellation needs to be considered, as well as more data from other sources, which may be a better source of natural speech.
Network topology needs to be reworked to better learn these features. RNNs might be the better choice in this case.




