Speech Emotion Recognition Using 1D CNN with Channel and Spatial Attention Mechanisms on Augmented MFCC Features
Speech Emotion Recognition

Introduction

🎙️ Speech Emotion Recognition (SER) using MFCCs and 1D-CNN with attention mechanisms. 🚀 Achieves state-of-the-art accuracy across six benchmark datasets (SAVEE, RAVDESS, TESS, etc.). 🧠 Robust, generalizable, and optimized for real-world human-computer interaction and assistive tech.

Highlights 🎯

We focus on high-accuracy multilingual speech emotion recognition. The model is trained on six benchmark datasets and achieves state-of-the-art performance compared with previous speech emotion recognition models on the same datasets.
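The model described in the paper is a 1D CNN over MFCC-based features with channel and spatial attention. The exact architecture is defined in the repository's training scripts; the sketch below is only an illustration, with assumed layer sizes, an assumed input shape (173 frames × 52 features), and seven output classes.

```python
# Minimal sketch (not the repository's exact model): a 1D CNN over framed
# MFCC/Chroma features with channel and spatial attention blocks.
import tensorflow as tf
from tensorflow.keras import layers, models

def channel_attention(x, ratio=8):
    # Squeeze-and-excitation style attention over the feature channels.
    channels = int(x.shape[-1])
    s = layers.GlobalAveragePooling1D()(x)
    s = layers.Dense(channels // ratio, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    return layers.Multiply()([x, layers.Reshape((1, channels))(s)])

def spatial_attention(x):
    # Attention over the time axis: pool across channels, then learn a
    # per-frame weight with a small convolution.
    avg = tf.reduce_mean(x, axis=-1, keepdims=True)
    mx = tf.reduce_max(x, axis=-1, keepdims=True)
    w = layers.Conv1D(1, kernel_size=7, padding="same", activation="sigmoid")(
        layers.Concatenate()([avg, mx]))
    return layers.Multiply()([x, w])

def build_model(time_steps, n_features, n_classes):
    inputs = layers.Input(shape=(time_steps, n_features))
    x = layers.Conv1D(64, 5, padding="same", activation="relu")(inputs)
    x = channel_attention(x)
    x = spatial_attention(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(128, 5, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

# Example shape: 173 frames x 52 features (40 MFCC + 12 chroma), 7 emotions.
model = build_model(time_steps=173, n_features=52, n_classes=7)
model.summary()
```

In this sketch the channel branch reweights feature channels globally, while the spatial branch reweights individual time frames, so the network can emphasize the most emotion-relevant frames and channels.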

Requirements

  • Python 3.8+

Python Packages

  • TensorFlow
  • librosa==0.6.3
  • numpy
  • pandas
  • scikit-learn==0.24.2
  • tqdm==4.28.1
  • matplotlib==2.2.3

Benchmark Datasets 📝

We performed the experiments on six benchmark datasets, described below.

  • RAVDESS contains 1440 files: 60 trials per actor x 24 actors = 1440. The dataset features 24 professional actors (12 female, 12 male) vocalizing two lexically matched statements in a neutral North American accent. The speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
  • CREMA-D contains 7,442 original clips from 91 actors: 48 male and 43 female, aged 20 to 74, from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified). Actors spoke from a selection of 12 sentences, each presented with one of six emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) at four emotion levels (Low, Medium, High, and Unspecified).
  • SAVEE was recorded from four native English male speakers (identified as DC, JE, JK, and KL), postgraduate students and researchers at the University of Surrey aged 27 to 31 years. Emotion is described psychologically in discrete categories: anger, disgust, fear, happiness, sadness, and surprise. A neutral category is also included, giving recordings in 7 emotion categories.
  • TESS contains 200 target words spoken in the carrier phrase "Say the word _" by two actresses (aged 26 and 64 years); recordings were made of the set portraying each of seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral), giving 2,800 audio files in total.
  • EMOVO is the first emotional corpus for the Italian language. It was built from the voices of 6 actors (3 male and 3 female) who performed 14 sentences simulating 6 emotional states (disgust, fear, anger, joy, surprise, sadness) plus the neutral state.
  • EmoDB was created by the Institute of Communication Science, Technical University of Berlin, Germany. Ten professional speakers (five male and five female) participated in the recordings. The database contains a total of 535 utterances covering seven emotions: 1) anger; 2) boredom; 3) anxiety; 4) happiness; 5) sadness; 6) disgust; and 7) neutral. The data was recorded at a 48-kHz sampling rate and then down-sampled to 16 kHz.

Emotional Classification Performance

Given the current absence of standardized benchmarks and methods in speech emotion recognition, we conducted extensive evaluations across multiple test sets and performed a thorough comparison with recent state-of-the-art results. We used six datasets: RAVDESS, TESS, SAVEE, EMO-DB, CREMA-D, and EMOVO.

Feature Extraction

Feature extraction is the main part of a speech emotion recognition system. It is accomplished by converting the speech waveform into a parametric representation at a relatively lower data rate. In this repository, we use the following features (a minimal extraction sketch follows the list):

  • MFCC
  • Chroma
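As a rough illustration of the preprocessing, the sketch below extracts MFCC and Chroma features with librosa and stacks them into per-clip feature sequences; the sampling rate, the 40 MFCC coefficients, the optional noise augmentation, and the example file path are assumptions rather than the repository's exact settings.

```python
# Minimal sketch of MFCC + Chroma extraction with librosa (paths and
# parameter values are illustrative, not the repository's exact settings).
import numpy as np
import librosa

def extract_features(wav_path, sr=22050, n_mfcc=40, augment=False):
    y, _ = librosa.load(wav_path, sr=sr)
    if augment:
        # Simple additive-noise augmentation; the paper's augmentation
        # pipeline may differ.
        y = y + 0.005 * np.random.randn(len(y))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (40, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)         # (12, frames)
    # Stack along the feature axis -> (frames, 52): a time sequence of
    # 52-dimensional feature vectors per clip.
    return np.concatenate([mfcc, chroma], axis=0).T

# Hypothetical file path, for illustration only.
features = extract_features("data/RAVDESS/Actor_01/example.wav", augment=True)
print(features.shape)
```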

Installation

To run the code, follow these steps:

Clone the repository

git clone git@github.com:spilabkorea/ser.git

Install the required libraries with the following command:

conda env create -f environment.yml

Feature extraction

To extract the MFCC and Chroma features, run the preprocessing files.

Model training

To train the models, run the training scripts; a minimal training sketch is shown below.
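As a rough illustration only, the sketch below continues from the architecture and feature-extraction sketches above: it builds placeholder feature arrays, makes an assumed 80/20 stratified split, and fits the attention CNN. All hyperparameters and the placeholder data are assumptions, not the repository's actual training configuration.

```python
# Minimal training sketch (continues from the sketches above; data, split,
# and hyperparameters are assumptions, not the repository's settings).
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

# Placeholder arrays shaped like padded MFCC+Chroma sequences; in practice
# these come from extract_features() applied to every clip in a dataset.
X = np.random.rand(200, 173, 52).astype("float32")
y = np.random.randint(0, 7, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = build_model(time_steps=173, n_features=52, n_classes=7)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          epochs=100, batch_size=32,
          callbacks=[EarlyStopping(patience=10, restore_best_weights=True)])
print(model.evaluate(X_test, y_test))
```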

Citation

@article{lee2025toward,
  title={Toward Efficient Speech Emotion Recognition via Spectral Learning and Attention},
  author={Lee, HyeYoung and Nadeem, Muhammad},
  journal={arXiv preprint arXiv:2507.03251},
  year={2025}
}
