Speech emotion recognition using LSTM, CNN, SVM and MLP, implemented in Keras.
We have improved the feature extraction method and achieved higher accuracy (about 80%). The original version is backed up on the First-Version branch.
English Document | Chinese Document (中文文档)
- Python 3.8
- Keras & TensorFlow 2
```
├── models/                // models
│   ├── common.py          // base class for all models
│   ├── dnn/               // neural networks
│   │   ├── dnn.py         // base class for all neural network models
│   │   ├── cnn.py         // CNN
│   │   └── lstm.py        // LSTM
│   └── ml.py              // SVM & MLP
├── extract_feats/         // feature extraction
│   ├── librosa.py         // extract features using librosa
│   └── opensmile.py       // extract features using Opensmile
├── utils/
│   ├── files.py           // set up datasets (classify and rename)
│   ├── opts.py            // argparse
│   └── plot.py            // plot graphs
├── configs/               // hyperparameter configs (.yaml)
├── features/              // store extracted features
├── checkpoints/           // store model weights
├── train.py               // train
├── predict.py             // recognize the emotion of a given audio
└── preprocess.py          // data preprocessing (extract features and store them locally)
```
- TensorFlow 2 / Keras: LSTM & CNN (`tensorflow.keras`)
- scikit-learn: SVM & MLP, splitting data into training and testing sets
- joblib: save and load models trained by scikit-learn
- librosa: extract features, waveform
- SciPy: spectrogram
- pandas: load features
- Matplotlib: plot graphs
- NumPy
- [Optional] Opensmile: extract features
- RAVDESS: English, around 1500 audios from 24 people (12 male and 12 female) covering 8 different emotions (the third number in the file name represents the emotion; see the parsing sketch after this list): 01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised.
- SAVEE: English, around 500 audios from 4 people (male) covering 7 different emotions (the first letter(s) of the file name represent the emotion): a = anger, d = disgust, f = fear, h = happiness, n = neutral, sa = sadness, su = surprise.
- EMO-DB: German, around 500 audios from 10 people (5 male and 5 female) covering 7 different emotions (the second-to-last letter of the file name represents the emotion): N = neutral, W = angry, A = fear, F = happy, T = sad, E = disgust, L = boredom.
- CASIA: Chinese, around 1200 audios from 4 people (2 male and 2 female) covering 6 different emotions: neutral, happy, sad, angry, fearful and surprised.
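As an illustration of how these naming schemes encode the label, here is a minimal sketch that maps a RAVDESS file name to its emotion. The helper and the sample file name are hypothetical; the mapping follows the RAVDESS entry above.

```python
# Minimal sketch: map the third number in a RAVDESS file name to its emotion.
RAVDESS_EMOTIONS = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised',
}

def ravdess_emotion(file_name: str) -> str:
    """Return the emotion encoded in a RAVDESS file name."""
    code = file_name.split('-')[2]  # the third number encodes the emotion
    return RAVDESS_EMOTIONS[code]

# Hypothetical file name for illustration:
print(ravdess_emotion('03-01-05-01-02-01-12.wav'))  # -> angry
```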
Install dependencies:

```
pip install -r requirements.txt
```

(Optional) Install Opensmile.
Parameters can be configured in the YAML config files under `configs/`.
It should be noted that currently only the following 6 Opensmile standard feature sets are supported:

- `IS09_emotion`: The INTERSPEECH 2009 Emotion Challenge, 384 features;
- `IS10_paraling`: The INTERSPEECH 2010 Paralinguistic Challenge, 1582 features;
- `IS11_speaker_state`: The INTERSPEECH 2011 Speaker State Challenge, 4368 features;
- `IS12_speaker_trait`: The INTERSPEECH 2012 Speaker Trait Challenge, 6125 features;
- `IS13_ComParE`: The INTERSPEECH 2013 ComParE Challenge, 6373 features;
- `ComParE_2016`: The INTERSPEECH 2016 Computational Paralinguistics Challenge, 6373 features.
You may need to modify `FEATURE_NUM` in `extract_feats/opensmile.py` if you want to use other feature sets.
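For orientation, here is a hypothetical sketch of what such a config file might contain. The key names below are illustrative assumptions, not the repo's actual schema; check `configs/example.yaml` for the real options.

```yaml
# Illustrative sketch only -- key names are assumptions,
# see configs/example.yaml for the actual schema.
model: lstm                  # lstm / cnn / svm / mlp
feature_method: opensmile    # opensmile or librosa
feature_set: IS10_paraling   # one of the six Opensmile sets listed above
data_path: datasets/         # root folder of the datasets
class_labels: [angry, happy, neutral, sad]
checkpoint_path: checkpoints/
epochs: 20
lr: 0.001
```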
First of all, you should extract the features of each audio in the dataset and store them locally. Features extracted by Opensmile will be saved in `.csv` files, and features extracted by librosa will be saved in `.p` files.
```
python preprocess.py --config configs/example.yaml
```

where `configs/example.yaml` is the path to your config file.
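For reference, here is a minimal sketch of the kind of feature extraction `extract_feats/librosa.py` performs, assuming mean MFCC vectors; the actual script may compute a different feature set.

```python
import librosa
import numpy as np

def extract_mfcc(audio_path: str, n_mfcc: int = 39) -> np.ndarray:
    """Load an audio file and return its MFCCs averaged over time."""
    y, sr = librosa.load(audio_path, sr=None)               # keep native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    return mfcc.mean(axis=1)                                # one feature vector per audio

# Hypothetical path for illustration:
features = extract_mfcc('datasets/angry/example.wav')
```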
The path of the datasets can be configured in `configs/`. Audios expressing the same emotion should be put in the same folder (you may want to refer to `utils/files.py` when setting up the datasets; a sketch of this step follows the example below), for example:

```
└── datasets
    ├── angry
    ├── happy
    ├── sad
    ...
```
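As referenced above, here is a rough sketch of the kind of reorganization `utils/files.py` is meant to perform, assuming a `label_of` function such as the `ravdess_emotion` sketch earlier. The names here are illustrative, not the repo's actual API.

```python
import os
import shutil

def setup_dataset(src_dir: str, dst_dir: str, label_of) -> None:
    """Copy each .wav in src_dir into dst_dir/<emotion>/ based on its file name."""
    for name in os.listdir(src_dir):
        if not name.endswith('.wav'):
            continue
        emotion_dir = os.path.join(dst_dir, label_of(name))
        os.makedirs(emotion_dir, exist_ok=True)   # create the emotion folder if missing
        shutil.copy(os.path.join(src_dir, name), emotion_dir)

# e.g. setup_dataset('raw_ravdess/', 'datasets/', ravdess_emotion)
```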
Then:

```
python train.py --config configs/example.yaml
```
This is for when you have trained a model and want to predict the emotion of a given audio. Check out `checkpoints/` for some checkpoints. First, modify the following in `predict.py`:

```python
audio_path = 'path_to_your_audio'  # str: path to the audio you want to predict
```
Then:

```
python predict.py --config configs/example.yaml
```
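Conceptually, prediction follows the usual pattern: load a trained model, extract the same features used during training, and take the class with the highest probability. Below is a generic Keras sketch, not the repo's actual API; the checkpoint path, feature extraction, and input shape are placeholder assumptions.

```python
import numpy as np
from tensorflow.keras.models import load_model

model = load_model('checkpoints/example.h5')       # hypothetical checkpoint path

features = extract_mfcc('path_to_your_audio.wav')  # from the sketch above
x = features[np.newaxis, :]                        # batch of 1; actual shape depends on the model
probs = model.predict(x)[0]                        # one probability per emotion class
print('predicted class index:', int(np.argmax(probs)))
```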
Plot a radar chart of the predicted probabilities.

Source: Radar
```python
import utils

"""
Args:
    data_prob (np.ndarray): probabilities
    class_labels (list): labels
"""
utils.radar(data_prob, class_labels)
```
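For example, with made-up probabilities for a hypothetical four-class model:

```python
import numpy as np
import utils

class_labels = ['angry', 'happy', 'neutral', 'sad']  # hypothetical label set
data_prob = np.array([0.6, 0.1, 0.2, 0.1])           # made-up probabilities
utils.radar(data_prob, class_labels)
```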
Play an audio file.

```python
import utils

utils.play_audio(file_path)
```
Plot the loss curve or accuracy curve.

```python
import utils

"""
Args:
    train (list): loss or accuracy on the training set
    val (list): loss or accuracy on the validation set
    title (str): title of the figure
    y_label (str): label of the y-axis
"""
utils.curve(train, val, title, y_label)
```
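For example, with made-up per-epoch accuracy values:

```python
import utils

train_acc = [0.52, 0.64, 0.71, 0.76, 0.80]  # made-up training accuracy per epoch
val_acc = [0.50, 0.60, 0.66, 0.70, 0.72]    # made-up validation accuracy per epoch
utils.curve(train_acc, val_acc, 'Accuracy', 'acc')
```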
Plot the waveform of an audio file.

```python
import utils

utils.waveform(file_path)
```
Plot the spectrogram of an audio file.

```python
import utils

utils.spectrogram(file_path)
```