Speech emotion recognition using LSTM, CNN, SVM and MLP, implemented in Keras.
We have improved the feature extraction method and achieved higher accuracy (about 80%). The original version is backed up on the First-Version branch.
English Document | Chinese Document (中文文档)
- Python 3.8
- Keras & TensorFlow 2
```
├── models/                // models
│   ├── common.py          // base class for all models
│   ├── dnn/               // neural networks
│   │   ├── dnn.py         // base class for all neural network models
│   │   ├── cnn.py         // CNN
│   │   └── lstm.py        // LSTM
│   └── ml.py              // SVM & MLP
├── extract_feats/         // feature extraction
│   ├── librosa.py         // extract features using librosa
│   └── opensmile.py       // extract features using Opensmile
├── utils/
│   ├── files.py           // set up datasets (classify and rename)
│   ├── opts.py            // argparse
│   └── plot.py            // plot graphs
├── configs/               // hyperparameter configs (.yaml)
├── features/              // store extracted features
├── checkpoints/           // store model weights
├── train.py               // train
├── predict.py             // recognize the emotion of a given audio
└── preprocess.py          // data preprocessing (extract features and store them locally)
```
- TensorFlow 2 / Keras: LSTM & CNN (`tensorflow.keras`)
- scikit-learn: SVM & MLP, splitting data into training and testing sets
- joblib: save and load models trained by scikit-learn
- librosa: extract features, waveform
- SciPy: spectrogram
- pandas: load features
- Matplotlib: plot graphs
- NumPy
- [Optional] Opensmile: extract features
- RAVDESS: English, around 1500 audios from 24 people (12 male and 12 female) covering 8 different emotions (the third number in the file name represents the emotion; see the parsing sketch after this list): 01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised.
- SAVEE: English, around 500 audios from 4 people (male) covering 7 different emotions (the first letter(s) of the file name represent the emotion): a = anger, d = disgust, f = fear, h = happiness, n = neutral, sa = sadness, su = surprise.
- EMO-DB: German, around 500 audios from 10 people (5 male and 5 female) covering 7 different emotions (the second-to-last letter of the file name represents the emotion): N = neutral, W = angry, A = fear, F = happy, T = sad, E = disgust, L = boredom.
- CASIA: Chinese, around 1200 audios from 4 people (2 male and 2 female) covering 6 different emotions: neutral, happy, sad, angry, fearful and surprised.
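As an illustration of how these naming schemes encode the label, here is a minimal sketch that maps a RAVDESS file name to its emotion. The helper and the sample file name are hypothetical; the mapping follows the RAVDESS entry above.

```python
# Minimal sketch: map the third number in a RAVDESS file name to its emotion.
RAVDESS_EMOTIONS = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised',
}

def ravdess_emotion(file_name: str) -> str:
    """Return the emotion encoded in a RAVDESS file name."""
    code = file_name.split('-')[2]  # the third number encodes the emotion
    return RAVDESS_EMOTIONS[code]

# Hypothetical file name for illustration:
print(ravdess_emotion('03-01-05-01-02-01-12.wav'))  # -> angry
```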
Install dependencies:

```
pip install -r requirements.txt
```

(Optional) Install Opensmile.
Parameters can be configured in the YAML config files under `configs/`.
It should be noted that currently only the following 6 Opensmile standard feature sets are supported:

- `IS09_emotion`: The INTERSPEECH 2009 Emotion Challenge, 384 features;
- `IS10_paraling`: The INTERSPEECH 2010 Paralinguistic Challenge, 1582 features;
- `IS11_speaker_state`: The INTERSPEECH 2011 Speaker State Challenge, 4368 features;
- `IS12_speaker_trait`: The INTERSPEECH 2012 Speaker Trait Challenge, 6125 features;
- `IS13_ComParE`: The INTERSPEECH 2013 ComParE Challenge, 6373 features;
- `ComParE_2016`: The INTERSPEECH 2016 Computational Paralinguistics Challenge, 6373 features.
You may need to modify `FEATURE_NUM` in `extract_feats/opensmile.py` if you want to use other feature sets.
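For orientation, here is a hypothetical sketch of what such a config file might contain. The key names below are illustrative assumptions, not the repo's actual schema; check `configs/example.yaml` for the real options.

```yaml
# Illustrative sketch only -- key names are assumptions,
# see configs/example.yaml for the actual schema.
model: lstm                  # lstm / cnn / svm / mlp
feature_method: opensmile    # opensmile or librosa
feature_set: IS10_paraling   # one of the six Opensmile sets listed above
data_path: datasets/         # root folder of the datasets
class_labels: [angry, happy, neutral, sad]
checkpoint_path: checkpoints/
epochs: 20
lr: 0.001
```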
First of all, you should extract the features of each audio in the dataset and store them locally. Features extracted by Opensmile will be saved in `.csv` files, and features extracted by librosa will be saved in `.p` files.
```
python preprocess.py --config configs/example.yaml
```

where `configs/example.yaml` is the path to your config file.
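For reference, here is a minimal sketch of the kind of feature extraction `extract_feats/librosa.py` performs, assuming mean MFCC vectors; the actual script may compute a different feature set.

```python
import librosa
import numpy as np

def extract_mfcc(audio_path: str, n_mfcc: int = 39) -> np.ndarray:
    """Load an audio file and return its MFCCs averaged over time."""
    y, sr = librosa.load(audio_path, sr=None)               # keep native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    return mfcc.mean(axis=1)                                # one feature vector per audio

# Hypothetical path for illustration:
features = extract_mfcc('datasets/angry/example.wav')
```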
The path of the datasets can be configured in `configs/`. Audios expressing the same emotion should be put in the same folder (you may want to refer to `utils/files.py` when setting up the datasets; a sketch of this step follows the example below), for example:

```
└── datasets
    ├── angry
    ├── happy
    ├── sad
    ...
```
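As referenced above, here is a rough sketch of the kind of reorganization `utils/files.py` is meant to perform, assuming a `label_of` function such as the `ravdess_emotion` sketch earlier. The names here are illustrative, not the repo's actual API.

```python
import os
import shutil

def setup_dataset(src_dir: str, dst_dir: str, label_of) -> None:
    """Copy each .wav in src_dir into dst_dir/<emotion>/ based on its file name."""
    for name in os.listdir(src_dir):
        if not name.endswith('.wav'):
            continue
        emotion_dir = os.path.join(dst_dir, label_of(name))
        os.makedirs(emotion_dir, exist_ok=True)   # create the emotion folder if missing
        shutil.copy(os.path.join(src_dir, name), emotion_dir)

# e.g. setup_dataset('raw_ravdess/', 'datasets/', ravdess_emotion)
```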
Then:

```
python train.py --config configs/example.yaml
```
This is for when you have trained a model and want to predict the emotion of a given audio. Check out `checkpoints/` for some checkpoints. First, modify the following in `predict.py`:

```python
audio_path = 'path_to_your_audio'  # str: path to the audio you want to predict
```
Then:

```
python predict.py --config configs/example.yaml
```
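Conceptually, prediction follows the usual pattern: load a trained model, extract the same features used during training, and take the class with the highest probability. Below is a generic Keras sketch, not the repo's actual API; the checkpoint path, feature extraction, and input shape are placeholder assumptions.

```python
import numpy as np
from tensorflow.keras.models import load_model

model = load_model('checkpoints/example.h5')       # hypothetical checkpoint path

features = extract_mfcc('path_to_your_audio.wav')  # from the sketch above
x = features[np.newaxis, :]                        # batch of 1; actual shape depends on the model
probs = model.predict(x)[0]                        # one probability per emotion class
print('predicted class index:', int(np.argmax(probs)))
```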
Plot a radar chart of the predicted probabilities.

Source: Radar
```python
import utils

"""
Args:
    data_prob (np.ndarray): probabilities
    class_labels (list): labels
"""
utils.radar(data_prob, class_labels)
```
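For example, with made-up probabilities for a hypothetical four-class model:

```python
import numpy as np
import utils

class_labels = ['angry', 'happy', 'neutral', 'sad']  # hypothetical label set
data_prob = np.array([0.6, 0.1, 0.2, 0.1])           # made-up probabilities
utils.radar(data_prob, class_labels)
```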
Play an audio file.

```python
import utils

utils.play_audio(file_path)
```
Plot the loss curve or accuracy curve.

```python
import utils

"""
Args:
    train (list): loss or accuracy on the training set
    val (list): loss or accuracy on the validation set
    title (str): title of the figure
    y_label (str): label of the y-axis
"""
utils.curve(train, val, title, y_label)
```
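For example, with made-up per-epoch accuracy values:

```python
import utils

train_acc = [0.52, 0.64, 0.71, 0.76, 0.80]  # made-up training accuracy per epoch
val_acc = [0.50, 0.60, 0.66, 0.70, 0.72]    # made-up validation accuracy per epoch
utils.curve(train_acc, val_acc, 'Accuracy', 'acc')
```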
Plot the waveform of an audio file.

```python
import utils

utils.waveform(file_path)
```
Plot the spectrogram of an audio file.

```python
import utils

utils.spectrogram(file_path)
```