This repository contains the code I used to train and evaluate (most of) the models described in *Combining Residual Networks with LSTMs for Lipreading* by T. Stafylakis and G. Tzimiropoulos.
The code is based on Facebook's Torch implementation of ResNets (fb.resnet.torch).
A PyTorch version of the code is now available, together with pretrained models for Visual, Audio and AudioVisual word recognition on the same database. You can find them here. They are based on the ICASSP 2018 paper *End-to-end Audiovisual Speech Recognition*, which we co-authored with S. Petridis, P. Ma, F. Cai, and M. Pantic from Imperial College London.
See the installation instructions for a step-by-step guide.
- Install Torch on a machine with CUDA GPU
- Install cuDNN v4 or v5 and the Torch cuDNN bindings
- Install the rnn package (the code has not been tested with more recent versions)
- Download the Lip Reading in the Wild dataset
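After installing the above, a quick sanity check such as the following (a minimal sketch; run it with `th`) verifies that the CUDA, cuDNN and rnn packages are visible to Torch:

```lua
-- Sanity check for the installation (illustrative only).
require 'cutorch'   -- CUDA backend
require 'cudnn'     -- Torch cuDNN bindings
require 'rnn'       -- recurrent modules used by the LSTM backend

print('GPUs visible to Torch: ' .. cutorch.getDeviceCount())
print('cuDNN version: ' .. cudnn.version)
```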
The training scripts come with several options, which can be listed with the `--help` flag.
This is the suggested order to train the models:
(i) Start by training a model with the temporal convolutional backend (set `-netType 'temp_conv'`). Set `-LR 0.003` and let it train for about 30 epochs.
(ii) Throw away the temporal convolutional backend, freeze the parameters of the frontend and the ResNet, and train the LSTM backend (set `-netType 'LSTM_init'`); a sketch of the freezing trick is shown below. Set `-LR 0.003`; 5 epochs are enough to get a sensible initialization of the LSTM.
(iii) Train the whole network end-to-end (set `-netType 'LSTM'`). In this case, set `-LR 0.0005` and train for about 30 epochs.
Step (i) should yield an error rate of about 25% and step (iii) about 17%.
All these steps are performed (semi-)automatically by the code. You should (a) change the `netType` and `LR` parameters and (b) set the `retrain` parameter to the path where the previous model is stored. For (i), set `retrain` to `none`.
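For step (ii), the freezing of the frontend and the ResNet is handled by the code when `-netType` is set to `'LSTM_init'`. Purely as an illustration of how such freezing is typically done in Torch (the file name and module index below are placeholders, not the repo's actual layout):

```lua
-- Illustrative sketch of freezing a sub-network in Torch; not the repo's exact mechanism.
require 'nn'

local function freeze(module)
   -- Turn gradient accumulation and parameter updates into no-ops for this module
   module.accGradParameters = function() end
   module.updateParameters  = function() end
end

-- Hypothetical: load the model saved after step (i) (the path you would pass via -retrain)
local model = torch.load('model_temp_conv.t7')
freeze(model:get(1))   -- assuming the first stage holds the frontend + ResNet
```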
I used a single GPU without any of the memory optimization methods of the original ResNet code (e.g. shareGradInput, optnet). In case you want to evaluate on CPU, you should convert the cudnn modules to their nn counterparts (which support CPU). To do so, use the function in `convert.lua`.
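Roughly, the conversion amounts to something like the sketch below (the model file names are placeholders; see `convert.lua` for the actual script):

```lua
-- Convert a GPU/cudnn model so it can be evaluated on CPU (illustrative sketch).
require 'cudnn'   -- also pulls in cutorch, needed to deserialize the CUDA model
require 'nn'

local model = torch.load('model_gpu.t7')
cudnn.convert(model, nn)        -- replace cudnn modules with their nn equivalents
model:float()                   -- move parameters from CUDA to CPU float tensors
model:evaluate()                -- evaluation mode (affects dropout, batchnorm)
torch.save('model_cpu.t7', model)
```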
For any questions, please send me an email at themos.stafylakis@nottingham.ac.uk or at themosst@gmail.com.
In `fast_evaluation` you will find `evaluate_examples.lua`, together with some files (in Torch format) from LRW and its vocabulary (500 words). Run the script and verify that (at least most of) the 5 examples are correctly classified. The `.t7` files are also useful for checking what the input of the ResNet should look like.
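For instance, a minimal way to inspect one of these examples, assuming each file holds a single input tensor (the file name below is a placeholder for whichever `.t7` file you pick):

```lua
-- Load one of the provided examples and check its dimensions (illustrative).
require 'torch'

local clip = torch.load('fast_evaluation/some_example.t7')
print(clip:size())   -- inspect the dimensions; expect 29 frames of the cropped mouth region
```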
The number of frames per clip is 29. In the paper we refer to 31 because I used an older version of ffmpeg to extract the images, which (for some unknown reason) prepended two copies of the first frame.
The initial learning rate is tuned for the particular batch size. If you decide to reduce the batch size (e.g. due to GPU memory limitations) you should reduce the learning rate too, otherwise the algorithm will never converge.
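As a rough rule of thumb (a common heuristic, not something the code enforces), scale the learning rate by about the same factor as the batch size; the base batch size below is just a placeholder:

```lua
-- Linear-scaling heuristic for the learning rate (illustrative numbers only).
local baseLR, baseBatchSize = 0.003, 36    -- placeholder base batch size
local newBatchSize = 18                    -- e.g. halved due to GPU memory limits
local newLR = baseLR * newBatchSize / baseBatchSize
print(newLR)   -- 0.0015
```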
In my original implementation I used landmark detection, based on which I estimated the boundaries of the mouth region. However, since the faces are already centered, you can skip this step and crop the frames using a fixed window (see `datasets/BBCnet.lua`).
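A minimal sketch of such a fixed-window crop, assuming roughly centered faces; the coordinates below are placeholders, not the ones used in `datasets/BBCnet.lua`:

```lua
-- Fixed-window mouth crop (illustrative coordinates).
require 'image'

local function cropMouth(frame)
   -- frame: [channels x] height x width tensor; the window is fixed for all clips
   local x1, y1, x2, y2 = 80, 116, 176, 212   -- placeholder mouth-region window
   return image.crop(frame, x1, y1, x2, y2)
end
```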
In the paper I used the 34-layer ResNet, although the 18-layer ResNet performs equally well. You can also play a bit with other parameters, such as `inputDim` and `hiddenDim`, or the activation function. Moreover, it would be interesting to try batch normalization and/or dropout in the BiLSTM.
I use one SoftMax per BiLSTM output, but I have also tried average pooling combined with a single SoftMax, as well as a SoftMax on the last frame only (the latter did not work well with a unidirectional LSTM, but was OK with BiLSTMs). I did not notice any substantial difference between the three approaches.
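For illustration only (not the repo's exact code), here is a sketch of a BiLSTM backend with the first two heads, using the `rnn` package; `inputDim`, `hiddenDim`, `nClasses` and the dropout value are placeholders you can experiment with:

```lua
-- Sketch of a BiLSTM backend with two alternative classification heads (illustrative).
require 'nn'
require 'rnn'

local inputDim, hiddenDim, nClasses, seqLen = 256, 256, 500, 29   -- placeholders

-- Bidirectional LSTM over the frame sequence; BiSequencer concatenates the
-- forward and backward outputs, hence 2*hiddenDim features per frame.
local bilstm = nn.Sequential()
   :add(nn.SplitTable(1))   -- seqLen x batch x inputDim -> table of per-frame tensors
   :add(nn.BiSequencer(nn.FastLSTM(inputDim, hiddenDim)))

-- (a) one SoftMax per BiLSTM output
local perFrameHead = nn.Sequencer(nn.Sequential()
   :add(nn.Dropout(0.5))
   :add(nn.Linear(2 * hiddenDim, nClasses))
   :add(nn.LogSoftMax()))

-- (b) average pooling over time followed by a single SoftMax
local pooledHead = nn.Sequential()
   :add(nn.CAddTable())
   :add(nn.MulConstant(1 / seqLen))
   :add(nn.Linear(2 * hiddenDim, nClasses))
   :add(nn.LogSoftMax())
```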
Currently, the models do not make use of the word boundaries that are provided with the dataset. However, I will soon upload code that does; it is largely based on this code, with some differences mainly in the backend and in the use of word boundaries. The performance with word boundaries is about 12.7% error rate, compared to 17.0% without.