Main repository for LipReading with Deep Neural Networks
The goal is to implement LipReading: similar to how end-to-end speech recognition systems map high-fidelity speech audio to sensible character- and word-level outputs, we will do the same for "speech visuals". In particular, we will take video frames as input, extract the relevant mouth/chin signals, and map them to characters and words.
- TODO
- Architecture: High level pipeline
- Setup: Quick setup and installation instructions
- SpaCy Setup: Setup for NLP utilities.
- Data Directories Structure: How data files are organized
- Collecting Data: See README_DATA_COLLECTION.md
- Getting Started: Finally get started on running things
- Tutorial on Configuration files: Tutorial on how to run executables via a config file
- Download Data: Collect raw data from Youtube.
- Generate Dataview: Generate dataview from raw data.
- Train Model: 🚋 Train 🚋
- Examples: Example initial configurations to experiment.
- Tensorboard Visualization
- Other Resources: Collection of reading material, and projects
A high-level overview of some TODO items. For more project details, please see the GitHub project.
- Download Data (926 videos)
- Build Vision Pipeline (1 week) [in review]
- Build NLP Pipeline (1 week) [WIP]
- Build Loss Fn and Training Pipeline (2 weeks) [WIP]
- Train 🚋 and Ship 🚢 [WIP]
There are two primary interconnected pipelines: a "vision" pipeline for extracting face and lip features from video frames, and an "NLP-inspired" pipeline for temporally correlating the sequential lip features into the final character and word outputs.
Here's a quick dive into tensor dimensionalities
Video -> Frames -> Face Bounding Box Detection -> Face Landmarking
Repr. -> (n, y, x, c) -> (n, (box=1, y_i, x_i, w_i, h_i)) -> (n, (idx=68, y, x))
-> Letters -> Words -> Language Model
-> (chars,) -> (words,) -> (sentences,)
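As an illustration only, here is a minimal sketch of the vision half of this pipeline using OpenCV for frame extraction and dlib's 68-point landmark model; the actual detectors and landmarkers used in this repository may differ, and the model file path below is an assumption.

```python
# Hypothetical sketch: video -> frames (n, y, x, c) -> face boxes -> (n, 68, 2) landmarks.
# Assumes opencv-python and dlib are installed, plus dlib's 68-point landmark model file.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model path

def video_to_landmarks(video_path):
    cap = cv2.VideoCapture(video_path)
    all_landmarks = []
    while True:
        ok, frame = cap.read()             # frame: (y, x, c) in BGR
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        boxes = detector(gray)             # face bounding boxes for this frame
        if not boxes:
            continue                       # skip frames with no detected face
        shape = predictor(gray, boxes[0])  # 68 facial landmarks for the first face
        all_landmarks.append([(pt.y, pt.x) for pt in shape.parts()])
    cap.release()
    return np.array(all_landmarks)         # (n, 68, 2), ordered (y, x)
```

In dlib's 68-point convention the mouth corresponds to landmark indices 48-67 and the jaw/chin to 0-16, so the mouth/chin signals described above are just a fixed slice of this array.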
- all: 926 videos (projected, not generated yet)
- large: 464 videos (failed at 35/464)
- medium: 104 videos (currently at 37/104)
- small: 23 videos
- micro: 6 videos
- nano: 1 video
- Clone this repository and install the requirements. We will be using python3. Before running any Python scripts, make sure your PYTHONPATH points at the repository root (./) and that the workspace environment variable is set (see below).
git clone git@github.com:joseph-zhong/LipReading.git
# (optional, setup venv) cd LipReading; python3 -m venv .
- Once the repository is cloned, the last setup step is to set the repository's PYTHONPATH and a workspace environment variable to take advantage of the standardized directory utilities in ./src/utils/utility.py. Copy the following into your ~/.bashrc:
export PYTHONPATH="$PYTHONPATH:/path/to/LipReading/"
export LIP_READING_WS_PATH="/path/to/LipReading/"
- Install the simple requirements.txt: PyTorch with CTCLoss, SpaCy, and others.
On macOS, for CPU capabilities only:
pip3 install -r requirements.macos.txt
On Ubuntu, for GPU support:
pip3 install -r requirements.ubuntu.txt
We also need to install spaCy's pre-built English model for some of the NLP capabilities:
python3 -m spacy download en
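As a quick sanity check (an assumed usage example, not code from this repo), the downloaded model should load via spaCy's shortcut link:

```python
# Verify the English model installed correctly (spaCy 2.x-style shortcut link).
import spacy

nlp = spacy.load("en")                # loads the model installed by `spacy download en`
doc = nlp("welcome to the late show")
print([token.text for token in doc])  # simple tokenization check
```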
This allows us to have a simple standardized directory structure for all our datasets, raw data, model weights, logs, etc.
./data/
--/datasets (numpy dataset files for dataloaders to load)
--/raw (raw caption/video files extracted from online sources)
--/weights (model weights, both for training/checkpointing/running)
--/tb (Tensorboard logging)
--/...
See ./src/utils/utility.py for more.
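The exact helpers in ./src/utils/utility.py are not reproduced here; the following is only a sketch of the kind of standardized path helpers this layout enables, assuming the LIP_READING_WS_PATH variable from the setup step (the function names are illustrative, not the actual API).

```python
# Hypothetical sketch of standardized directory helpers (names are illustrative,
# not the actual API of ./src/utils/utility.py).
import os

_WS_PATH = os.environ["LIP_READING_WS_PATH"]  # set in ~/.bashrc during setup

def _data_dir(subdir):
    path = os.path.join(_WS_PATH, "data", subdir)
    os.makedirs(path, exist_ok=True)          # create on first use
    return path

datasets_dir = _data_dir("datasets")  # numpy dataset files
raw_dir = _data_dir("raw")            # raw caption/video files
weights_dir = _data_dir("weights")    # model checkpoints
tb_dir = _data_dir("tb")              # Tensorboard logs
```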
Now that the dependencies are all set up, we can finally do stuff!
Each of our "standard" scripts in ./src/scripts (i.e. not ./src/scripts/misc) takes standard argparse-style arguments. For each of the "standard" scripts, you can pass --help to see the expected arguments.
To maintain reproducibility, command-line arguments can be written in a raw text file, one argument per line.
For example, the contents of ./config/gen_dataview/nano:
--inp=StephenColbert/nano
represent the arguments to pass to ./src/scripts/generate_dataview.py, which can then be passed automatically via
./src/scripts/generate_dataview.py $(cat ./config/gen_dataview/nano)
The arguments are consumed in left-to-right order, so if an argument is repeated, the later setting overwrites the earlier one. This allows for modularity when configuring hyperparameters (see the sketch after the example below).
(For demonstration purposes, not a working example)
./src/scripts/train.py \
$(cat ./config/dataset/large) \
$(cat ./config/train/model/small-model) \
$(cat ./config/train/model/rnn/lstm) \
...
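The "later settings win" behavior above is just standard argparse semantics for repeated flags; here is a minimal, standalone sketch (not the project's actual parser, and the flag name is hypothetical):

```python
# With argparse's default `store` action, a repeated flag overwrites the earlier value,
# which is what makes stacking $(cat base_config) $(cat override_config) work for overrides.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=4)  # hypothetical flag

# Equivalent to: train.py $(cat base_config) $(cat override_config)
args = parser.parse_args(["--batch_size=8", "--batch_size=32"])
print(args.batch_size)  # -> 32: the right-most (later) setting wins
```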
- Train Model: ./src/scripts/train.py
./src/scripts/train_model.py $(cat ./config/train/micro)
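Since the requirements above pull in PyTorch with CTCLoss, here is a minimal, self-contained sketch of the kind of character-level CTC training step the pipeline builds toward. The model architecture, feature sizes, and class count are placeholders, and the repository's actual loss/training code may use a different CTC binding.

```python
# Hypothetical CTC training step: landmark features -> LSTM -> per-frame character
# log-probabilities -> CTC loss against the caption characters.
import torch
import torch.nn as nn

T, N, F, H, C = 75, 2, 136, 256, 28   # frames, batch, 68*2 landmark feats, hidden, chars+blank

lstm = nn.LSTM(input_size=F, hidden_size=H)
proj = nn.Linear(H, C)
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(T, N, F)                              # stand-in for landmark features
hidden, _ = lstm(feats)                                   # (T, N, H)
log_probs = proj(hidden).log_softmax(dim=2)               # (T, N, C)

targets = torch.randint(1, C, (N, 20), dtype=torch.long)  # stand-in caption characters
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                           # gradients flow into lstm/proj
print(float(loss))
```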
This is a collection of external links, papers, projects, and other potentially helpful starting points for the project.
- Lip Reading - Cross Audio-Visual Recognition using 3D Convolutional Neural Networks (Jul. 2017, West Virginia University)
- Lip reading using CNN and LSTM (2017, Stanford)
- LipNet (Dec. 2016, DeepMind)
- Paper: https://arxiv.org/abs/1611.01599
- Original Repo: https://github.com/bshillingford/LipNet
- Working Keras Implementation: https://github.com/rizkiarm/LipNet
- Deep Audio-Visual Speech Recognition (Sept. 2018, DeepMind)
- Lip Reading Sentences in the Wild (Jan. 2017, DeepMind)
- https://arxiv.org/pdf/1611.05358.pdf
- CNN + LSTM encoder, attentive LSTM decoder
- Large-Scale Visual Speech Recognition (Oct. 2018, DeepMind)
- Lip Reading in Profile (2017, Oxford)
- Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-Task Learning (Jan. 2017, CMU)
- https://arxiv.org/pdf/1609.06773.pdf
- Joint CTC + attention model
- Unofficial implementation
- A Comparison of Sequence-to-Sequence Models for Speech Recognition (2017, Google & Nvidia)
- https://www.isca-speech.org/archive/Interspeech_2017/pdfs/0233.PDF
- CTC vs. attention vs. RNN-transducer vs. RNN-transducer w/ attention
- Exploring Neural Transducers for End-to-End Speech Recognition (July 2017, Baidu)
- https://arxiv.org/pdf/1707.07413.pdf
- CTC vs. attention vs. RNN-transducer
- Lip Reading Datasets (Oxford)