- This repository provides a complete pipeline for automatically generating a speaking actor from voice input.
- For a quick impression, there is a short demo video.
- The pipeline is composed of two parts. The first, called BFMNet (Basel Face Model network), predicts the 3D face coefficients of each frame, aligned to a fixed-stride window of the waveform. The second, called PixReferNet, redraws the real face foreground using the face rasterized from the 3D face coefficients rendered in the previous step.
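The audio-to-frame alignment above can be pictured as chopping the waveform into one fixed-stride window per video frame. A rough sketch of that idea (the sample rate, fps, and window size here are illustrative assumptions, not the repo's actual settings):

```python
def frame_windows(num_samples, sample_rate=16000, fps=25, window=3200):
    """Yield one (start, end) sample range of audio per video frame.

    Each frame advances by stride = sample_rate // fps samples; the window
    is centered on the frame and clamped to the waveform bounds.
    NOTE: parameter values are illustrative, not taken from this repo.
    """
    stride = sample_rate // fps  # 640 samples per frame at 16 kHz / 25 fps
    num_frames = num_samples // stride
    for i in range(num_frames):
        center = i * stride + stride // 2
        start = max(0, center - window // 2)
        end = min(num_samples, center + window // 2)
        yield start, end

# One second of 16 kHz audio -> 25 frame-aligned windows.
windows = list(frame_windows(16000))
print(len(windows))  # -> 25
```

BFMNet then maps each such window to the 3D face coefficients of its frame.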
BFMNet component |
---|

PixReferNet component |
---|
- Download the pretrained model and required models.
Baidu Disk: [ckpt.zip, code: a6pn], [allmodels.zip, code: brfh]
or Google Drive: [ckpt.zip], [allmodels.zip]
Extract `ckpt.zip` to `ckpt_bfmnet` and `ckpt_pixrefer`, and extract `allmodels.zip` to the current root dir.
- Build the Cython extension:
cd utils/cython && python3 setup.py install
- Install the ffmpeg tool if you want to merge the PNG sequence and the audio file into a video container such as mp4.
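For instance, the merge can be scripted along these lines (a sketch only; the frame rate, PNG pattern, and file names are assumptions to adjust to your output):

```python
import subprocess

def build_merge_cmd(png_pattern, audio_path, out_path, fps=25):
    """Build an ffmpeg command that muxes a PNG sequence with an audio file.

    png_pattern: e.g. "output/%d.png" (hypothetical path).
    """
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", png_pattern,  # image-sequence input
        "-i", audio_path,                           # audio input
        "-c:v", "libx264", "-pix_fmt", "yuv420p",   # widely compatible H.264
        "-shortest",                                # stop at the shorter stream
        out_path,
    ]

cmd = build_merge_cmd("output/%d.png", "sample/test.aac", "result.mp4")
print(" ".join(cmd))
# To actually run it (requires ffmpeg on PATH):
# subprocess.run(cmd, check=True)
```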
python3 voicepuppet/pixrefer/infer_bfmvid.py --config_path config/params.yml sample/22.jpg sample/test.aac
- tensorflow>=1.14.0
- pytorch>=1.4.0, only for data preparation (face foreground segmentation and matting)
- mxnet>=1.5.1, only for data preparation (face alignment). Tip: you can use other models, such as dlib, to do the same label marking.
- Check your `config/params.yml` to make sure the dataset folder follows the specified structure (the same as the GRID dataset; you can extend the dataset with any common video files arranged in the same folder structure).
|- srcdir/
| |- s10/
| |- video/
| |- mpg_6000/
| |- bbab8n.mpg
| |- bbab9s.mpg
| |- bbac1a.mpg
| |- ...
| |- s8/
| |- video/
| |- mpg_6000/
| |- bbae5n.mpg
| |- bbae6s.mpg
| |- bbae7p.mpg
| |- ...
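The layout above can be sanity-checked with a small script, for example (directory names are the placeholders from the tree, not fixed by the repo):

```python
import os

def list_videos(srcdir):
    """Collect videos laid out as srcdir/<speaker>/video/mpg_6000/*.mpg."""
    videos = []
    for speaker in sorted(os.listdir(srcdir)):
        mpg_dir = os.path.join(srcdir, speaker, "video", "mpg_6000")
        if not os.path.isdir(mpg_dir):
            continue  # skip anything not matching the expected layout
        for name in sorted(os.listdir(mpg_dir)):
            if name.endswith(".mpg"):
                videos.append(os.path.join(mpg_dir, name))
    return videos
```

An empty result means the folder structure does not match what the preparation scripts expect.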
- Extract the audio stream from the mpg video files; `todir` is the output folder where the labels will be stored.
python3 datasets/make_data_from_GRID.py --gpu 0 --step 2 srcdir todir
- Face detection and alignment
python3 datasets/make_data_from_GRID.py --gpu 0 --step 3 srcdir todir ./allmodels
- 3D face reconstruction
python3 datasets/make_data_from_GRID.py --gpu 0 --step 4 todir ./allmodels
- The above steps take several hours to finish. Afterwards you'll find `*.jpg`, `landmark.txt`, `audio.wav`, and `bfmcoeff.txt` in each output subfolder. Of these labels, only `audio.wav` and `bfmcoeff.txt` are used for BFMNet training; the others are temporary files.
|- todir/
| |- s10/
| |- bbab8n/
| |- landmark.txt
| |- audio.wav
| |- bfmcoeff.txt
| |- 0.jpg
| |- 1.jpg
| |- ...
| |- bbab9s/
| |- ...
| |- s8/
| |- bbae5n/
| |- landmark.txt
| |- audio.wav
| |- bfmcoeff.txt
| |- 0.jpg
| |- 1.jpg
| |- ...
| |- bbae6s/
| |- ...
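A quick check that every output subfolder contains the labels required for BFMNet training might look like this (paths are illustrative):

```python
import os

REQUIRED = ("audio.wav", "bfmcoeff.txt")  # the labels BFMNet training needs

def incomplete_clips(todir):
    """Return clip folders under todir/<speaker>/<clip>/ missing a required label."""
    bad = []
    for speaker in sorted(os.listdir(todir)):
        spk_dir = os.path.join(todir, speaker)
        if not os.path.isdir(spk_dir):
            continue
        for clip in sorted(os.listdir(spk_dir)):
            clip_dir = os.path.join(spk_dir, clip)
            if not os.path.isdir(clip_dir):
                continue
            if any(not os.path.isfile(os.path.join(clip_dir, f)) for f in REQUIRED):
                bad.append(clip_dir)
    return bad
```

Clips reported here would need the extraction and reconstruction steps re-run.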
- Face (human foreground) segmentation and matting for PixReferNet training. Before invoking the script, make sure the width and height of the video are equal (a 1:1 aspect ratio). In general, 3-5 minutes of video is enough to train the PixReferNet network; note that the trained model will only work for that specific person.
python3 datasets/make_data_from_GRID.py --gpu 0 --step 6 src_dir to_dvp_dir ./allmodels
`src_dir` has the same folder structure as [tip1 in Data preparation]. When the above step finishes, you will find `*.jpg` files in the subfolders.
- Prepare the train and eval txt files; check that the `root_path` parameter in `config/params.yml` points to the output folder of [tip1 in Data preparation].
python3 datasets/makelist_bfm.py --config_path config/params.yml
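In essence, the list maker partitions the label folders into train and eval sets; a simplified stand-in (the split ratio and seed are assumptions, the real script takes its settings from `config/params.yml`):

```python
import random

def split_train_eval(clip_dirs, eval_ratio=0.1, seed=0):
    """Shuffle clip folders deterministically and split them into train/eval lists."""
    dirs = sorted(clip_dirs)
    random.Random(seed).shuffle(dirs)  # fixed seed keeps the split reproducible
    n_eval = max(1, int(len(dirs) * eval_ratio))
    return dirs[n_eval:], dirs[:n_eval]

train, eval_ = split_train_eval([f"todir/s10/clip{i}" for i in range(20)])
print(len(train), len(eval_))  # -> 18 2
```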
- Train the model
python3 voicepuppet/bfmnet/train_bfmnet.py --config_path config/params.yml
- Watch the evaluation images every 1000 steps in `log/eval_bfmnet`; the upper row is the target sequence and the lower row is the evaluated sequence.
- Prepare the train and eval txt files; check that the `root_path` parameter in `config/params.yml` points to the output folder of [tip6 in Data preparation].
python3 datasets/makelist_pixrefer.py --config_path config/params.yml
- Train the model
python3 voicepuppet/pixrefer/train_pixrefer.py --config_path config/params.yml
- Use tensorboard to watch the training process
tensorboard --logdir=log/summary_pixrefer
- The face alignment model is based on Deepinx's work; it is more stable than dlib.
- The 3D face reconstruction model is based on Microsoft's work.
- The image segmentation model is based on gasparian's work.
- The image matting model is based on foamliu's work.