- This repository provides a complete pipeline for automatically generating a speaking actor from voice input.
- For a quick impression, there is a short demo video.
- The pipeline is composed of two parts. The first, called BFMNet (Basel Face Model network), predicts the 3D face coefficients of each frame, aligned to a fixed-stride window of the waveform. The second, called PixReferNet, redraws the real face foreground using the face rasterized from the 3D face coefficients rendered in the previous step.
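The audio-to-frame alignment above can be pictured as chopping the waveform into one fixed-stride window per video frame. A rough sketch of that idea (the sample rate, fps, and window size here are illustrative assumptions, not the repo's actual settings):

```python
def frame_windows(num_samples, sample_rate=16000, fps=25, window=3200):
    """Yield one (start, end) sample range of audio per video frame.

    Each frame advances by stride = sample_rate // fps samples; the window
    is centered on the frame and clamped to the waveform bounds.
    NOTE: parameter values are illustrative, not taken from this repo.
    """
    stride = sample_rate // fps  # 640 samples per frame at 16 kHz / 25 fps
    num_frames = num_samples // stride
    for i in range(num_frames):
        center = i * stride + stride // 2
        start = max(0, center - window // 2)
        end = min(num_samples, center + window // 2)
        yield start, end

# One second of 16 kHz audio -> 25 frame-aligned windows.
windows = list(frame_windows(16000))
print(len(windows))  # -> 25
```

BFMNet then maps each such window to the 3D face coefficients of its frame.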
BFMNet component |
---|

PixReferNet component |
---|
- Download the pretrained model and required models.
Baidu Disk: [ckpt.zip, code: a6pn], [allmodels.zip, code: brfh]
or Google Drive: [ckpt.zip], [allmodels.zip]
Extract `ckpt.zip` to `ckpt_bfmnet` and `ckpt_pixrefer`, and extract `allmodels.zip` to the current root dir.
- Build the Cython extension:
cd utils/cython && python3 setup.py install
- Install the ffmpeg tool if you want to merge the PNG sequence and the audio file into a video container such as mp4.
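For instance, the merge can be scripted along these lines (a sketch only; the frame rate, PNG pattern, and file names are assumptions to adjust to your output):

```python
import subprocess

def build_merge_cmd(png_pattern, audio_path, out_path, fps=25):
    """Build an ffmpeg command that muxes a PNG sequence with an audio file.

    png_pattern: e.g. "output/%d.png" (hypothetical path).
    """
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", png_pattern,  # image-sequence input
        "-i", audio_path,                           # audio input
        "-c:v", "libx264", "-pix_fmt", "yuv420p",   # widely compatible H.264
        "-shortest",                                # stop at the shorter stream
        out_path,
    ]

cmd = build_merge_cmd("output/%d.png", "sample/test.aac", "result.mp4")
print(" ".join(cmd))
# To actually run it (requires ffmpeg on PATH):
# subprocess.run(cmd, check=True)
```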
python3 voicepuppet/pixrefer/infer_bfmvid.py --config_path config/params.yml sample/22.jpg sample/test.aac
- tensorflow>=1.14.0
- pytorch>=1.4.0, only for data preparation (face foreground segmentation and matting)
- mxnet>=1.5.1, only for data preparation (face alignment). Tip: you can use other models, such as dlib, to do the same label marking.
- Check your `config/params.yml` to make sure the dataset folder follows the specified structure (the same as the GRID dataset; you can extend the dataset with any common video files arranged in the same folder structure).
|- srcdir/
| |- s10/
| |- video/
| |- mpg_6000/
| |- bbab8n.mpg
| |- bbab9s.mpg
| |- bbac1a.mpg
| |- ...
| |- s8/
| |- video/
| |- mpg_6000/
| |- bbae5n.mpg
| |- bbae6s.mpg
| |- bbae7p.mpg
| |- ...
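The layout above can be sanity-checked with a small script, for example (directory names are the placeholders from the tree, not fixed by the repo):

```python
import os

def list_videos(srcdir):
    """Collect videos laid out as srcdir/<speaker>/video/mpg_6000/*.mpg."""
    videos = []
    for speaker in sorted(os.listdir(srcdir)):
        mpg_dir = os.path.join(srcdir, speaker, "video", "mpg_6000")
        if not os.path.isdir(mpg_dir):
            continue  # skip anything not matching the expected layout
        for name in sorted(os.listdir(mpg_dir)):
            if name.endswith(".mpg"):
                videos.append(os.path.join(mpg_dir, name))
    return videos
```

An empty result means the folder structure does not match what the preparation scripts expect.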
- Extract the audio stream from the mpg video files; `todir` is the output folder where the labels will be stored.
python3 datasets/make_data_from_GRID.py --gpu 0 --step 2 srcdir todir
- Face detection and alignment
python3 datasets/make_data_from_GRID.py --gpu 0 --step 3 srcdir todir ./allmodels
- 3D face reconstruction
python3 datasets/make_data_from_GRID.py --gpu 0 --step 4 todir ./allmodels
- The above steps take several hours to finish. Afterwards you'll find `*.jpg`, `landmark.txt`, `audio.wav`, and `bfmcoeff.txt` in each output subfolder. Of these labels, only `audio.wav` and `bfmcoeff.txt` are used for BFMNet training; the others are temporary files.
|- todir/
| |- s10/
| |- bbab8n/
| |- landmark.txt
| |- audio.wav
| |- bfmcoeff.txt
| |- 0.jpg
| |- 1.jpg
| |- ...
| |- bbab9s/
| |- ...
| |- s8/
| |- bbae5n/
| |- landmark.txt
| |- audio.wav
| |- bfmcoeff.txt
| |- 0.jpg
| |- 1.jpg
| |- ...
| |- bbae6s/
| |- ...
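A quick check that every output subfolder contains the labels required for BFMNet training might look like this (paths are illustrative):

```python
import os

REQUIRED = ("audio.wav", "bfmcoeff.txt")  # the labels BFMNet training needs

def incomplete_clips(todir):
    """Return clip folders under todir/<speaker>/<clip>/ missing a required label."""
    bad = []
    for speaker in sorted(os.listdir(todir)):
        spk_dir = os.path.join(todir, speaker)
        if not os.path.isdir(spk_dir):
            continue
        for clip in sorted(os.listdir(spk_dir)):
            clip_dir = os.path.join(spk_dir, clip)
            if not os.path.isdir(clip_dir):
                continue
            if any(not os.path.isfile(os.path.join(clip_dir, f)) for f in REQUIRED):
                bad.append(clip_dir)
    return bad
```

Clips reported here would need the extraction and reconstruction steps re-run.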
- Face (human foreground) segmentation and matting for PixReferNet training. Before invoking the script, make sure the width and height of the video are equal (a 1:1 aspect ratio). In general, 3-5 minutes of video is enough to train the PixReferNet network; note that the trained model will only work for that specific person.
python3 datasets/make_data_from_GRID.py --gpu 0 --step 6 src_dir to_dvp_dir ./allmodels
`src_dir` has the same folder structure as [tip1 in Data preparation]. When the above step finishes, you will find `*.jpg` files in the subfolders.
- Prepare the train and eval txt files; check that the `root_path` parameter in `config/params.yml` points to the output folder of [tip1 in Data preparation].
python3 datasets/makelist_bfm.py --config_path config/params.yml
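In essence, the list maker partitions the label folders into train and eval sets; a simplified stand-in (the split ratio and seed are assumptions, the real script takes its settings from `config/params.yml`):

```python
import random

def split_train_eval(clip_dirs, eval_ratio=0.1, seed=0):
    """Shuffle clip folders deterministically and split them into train/eval lists."""
    dirs = sorted(clip_dirs)
    random.Random(seed).shuffle(dirs)  # fixed seed keeps the split reproducible
    n_eval = max(1, int(len(dirs) * eval_ratio))
    return dirs[n_eval:], dirs[:n_eval]

train, eval_ = split_train_eval([f"todir/s10/clip{i}" for i in range(20)])
print(len(train), len(eval_))  # -> 18 2
```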
- Train the model
python3 voicepuppet/bfmnet/train_bfmnet.py --config_path config/params.yml
- Watch the evaluation images every 1000 steps in `log/eval_bfmnet`; the upper row is the target sequence and the lower row is the evaluated sequence.
- Prepare the train and eval txt files; check that the `root_path` parameter in `config/params.yml` points to the output folder of [tip6 in Data preparation].
python3 datasets/makelist_pixrefer.py --config_path config/params.yml
- Train the model
python3 voicepuppet/pixrefer/train_pixrefer.py --config_path config/params.yml
- Use tensorboard to watch the training process
tensorboard --logdir=log/summary_pixrefer
- The face alignment model is based on Deepinx's work; it is more stable than dlib.
- The 3D face reconstruction model is based on Microsoft's work.
- The image segmentation model is based on gasparian's work.
- The image matting model is based on foamliu's work.