VoicePuppet

  • This repository provides a complete pipeline to automatically generate a speaking actor from voice input.
  • For a quick impression, there's a short video to demonstrate it.

The architecture of the network

  • The network is composed of 2 parts. The first, called BFMNet (Basel Face Model network), predicts the 3D face coefficients of each frame, aligned to a fixed stride window of the waveform. The second, called PixReferNet, redraws the real face foreground using the rasterized face produced by rendering the 3D face coefficients from the previous step.
BFMNet component
PixReferNet component

Run the prediction pipeline


  1. Download the pretrained checkpoints and the required models.
    Baidu Disk: [ckpt.zip, code: a6pn], [allmodels.zip, code: brfh]
    or Google Drive: [ckpt.zip], [allmodels.zip]
    Extract ckpt.zip to ckpt_bfmnet and ckpt_pixrefer, and extract allmodels.zip to the repository root directory.
  2. cd utils/cython && python3 setup.py install
  3. Install the ffmpeg tool if you want to merge the PNG sequence and the audio file into a video container such as MP4 (an example command is given after this list).
  4. python3 voicepuppet/pixrefer/infer_bfmvid.py --config_path config/params.yml sample/22.jpg sample/test.aac
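
For step 3, a typical ffmpeg command to merge a PNG sequence with the audio track looks like the one below; the frame rate, frame-name pattern, and output file are assumptions, so adjust them to match the actual inference output.

    ffmpeg -framerate 25 -i %d.png -i sample/test.aac -c:v libx264 -pix_fmt yuv420p output.mp4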

Run the training pipeline


Requirements

  • tensorflow>=1.14.0
  • pytorch>=1.4.0, only for data preparation (face foreground segmentation and matting)
  • mxnet>=1.5.1, only for data preparation (face alignment). Tip: you can use other models, such as dlib, to produce the same labels instead.
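
As a minimal sketch, the requirements can be installed with pip like this (the <2.0 cap is an assumption since the code targets TF 1.x; GPU builds such as tensorflow-gpu or a CUDA-specific mxnet wheel may be preferable on your platform):

    pip3 install "tensorflow>=1.14.0,<2.0" "torch>=1.4.0" "mxnet>=1.5.1"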

Data preparation

  1. Check your config/params.yml to make sure the dataset folder follows the structure below (the same as the GRID dataset; you can extend the dataset with any folder that uses the same structure and contains common video files).
|- srcdir/
|    |- s10/
|        |- video/
|            |- mpg_6000/
|                |- bbab8n.mpg
|                |- bbab9s.mpg
|                |- bbac1a.mpg
|                |- ...
|    |- s8/
|        |- video/
|            |- mpg_6000/
|                |- bbae5n.mpg
|                |- bbae6s.mpg
|                |- bbae7p.mpg
|                |- ...
  2. Extract the audio stream from each mpg video file; todir is the output folder where you want to store the labels.
    python3 datasets/make_data_from_GRID.py --gpu 0 --step 2 srcdir todir

  3. Face detection and alignment
    python3 datasets/make_data_from_GRID.py --gpu 0 --step 3 srcdir todir ./allmodels

  4. 3D face reconstruction
    python3 datasets/make_data_from_GRID.py --gpu 0 --step 4 todir ./allmodels

  5. It will take several hours to finish the above steps. Afterwards you'll find *.jpg, landmark.txt, audio.wav, and bfmcoeff.txt in each output subfolder, as in the tree below. The labels (audio.wav, bfmcoeff.txt) are used for BFMNet training; the other files are only temporary. A quick sanity-check script is sketched after the tree.

|- todir/
|    |- s10/
|        |- bbab8n/
|            |- landmark.txt
|            |- audio.wav
|            |- bfmcoeff.txt
|            |- 0.jpg
|            |- 1.jpg
|            |- ...
|        |- bbab9s/
|            |- ...
|    |- s8/
|        |- bbae5n/
|            |- landmark.txt
|            |- audio.wav
|            |- bfmcoeff.txt
|            |- 0.jpg
|            |- 1.jpg
|            |- ...
|        |- bbae6s/
|            |- ...
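
The following is a minimal sanity-check sketch (the script name and the default todir path are assumptions, not part of the repository) that verifies each clip subfolder contains the expected label files:

# check_labels.py -- verify the expected label files exist in each output subfolder.
# Usage: python3 check_labels.py todir
import os
import sys

todir = sys.argv[1] if len(sys.argv) > 1 else 'todir'
required = ['landmark.txt', 'audio.wav', 'bfmcoeff.txt']

for speaker in sorted(os.listdir(todir)):
    speaker_dir = os.path.join(todir, speaker)
    if not os.path.isdir(speaker_dir):
        continue
    for clip in sorted(os.listdir(speaker_dir)):
        clip_dir = os.path.join(speaker_dir, clip)
        if not os.path.isdir(clip_dir):
            continue
        files = os.listdir(clip_dir)
        missing = [f for f in required if f not in files]
        n_jpg = sum(1 for f in files if f.endswith('.jpg'))
        if missing or n_jpg == 0:
            print('%s: missing %s, %d jpg frames' % (clip_dir, missing, n_jpg))
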
  6. Face (human foreground) segmentation and matting for PixReferNet training. Before invoking the script, make sure the width and height of the video are equal (1:1 aspect ratio); a crop example is given below. In general, a 3-5 minute video is enough for training the PixReferNet network, and the trained model will only work for that specific person.
    python3 datasets/make_data_from_GRID.py --gpu 0 --step 6 src_dir to_dvp_dir ./allmodels
    The src_dir has the same folder structure as [tip1 in Data preparation]. When the above step finishes, you will find *.jpg files in the subfolders.
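
If your source video is not square, a center crop with ffmpeg is one way to meet the 1:1 requirement; the input and output file names here are placeholders.

    ffmpeg -i input.mp4 -vf "crop='min(iw,ih)':'min(iw,ih)'" -c:a copy input_square.mp4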

Train BFMNet

  1. Prepare the train and eval txt files; check that the root_path parameter in config/params.yml points to the output folder of [tip1 in Data preparation]
    python3 datasets/makelist_bfm.py --config_path config/params.yml
  2. Train the model
    python3 voicepuppet/bfmnet/train_bfmnet.py --config_path config/params.yml
  3. Watch the evaluation images every 1000 steps in log/eval_bfmnet; the upper row is the target sequence and the lower row is the predicted sequence.

Train PixReferNet

  1. Prepare the train and eval txt files; check that the root_path parameter in config/params.yml points to the output folder of [tip6 in Data preparation]
    python3 datasets/makelist_pixrefer.py --config_path config/params.yml
  2. Train the model
    python3 voicepuppet/pixrefer/train_pixrefer.py --config_path config/params.yml
  3. Use tensorboard to watch the training process
    tensorboard --logdir=log/summary_pixrefer

Acknowledgement

  1. The face alignment model is based on Deepinx's work; it's more stable than Dlib.
  2. The 3D face reconstruction model is based on Microsoft's work.
  3. The image segmentation model is based on gasparian's work.
  4. The image matting model is based on foamliu's work.