Your approach first extracts 1D facial motion embeddings (local facial dynamics) and 3D implicit keypoints (global pose, position, and scale) from the driver video. Would it be possible to substitute this first step with an existing implementation that generates these animation cues from audio/TTS instead?
This would allow for efficient portrait animation driven by audio or text+TTS.
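To make the proposed substitution concrete, here is a minimal sketch of the interface swap I have in mind. All names (`MotionCues`, `MotionDriver`, `VideoMotionDriver`, `AudioMotionDriver`, `animate`) are hypothetical and not from your codebase; the point is only that if the video-based extractor and an audio-driven predictor both emit the same motion embedding + implicit keypoints structure, the downstream renderer would not need to change:

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass
class MotionCues:
    """Hypothetical container for the per-frame animation cues."""
    motion_embedding: List[float]              # 1D local facial dynamics
    keypoints_3d: List[Tuple[float, float, float]]  # implicit keypoints (pose/position/scale)


class MotionDriver(Protocol):
    """Anything that can produce MotionCues for a given frame index."""
    def extract(self, frame_index: int) -> MotionCues: ...


class VideoMotionDriver:
    """Current pipeline: cues extracted from a driver video (placeholder logic)."""
    def extract(self, frame_index: int) -> MotionCues:
        return MotionCues([0.0] * 8, [(0.0, 0.0, 1.0)] * 5)


class AudioMotionDriver:
    """Proposed substitution: cues predicted from audio/TTS features.

    Placeholder: a real implementation would run an audio-to-motion model.
    """
    def __init__(self, audio_features: List[float]) -> None:
        self.audio_features = audio_features

    def extract(self, frame_index: int) -> MotionCues:
        v = self.audio_features[frame_index % len(self.audio_features)]
        return MotionCues([v] * 8, [(v, 0.0, 1.0)] * 5)


def animate(driver: MotionDriver, num_frames: int) -> List[MotionCues]:
    """Downstream animation loop; agnostic to where the cues come from."""
    return [driver.extract(i) for i in range(num_frames)]
```

With this shape, swapping drivers is a one-line change at the call site, e.g. `animate(AudioMotionDriver(features), n)` instead of `animate(VideoMotionDriver(), n)`.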