Your approach first extracts 1D facial motion embeddings (local facial dynamics) and 3D implicit keypoints (global pose, position, and scale) from the driver video. Would it be possible to substitute this first step with an existing implementation that generates these animation cues from audio/TTS instead?
This would allow for efficient portrait animation driven by audio or text+TTS.
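To make the proposed substitution concrete, here is a minimal sketch of the interface swap I have in mind. All names (`MotionCues`, `MotionDriver`, `VideoMotionDriver`, `AudioMotionDriver`, `animate`) are hypothetical and not from your codebase; the point is only that if the video-based extractor and an audio-driven predictor both emit the same motion embedding + implicit keypoints structure, the downstream renderer would not need to change:

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass
class MotionCues:
    """Hypothetical container for the per-frame animation cues."""
    motion_embedding: List[float]              # 1D local facial dynamics
    keypoints_3d: List[Tuple[float, float, float]]  # implicit keypoints (pose/position/scale)


class MotionDriver(Protocol):
    """Anything that can produce MotionCues for a given frame index."""
    def extract(self, frame_index: int) -> MotionCues: ...


class VideoMotionDriver:
    """Current pipeline: cues extracted from a driver video (placeholder logic)."""
    def extract(self, frame_index: int) -> MotionCues:
        return MotionCues([0.0] * 8, [(0.0, 0.0, 1.0)] * 5)


class AudioMotionDriver:
    """Proposed substitution: cues predicted from audio/TTS features.

    Placeholder: a real implementation would run an audio-to-motion model.
    """
    def __init__(self, audio_features: List[float]) -> None:
        self.audio_features = audio_features

    def extract(self, frame_index: int) -> MotionCues:
        v = self.audio_features[frame_index % len(self.audio_features)]
        return MotionCues([v] * 8, [(v, 0.0, 1.0)] * 5)


def animate(driver: MotionDriver, num_frames: int) -> List[MotionCues]:
    """Downstream animation loop; agnostic to where the cues come from."""
    return [driver.extract(i) for i in range(num_frames)]
```

With this shape, swapping drivers is a one-line change at the call site, e.g. `animate(AudioMotionDriver(features), n)` instead of `animate(VideoMotionDriver(), n)`.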