Hi,
AV_MossFormer2_TSE_16K is awesome. I ran the following and it generally works, but the output needs some cleanup.
from clearvoice import ClearVoice
myClearVoice = ClearVoice(task='target_speaker_extraction', model_names=['AV_MossFormer2_TSE_16K'])
# 1st calling method: process an input video and return output video, then write outputs to 'separate_audio'
output_wav = myClearVoice(input_path='input.mp4', online_write=True, output_path='separate_audio')
The issue is that when the model detects a speaker later in the video (not at the first frame), the extracted audio is written starting from the first frame, which causes a large desync in the video_est_x.mp4 files.
For example, if a speaker is first detected at the 00:05 mark, the audio in the corresponding video_est_x.mp4 file is shifted 5 seconds earlier than it should be.
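To illustrate the alignment I would expect, here is a minimal sketch of a possible workaround: padding the extracted audio with leading silence equal to the detection offset. It assumes the detection start time is known and that the audio is a mono 16 kHz waveform (suggested by the model name). `pad_leading_silence` is a hypothetical helper of mine, not part of ClearVoice.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumption: 16 kHz, as the model name suggests


def pad_leading_silence(audio, start_sec, sample_rate=SAMPLE_RATE):
    """Prepend start_sec seconds of silence to a 1-D mono waveform.

    Hypothetical helper, not a ClearVoice API.
    """
    n_pad = int(round(start_sec * sample_rate))
    silence = np.zeros(n_pad, dtype=audio.dtype)
    return np.concatenate([silence, audio])


# Example: a speaker first detected at the 5-second mark.
extracted = np.ones(SAMPLE_RATE, dtype=np.float32)  # 1 s of dummy audio
aligned = pad_leading_silence(extracted, start_sec=5.0)
```

With this padding, the dummy audio begins at sample 80000 (5 s x 16000 Hz), which is where I would expect the extracted speech to start in the output file.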
Thank you in advance.