An official implementation of "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
video
localization
caption
alignment
segmentation
coin
multimodality
joint
multimodal-sentiment-analysis
pretrain
pretraining
msrvtt
video-text-retrieval
video-text
video-language
youcookii
retrieval-task
caption-task
Updated Jul 25, 2024 - Python