Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that exploits Contrastive Language-Image Pre-training (CLIP), trained on large-scale general-domain language-image pairs, to guide cross-modal learning for VideoQA. Specifically, we first extract video features with a TimeSformer and text features with a BERT model from the target application domain, and use CLIP to extract a pair of visual-text features from the general-knowledge domain through domain-specific learning. We then propose a cross-domain learning module to extract attention information between visual and linguistic features across the target domain and the general domain. The resulting set of CLIP-guided visual-text features is integrated to predict the answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets and outperforms state-of-the-art methods.
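To make the described pipeline concrete, the following is a minimal sketch of CLIP-guided cross-modal attention: target-domain video (TimeSformer) and text (BERT) features attend to general-domain CLIP visual/text features, and the fused representation predicts the answer. All module names, feature dimensions, and the pooling/fusion choices here are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class CLIPGuidedAttention(nn.Module):
    """Sketch of CLIP-guided cross-domain attention for VideoQA (hypothetical)."""
    def __init__(self, dim=512, num_heads=8, num_answers=1000):
        super().__init__()
        # Cross-domain attention: target-domain features query CLIP-domain features.
        self.video_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, video_feat, text_feat, clip_visual, clip_text):
        # video_feat: (B, Tv, dim) TimeSformer features from the target domain
        # text_feat:  (B, Tt, dim) BERT question features from the target domain
        # clip_visual, clip_text: (B, Tc, dim) CLIP features from the general domain
        v, _ = self.video_attn(video_feat, clip_visual, clip_visual)  # CLIP-guided video
        t, _ = self.text_attn(text_feat, clip_text, clip_text)        # CLIP-guided text
        fused = torch.cat([v.mean(dim=1), t.mean(dim=1)], dim=-1)     # pool and integrate
        return self.classifier(fused)                                 # answer logits

# Usage with random tensors standing in for pre-extracted features.
model = CLIPGuidedAttention()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 20, 512),
               torch.randn(2, 16, 512), torch.randn(2, 20, 512))
print(logits.shape)  # torch.Size([2, 1000])
```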