We introduce a new task, named video corpus visual answer localization (VCVAL), which aims to locate the visual answer in a large collection of untrimmed, unsegmented instructional videos using a natural language question. This task requires a range of skills: vision-language interaction, video retrieval, passage comprehension, and visual answer localization. In this paper, we propose a cross-modal contrastive global-span (CCGS) method for VCVAL, jointly training the video corpus retrieval and visual answer localization subtasks. More precisely, we first enrich the video question-answer semantics by adding element-wise visual information into the pre-trained language model, and then design a novel global-span predictor that locates the visual answer span from the fused information. Global-span contrastive learning is adopted to distinguish the span points of positive and negative samples using the global-span matrix. We reconstruct a dataset named MedVidCQA, on which the VCVAL task is benchmarked. Experimental results show that the proposed method outperforms competitive baselines on both the video corpus retrieval and visual answer localization subtasks. Most importantly, we provide detailed analyses across extensive experiments, paving a new path for understanding instructional videos and opening avenues for further research.
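To make the global-span idea concrete, the following is a minimal PyTorch sketch of one plausible reading of the abstract, not the paper's actual implementation: all function and variable names here are hypothetical. Start and end logits over each candidate video's clips are combined into a span-score matrix, and the gold span is contrasted against every span in every candidate video, so retrieval and localization are trained jointly.

```python
import torch
import torch.nn.functional as F

def global_span_matrix(start_logits: torch.Tensor,
                       end_logits: torch.Tensor) -> torch.Tensor:
    """Outer sum of start/end logits: a score for every (start, end) span.

    start_logits, end_logits: (num_videos, seq_len)
    returns: (num_videos, seq_len, seq_len)
    """
    return start_logits.unsqueeze(2) + end_logits.unsqueeze(1)

def global_span_contrastive_loss(span_matrix: torch.Tensor,
                                 pos_video: int,
                                 pos_start: int,
                                 pos_end: int) -> torch.Tensor:
    """Contrast the gold span against all spans from all candidate videos."""
    num_videos, seq_len, _ = span_matrix.shape
    # Flatten the corpus-wide span scores into one softmax, so spans from
    # negative videos act as contrastive negatives for the positive span.
    scores = span_matrix.reshape(1, -1)
    target = torch.tensor([pos_video * seq_len * seq_len
                           + pos_start * seq_len + pos_end])
    return F.cross_entropy(scores, target)

# Toy usage: 4 candidate videos of 16 clips each; the gold visual answer
# is clips 3..7 of video 0.
start = torch.randn(4, 16)
end = torch.randn(4, 16)
loss = global_span_contrastive_loss(global_span_matrix(start, end), 0, 3, 7)
```

One corpus-wide softmax, rather than a per-video one, is what would let a single objective both rank the correct video above negatives and pick the correct span inside it, which matches the abstract's description of joint training.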