Video Question Answering (VideoQA) methods focus on commonsense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches, however, ignore the textual information present in the video. In contrast, we argue that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel VideoQA task that requires reading and understanding the text in the video. To explore this direction, we focus on news videos and require QA systems to comprehend and answer questions about the topics presented, by combining visual and textual cues in the video. We introduce the ``NewsVideoQA'' dataset, which comprises more than $8,600$ QA pairs on $3,000+$ news videos obtained from diverse news channels around the world. We demonstrate the limitations of current Scene Text VQA and VideoQA methods and propose ways to incorporate scene text information into VideoQA approaches.