Video question answering (VideoQA) is an essential task in vision-language understanding and has attracted considerable research attention recently. Nevertheless, existing works mostly achieve promising performance on short videos, typically under 15 seconds in duration. For VideoQA on minute-level long-term videos, these methods are likely to fail because they lack the ability to handle the noise and redundancy caused by scene changes and multiple actions in the video. Considering that the content relevant to a question is often concentrated in a short temporal range, we propose to first localize the question to a segment in the video and then infer the answer using only the located segment. Under this scheme, we propose "Locate before Answering" (LocAns), a novel approach that integrates a question locator and an answer predictor into an end-to-end model. During the training phase, the available answer label not only serves as the supervision signal for the answer predictor, but is also used to generate pseudo temporal labels for the question locator. Moreover, we design a decoupled alternating training strategy to update the two modules separately. In the experiments, LocAns achieves state-of-the-art performance on two modern long-term VideoQA datasets, NExT-QA and ActivityNet-QA, and qualitative examples demonstrate the reliability of its question localization.
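To make the decoupled alternating scheme concrete, below is a minimal sketch of how such a training loop might look. All module definitions, feature dimensions, and the pseudo-label heuristic (`QuestionLocator`, `AnswerPredictor`, `pseudo_temporal_label`, `FEAT_DIM`, etc.) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of a decoupled alternating training loop, assuming
# pre-extracted segment features. Module designs and the pseudo-label
# heuristic are illustrative assumptions, not LocAns' actual method.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, NUM_SEGMENTS, NUM_ANSWERS = 256, 8, 1000  # assumed sizes

class QuestionLocator(nn.Module):
    """Scores each video segment for relevance to the question."""
    def __init__(self):
        super().__init__()
        self.score = nn.Linear(2 * FEAT_DIM, 1)
    def forward(self, video_feats, question_feat):
        # video_feats: (B, T, D); question_feat: (B, D) -> scores (B, T)
        q = question_feat.unsqueeze(1).expand(-1, video_feats.size(1), -1)
        return self.score(torch.cat([video_feats, q], dim=-1)).squeeze(-1)

class AnswerPredictor(nn.Module):
    """Predicts the answer from the located segment and the question."""
    def __init__(self):
        super().__init__()
        self.cls = nn.Linear(2 * FEAT_DIM, NUM_ANSWERS)
    def forward(self, segment_feat, question_feat):
        return self.cls(torch.cat([segment_feat, question_feat], dim=-1))

locator, predictor = QuestionLocator(), AnswerPredictor()
opt_loc = torch.optim.Adam(locator.parameters(), lr=1e-4)
opt_ans = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def pseudo_temporal_label(video_feats, question_feat, answer_label):
    # Hypothetical heuristic: the answer label induces a pseudo temporal
    # label by picking the segment whose logit for the ground-truth
    # answer is highest under the current predictor.
    with torch.no_grad():
        batch = torch.arange(video_feats.size(0))
        logits = torch.stack(
            [predictor(video_feats[:, t], question_feat)[batch, answer_label]
             for t in range(video_feats.size(1))], dim=1)  # (B, T)
        return logits.argmax(dim=1)  # (B,)

def train_step(video_feats, question_feat, answer_label, update_locator):
    if update_locator:
        # Phase 1: update only the locator against pseudo temporal labels.
        target = pseudo_temporal_label(video_feats, question_feat, answer_label)
        loss = F.cross_entropy(locator(video_feats, question_feat), target)
        opt_loc.zero_grad(); loss.backward(); opt_loc.step()
    else:
        # Phase 2: update only the predictor on the located segment.
        with torch.no_grad():
            idx = locator(video_feats, question_feat).argmax(dim=1)
        segment = video_feats[torch.arange(video_feats.size(0)), idx]
        loss = F.cross_entropy(predictor(segment, question_feat), answer_label)
        opt_ans.zero_grad(); loss.backward(); opt_ans.step()
    return loss.item()

# Toy usage with random features, alternating between the two phases.
v = torch.randn(4, NUM_SEGMENTS, FEAT_DIM)
q = torch.randn(4, FEAT_DIM)
a = torch.randint(0, NUM_ANSWERS, (4,))
for step in range(4):
    train_step(v, q, a, update_locator=(step % 2 == 0))
```

The key design point the sketch illustrates is the decoupling: each phase freezes one module (via `torch.no_grad()` on its output) and updates the other, so the locator and predictor never receive gradients in the same step.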