Multi-modal video question answering aims to predict the correct answer and localize the temporal boundary relevant to the question. Temporal annotations of questions improve QA performance and the interpretability of recent works, but they are usually empirical and costly to obtain. To avoid temporal annotations, we devise a weakly supervised question grounding (WSQG) setting, in which only QA annotations are used and the relevant temporal boundaries are generated from the temporal attention scores. As a substitute for temporal annotations, we transform the correspondence between frames and subtitles into Frame-Subtitle (FS) self-supervision, which helps optimize the temporal attention scores and hence improves video-language understanding in the VideoQA model. Extensive experiments on the TVQA and TVQA+ datasets demonstrate that the proposed WSQG strategy achieves comparable question-grounding performance, and that FS self-supervision improves both question answering and question grounding under QA-supervision-only and full-supervision settings.
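As an illustration of the WSQG idea, a temporal boundary can be derived directly from per-frame attention scores, e.g. by keeping the frames whose score exceeds a fraction of the peak score. This is a minimal hypothetical sketch, not the authors' actual rule; the function name and threshold are assumptions for illustration only.

```python
def boundary_from_attention(scores, threshold_ratio=0.5):
    """Hypothetical sketch: return (start, end) frame indices whose
    temporal attention score exceeds threshold_ratio * max(scores).
    The paper's actual score-to-boundary rule may differ."""
    peak = max(scores)
    # Keep frames with attention at least threshold_ratio of the peak.
    keep = [i for i, s in enumerate(scores) if s >= threshold_ratio * peak]
    # The boundary spans the first and last kept frames.
    return (keep[0], keep[-1])

# Example: attention peaks on frames 2-3, so the predicted boundary is (2, 3).
print(boundary_from_attention([0.1, 0.2, 0.9, 0.8, 0.3, 0.1]))  # (2, 3)
```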