To date, visual question answering (VQA) (i.e., image QA and video QA) remains a holy grail in vision and language understanding, especially for video QA. Compared with image QA, which focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to jointly reason across both the spatial and long-range temporal structures of a video as well as the text to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode the text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates different segments of the query- and video-aware context representations and infers the answers. Experiments on the large-scale video QA dataset \textit{TGIF-QA} show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by 13.0%, 13.5%, 11.0%, and 0.3 on the Action, Trans., FrameQA, and Count tasks, respectively. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans., and FrameQA tasks by 4.1%, 4.7%, and 5.1%.
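The abstract summarizes a three-stage pipeline: segment-level video encoding, two-stream attention, and fusion. The sketch below is a minimal PyTorch illustration of the two-stream attention idea only (question-guided attention over video segments and video-guided attention over question words, followed by fusion); the module names, feature dimensions (\texttt{vid\_dim}, \texttt{txt\_dim}, \texttt{hid}), and layer choices are illustrative assumptions, not the actual STA architecture or its hyperparameters.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAttentionSketch(nn.Module):
    """Illustrative two-stream attention: one stream attends over video
    segment features conditioned on the question, the other attends over
    question words conditioned on the video. Dimensions and layer choices
    are assumptions, not the paper's exact configuration."""
    def __init__(self, vid_dim=2048, txt_dim=300, hid=512):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, hid)
        self.txt_proj = nn.Linear(txt_dim, hid)
        self.vid_att = nn.Linear(hid, 1)   # scores video segments
        self.txt_att = nn.Linear(hid, 1)   # scores question words
        self.fuse = nn.Linear(2 * hid, hid)

    def forward(self, vid_segs, q_words):
        # vid_segs: (B, S, vid_dim) pooled features of S temporal segments
        # q_words:  (B, T, txt_dim) word embeddings of the question
        v = torch.tanh(self.vid_proj(vid_segs))   # (B, S, hid)
        q = torch.tanh(self.txt_proj(q_words))    # (B, T, hid)
        q_global = q.mean(dim=1, keepdim=True)    # (B, 1, hid)
        v_global = v.mean(dim=1, keepdim=True)    # (B, 1, hid)

        # Stream 1: question-guided attention over video segments
        a_v = F.softmax(self.vid_att(v * q_global).squeeze(-1), dim=1)  # (B, S)
        v_ctx = torch.bmm(a_v.unsqueeze(1), v).squeeze(1)               # (B, hid)

        # Stream 2: video-guided attention over question words
        a_q = F.softmax(self.txt_att(q * v_global).squeeze(-1), dim=1)  # (B, T)
        q_ctx = torch.bmm(a_q.unsqueeze(1), q).squeeze(1)               # (B, hid)

        # Fuse the two attended context vectors for the answer decoder
        return torch.tanh(self.fuse(torch.cat([v_ctx, q_ctx], dim=-1)))
\end{verbatim}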