We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark that advances video understanding from describing to explaining temporal actions. Based on the dataset, we set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension. Through extensive analysis of baselines and established VideoQA techniques, we find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning. Furthermore, models that are effective on multi-choice QA still struggle to generalize their answers when adapted to open-ended QA. This raises doubts about the reasoning ability of these models and highlights room for improvement. With detailed results for different question types and heuristic observations for future work, we hope NExT-QA will guide the next generation of VQA research to go beyond superficial scene description towards a deeper understanding of videos. (The dataset and related resources are available at https://github.com/doc-doc/NExT-QA.git)