Video Question Answering (VideoQA) aims to answer natural language questions about given videos. It has attracted increasing attention with recent research trends in joint vision and language understanding. Yet, compared with ImageQA, VideoQA remains largely underexplored and has progressed slowly. Although different algorithms have continually been proposed and shown success on different VideoQA datasets, we find that a meaningful survey categorizing them is lacking, which seriously impedes the field's advancement. This paper thus provides a clear taxonomy and comprehensive analysis of VideoQA, focusing on the datasets, algorithms, and unique challenges. We then point out the research trend of moving beyond factoid QA to inference QA, towards the cognition of video content. Finally, we conclude with some promising directions for future exploration.