Existing video understanding datasets mostly focus on human interactions, with little attention paid to "in the wild" settings, where the videos are recorded outdoors. We propose WILDQA, a video understanding dataset of videos recorded in outdoor settings. In addition to video question answering (Video QA), we also introduce the new task of identifying visual support for a given question and answer (Video Evidence Selection). Through evaluations using a wide range of baseline models, we show that WILDQA poses new challenges to the vision and language research communities. The dataset is available at https://lit.eecs.umich.edu/wildqa/.