For AI to be safely deployed in real-world settings such as hospitals, schools, and the workplace, it must be able to reason robustly about the physical world. Fundamental to this reasoning is physical common sense: understanding the physical properties and affordances of available objects, how they can be manipulated, and how they interact with other objects. Physical commonsense reasoning is inherently a multi-sensory task, since physical properties are manifested through multiple modalities, two of which are vision and acoustics. Our paper takes a step towards real-world physical commonsense reasoning by contributing PACS: the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. Our dataset provides new opportunities to advance research on physical reasoning by making audio a core component of this multimodal problem. Using PACS, we evaluate multiple state-of-the-art models on our new, challenging task. While some models show promising results (70% accuracy), they all fall short of human performance (95% accuracy). We conclude the paper by demonstrating the importance of multimodal reasoning and outlining possible avenues for future research.