We present a new probing dataset named PROST: Physical Reasoning about Objects Through Space and Time. This dataset contains 18,736 multiple-choice questions made from 14 manually curated templates, covering 10 physical reasoning concepts. All questions are designed to probe both causal and masked language models in a zero-shot setting. We conduct an extensive analysis which demonstrates that state-of-the-art pretrained models are inadequate at physical reasoning: they are influenced by the order in which answer options are presented to them, they struggle when the superlative in a question is inverted (e.g., most <-> least), and increasing the amount of pretraining data and parameters only yields minimal improvements. These results provide support for the hypothesis that current pretrained models' ability to reason about physical interactions is inherently limited by a lack of real world experience. By highlighting these limitations, we hope to motivate the development of models with a human-like understanding of the physical world.