A critical aspect of human visual perception is the ability to parse visual scenes into individual objects and further into object parts, forming part-whole hierarchies. Such composite structures can induce a rich set of semantic concepts and relations, thus playing an important role in the interpretation and organization of visual signals as well as in the generalization of visual perception and reasoning. However, existing visual reasoning benchmarks mostly focus on objects rather than parts. Visual reasoning based on the full part-whole hierarchy is much more challenging than object-centric reasoning due to finer-grained concepts, richer geometric relations, and more complex physics. Therefore, to better serve part-based conceptual, relational, and physical reasoning, we introduce a new large-scale diagnostic visual reasoning dataset named PTR. PTR contains around 70k RGBD synthetic images with ground-truth object- and part-level annotations covering semantic instance segmentation, color attributes, spatial and geometric relationships, and certain physical properties such as stability. These images are paired with 700k machine-generated questions spanning various reasoning types, making them a good testbed for visual reasoning models. We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes in situations where humans can easily infer the correct answer. We believe this dataset will open up new opportunities for part-based reasoning.
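To make the described annotation scheme concrete, the sketch below shows what a single PTR example might look like when loaded into Python. This is a minimal illustration assuming a simple JSON-like layout; all field names and values are hypothetical and are derived only from the annotation types listed above (part-level segmentation, color, spatial and geometric relations, stability, and paired question-answer annotations), not from the dataset's actual release format.

```python
# Hypothetical sketch of one PTR example. Field names are illustrative
# assumptions based on the annotation types named in the abstract; the
# actual dataset format may differ.
sample = {
    "image": "PTR_train_000001.png",        # RGBD synthetic image
    "objects": [
        {
            "category": "chair",            # object-level label
            "stability": "stable",          # physical property annotation
            "parts": [
                {
                    "name": "leg",          # part-level label
                    "segmentation": "...",  # instance segmentation mask (elided)
                    "color": "red",         # color attribute
                },
            ],
        },
    ],
    "relationships": {
        "spatial": [],                       # e.g. object/part spatial relations
        "geometric": [],                     # e.g. geometric relations between parts
    },
    "questions": [
        {
            "question": "What color is the leg of the chair?",
            "answer": "red",                 # machine-generated QA pair
        },
    ],
}
```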