Passive visual systems typically fail to recognize objects in the amodal setting, where the objects are heavily occluded. In contrast, humans and other embodied agents can move through the environment and actively control their viewing angle to better understand object shapes and semantics. In this work, we introduce the task of Embodied Visual Recognition (EVR): an agent is instantiated in a 3D environment close to an occluded target object, and is free to move in the environment to perform object classification, amodal object localization, and amodal object segmentation. To address this task, we develop a new model called Embodied Mask R-CNN, with which agents learn to move strategically to improve their visual recognition abilities. We conduct experiments using the House3D environment. Experimental results show that: 1) agents with embodiment (movement) achieve better visual recognition performance than passive ones; 2) to improve their visual recognition abilities, agents can learn strategic movement paths that differ from shortest paths.