Automatically discovering composable abstractions from raw perceptual data is a long-standing challenge in machine learning. Recent slot-based neural networks that learn about objects in a self-supervised manner have made exciting progress in this direction. However, they typically fail to adequately capture the spatial symmetries present in the visual world, which leads to sample inefficiency, for example when object appearance and pose become entangled. In this paper, we present a simple yet highly effective method for incorporating spatial symmetries via slot-centric reference frames. We incorporate equivariance to per-object pose transformations into the attention and generation mechanisms of Slot Attention by translating, scaling, and rotating position encodings. These changes add little computational overhead, are easy to implement, and can yield large gains in data efficiency and overall improvements in object discovery. We evaluate our method on a wide range of synthetic object discovery benchmarks, namely CLEVR, Tetrominoes, CLEVRTex, Objects Room, and MultiShapeNet, and show promising improvements on the challenging real-world Waymo Open dataset.
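The idea of translating, scaling, and rotating position encodings per slot can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names `make_grid` and `slot_relative_grid` are hypothetical, each slot's pose (position, scale, rotation angle) is taken as given rather than estimated, and the scene is assumed to be 2D. The sketch only computes slot-relative coordinates; in a full model these would presumably be passed through a position-encoding function before being used in attention and decoding.

```python
import numpy as np


def make_grid(height, width):
    """Absolute grid coordinates in [-1, 1], shape (height * width, 2), ordered (x, y)."""
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([xx.ravel(), yy.ravel()], axis=-1)


def slot_relative_grid(abs_grid, slot_pos, slot_scale, slot_angle):
    """Express every grid point in every slot's own reference frame.

    Hypothetical helper (not from the paper). Shapes:
      abs_grid:   (N, 2) absolute coordinates
      slot_pos:   (K, 2) per-slot position
      slot_scale: (K, 2) per-slot scale along the slot's own axes
      slot_angle: (K,)   per-slot rotation angle in radians
    Returns an array of shape (K, N, 2).
    """
    cos, sin = np.cos(slot_angle), np.sin(slot_angle)
    # Per-slot rotation matrices R_k, shape (K, 2, 2).
    rot = np.stack(
        [np.stack([cos, -sin], axis=-1), np.stack([sin, cos], axis=-1)],
        axis=-2,
    )
    # Translate: centre the grid on each slot.
    centred = abs_grid[None, :, :] - slot_pos[:, None, :]
    # Rotate: apply the inverse rotation R_k^T to each centred point.
    unrotated = np.einsum("knj,kji->kni", centred, rot)
    # Scale: normalise by the slot's extent along each of its axes.
    return unrotated / slot_scale[:, None, :]
```

Under these assumptions, moving, rotating, or uniformly rescaling an object together with its slot frame leaves the slot-relative coordinates unchanged, which is the property that lets appearance be represented separately from pose.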