A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder. We believe this work will not only accelerate future architecture exploration and scaling efforts, but it will also serve as a useful tool for both object-centric as well as neural scene representation learning communities.