The ability to decompose complex natural scenes into meaningful object-centric abstractions lies at the core of human perception and reasoning. In recent advances in unsupervised object-centric learning, the Slot-Attention module has played an important role with its simple yet effective design, and it has fostered many powerful variants. These methods, however, are exceedingly difficult to train without supervision and leave the notion of an object ambiguous, especially in complex natural scenes. In this paper, we propose to address these issues by investigating the potential of learnable queries as initializations for Slot-Attention learning, uniting this with existing efforts to improve Slot-Attention learning through bi-level optimization. With simple code adjustments to Slot-Attention, our model, Bi-level Optimized Query Slot Attention (BO-QSA), achieves state-of-the-art results on 3 challenging synthetic and 7 complex real-world datasets in unsupervised image segmentation and reconstruction, outperforming previous baselines by a large margin. We provide thorough ablative studies to validate the necessity and effectiveness of our design. Additionally, our model exhibits great potential for concept binding and zero-shot learning. Our work is made publicly available at https://bo-qsa.github.io
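To make the query-initialization idea concrete, here is a minimal NumPy sketch of a Slot-Attention step where the initial slots are learnable parameters (queries) rather than samples from a learned Gaussian, which is the core change the abstract describes. All names, shapes, and the simplified update (no GRU/MLP refinement, no bi-level/straight-through training step) are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class QuerySlotAttention:
    """Sketch of Slot-Attention with learnable query initialization.

    In the original Slot-Attention, initial slots are sampled from a
    learned Gaussian; here `init_slots` is itself a parameter (the
    "learnable queries"). BO-QSA additionally trains these queries with
    a bi-level scheme (straight-through gradients past the iterative
    refinement), which is omitted in this forward-only sketch.
    """

    def __init__(self, num_slots, dim, iters=3, seed=0):
        rng = np.random.default_rng(seed)
        self.iters = iters
        self.scale = dim ** -0.5
        # Learnable queries replacing the random Gaussian slot init.
        self.init_slots = rng.normal(size=(num_slots, dim))
        # Placeholder projection weights (would be trained in practice).
        self.Wq = rng.normal(size=(dim, dim)) * self.scale
        self.Wk = rng.normal(size=(dim, dim)) * self.scale
        self.Wv = rng.normal(size=(dim, dim)) * self.scale

    def __call__(self, inputs):
        # inputs: (n_tokens, dim) encoder features for one image.
        k = inputs @ self.Wk
        v = inputs @ self.Wv
        slots = self.init_slots.copy()
        for _ in range(self.iters):
            q = slots @ self.Wq
            # Softmax over slots: slots compete for input tokens.
            attn = softmax(k @ q.T * self.scale, axis=1)  # (n_tokens, n_slots)
            # Weighted-mean aggregation of inputs per slot.
            attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
            slots = attn.T @ v
        return slots  # (n_slots, dim)
```

Because the softmax is normalized over the slot axis, each input token distributes its "vote" across slots, which is what drives the object-centric decomposition the abstract refers to.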