The ability to decompose scenes into their object components is a desired property for autonomous agents, allowing them to reason about and act in their surroundings. Recently, different methods have been proposed to learn object-centric representations from data in an unsupervised manner. These methods often rely on latent representations learned by deep neural networks, and hence require high computational costs and large amounts of curated data. Such models are also difficult to interpret. To address these challenges, we propose the Phase-Correlation Decomposition Network (PCDNet), a novel model that decomposes a scene into its object components, which are represented as transformed versions of a set of learned object prototypes. The core building block in PCDNet is the Phase-Correlation Cell (PC Cell), which exploits the frequency-domain representation of the images to estimate the transformation between an object prototype and its transformed version in the image. In our experiments, we show how PCDNet outperforms state-of-the-art methods for unsupervised object discovery and segmentation on simple benchmark datasets and on more challenging data, while using a small number of learnable parameters and being fully interpretable.
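The PC Cell builds on classical phase correlation: the relative shift between two images appears as a linear phase difference in the frequency domain, and the inverse transform of the normalized cross-power spectrum peaks at that shift. The following is a minimal NumPy sketch of this classical building block (not the paper's actual PC Cell implementation, which is learned end-to-end), recovering the translation between a prototype and a shifted copy of it:

```python
import numpy as np

def phase_correlation(prototype, image):
    """Estimate the (dy, dx) circular shift that maps `prototype` onto `image`."""
    F1 = np.fft.fft2(prototype)
    F2 = np.fft.fft2(image)
    # Normalized cross-power spectrum: keep only the phase difference.
    cross = F2 * np.conj(F1)
    cross /= np.abs(cross) + 1e-8
    # The inverse FFT of the phase difference peaks at the translation offset.
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    return int(dy), int(dx)

# Toy example: a square "prototype" and a circularly shifted copy of it.
proto = np.zeros((32, 32))
proto[4:9, 4:9] = 1.0
shifted = np.roll(proto, shift=(5, 7), axis=(0, 1))
print(phase_correlation(proto, shifted))  # (5, 7)
```

For an exact circular shift the recovered offset is exact; in PCDNet this frequency-domain matching is what lets the model localize each learned prototype in the scene without a heavy latent encoder.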