With the recent successful adaptation of transformers to the vision domain, particularly when trained in a self-supervised fashion, vision transformers have been shown to learn impressive object-reasoning-like behaviour and features that are expressive for the task of object segmentation in images. In this paper, we build on the self-supervision task of masked autoencoding and explore its effectiveness for explicitly learning object-centric representations with transformers. To this end, we design an object-centric autoencoder using only transformers and train it end-to-end to reconstruct full images from unmasked patches. We show that the model efficiently learns to decompose simple scenes, as measured by segmentation metrics on several multi-object benchmarks.
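As context for the masked-autoencoding setup the abstract builds on, the sketch below shows the standard MAE-style input pipeline: an image is split into non-overlapping patches and a random subset is masked, so that only the unmasked patches are fed to the encoder while the decoder must reconstruct the full image. This is a minimal illustration, not the paper's code; `patch_size`, `mask_ratio`, and the function names are assumptions chosen for clarity.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C)
    # Group the two grid axes together, then flatten each patch to a vector.
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches

def random_mask(num_patches, mask_ratio, rng):
    """Return sorted indices of the visible (unmasked) patches."""
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep])

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
patches = patchify(image, patch_size=8)           # 16 patches, each of dim 8*8*3 = 192
visible = random_mask(len(patches), mask_ratio=0.75, rng=rng)
encoder_input = patches[visible]                  # only unmasked patches reach the encoder
```

With a 75% mask ratio, only a quarter of the patch tokens are processed by the encoder; the reconstruction target remains the full image, which is what forces the learned representation to capture scene structure.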