Object-Centric Slot Diffusion

Despite remarkable recent advances, making object-centric learning work for complex natural scenes remains the main challenge. The recent success of adopting transformer-based image generative models in object-centric learning suggests that a highly expressive image generator is crucial for dealing with complex scenes. Inspired by this observation, this paper asks: can object-centric learning benefit from the other pillar of modern deep generative modeling, diffusion models, and what are the pros and cons of such an approach? To this end, we propose a new object-centric learning model, Latent Slot Diffusion (LSD). LSD can be seen from two perspectives. From the perspective of object-centric learning, it replaces the conventional slot decoder with a latent diffusion model conditioned on the object slots. From the perspective of diffusion models, it is the first unsupervised compositional conditional diffusion model: unlike traditional diffusion models, it does not require supervised annotations such as text descriptions to learn to compose. In experiments on various object-centric tasks, including the FFHQ dataset for the first time in this line of research, we demonstrate that LSD significantly outperforms the state-of-the-art transformer-based decoder, particularly when the scene is more complex. We also show superior quality in unsupervised compositional generation.
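To make the idea of a "latent diffusion model conditioned on the object slots" more concrete, here is a minimal sketch of a slot-conditioned noise-prediction loss. It is not the authors' implementation: the function names (`cross_attend`, `add_noise`, `toy_denoiser`, `denoising_loss`), the toy denoiser, the scalar weight `w`, and all shapes are assumptions made for illustration, and the learned projections and network of a real model are omitted. It assumes the standard noise-prediction objective used by diffusion models, with the latent tokens attending over the slots through cross-attention.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(tokens, slots):
    """Each latent token (query) attends over the slots (keys and values).
    Learned query/key/value projections are omitted, so token and slot
    dimensions must match in this toy version."""
    d = len(slots[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in slots]
        w = softmax(scores)
        out.append([sum(w[j] * slots[j][i] for j in range(len(slots))) for i in range(d)])
    return out

def add_noise(z0, alpha_bar, rng):
    """Forward diffusion: z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in z0]
    zt = [[math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
           for z, e in zip(z_row, e_row)]
          for z_row, e_row in zip(z0, eps)]
    return zt, eps

def toy_denoiser(zt, slots, w):
    """Hypothetical stand-in for the slot-conditioned denoising network:
    the noisy token plus a scaled slot context from cross-attention."""
    ctx = cross_attend(zt, slots)
    return [[z + w * c for z, c in zip(z_row, c_row)] for z_row, c_row in zip(zt, ctx)]

def denoising_loss(z0, slots, alpha_bar, w, rng):
    """Mean squared error between the true noise and the predicted noise."""
    zt, eps = add_noise(z0, alpha_bar, rng)
    eps_hat = toy_denoiser(zt, slots, w)
    sq = [(a - b) ** 2 for a_row, b_row in zip(eps_hat, eps) for a, b in zip(a_row, b_row)]
    return sum(sq) / len(sq)

rng = random.Random(0)
z0 = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]     # 6 latent tokens, dim 4
slots = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]  # 3 object slots, dim 4
loss = denoising_loss(z0, slots, alpha_bar=0.5, w=0.1, rng=rng)
print(loss)
```

In a real system the slots would come from an object-centric encoder and the denoiser would be a trained network operating on the latents of a pretrained autoencoder. The sketch only shows where the slots enter the objective: they are the sole conditioning signal, with no text or other supervised annotation involved.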

