Visual scenes are composed of visual concepts and exhibit combinatorial explosion. An important reason humans can learn efficiently from diverse visual scenes is their ability for compositional perception, and it is desirable for artificial intelligence to possess similar abilities. Compositional scene representation learning is a task that enables such abilities. In recent years, various methods have been proposed to apply deep neural networks, which have proven advantageous in representation learning, to learn compositional scene representations via reconstruction, advancing this research direction into the deep learning era. Learning via reconstruction is advantageous because it can exploit massive unlabeled data and avoid costly and laborious data annotation. In this survey, we first outline the current progress on reconstruction-based compositional scene representation learning with deep neural networks, including the development history and categorizations of existing methods from the perspectives of visual scene modeling and scene representation inference; then provide benchmarks, including an open-source toolbox for reproducing the benchmark experiments, of representative methods that consider the most extensively studied problem setting and form the foundation for other methods; and finally discuss the limitations of existing methods and future directions of this research topic.