Visual scene representation learning is an important research problem in the field of computer vision. The performance of artificial intelligence systems on vision tasks can be improved by learning more suitable representations of visual scenes. Complex visual scenes are composed of relatively simple visual concepts and exhibit combinatorial explosion. Compared with directly representing the entire visual scene, extracting compositional scene representations can better cope with the diverse combinations of background and objects. Because compositional scene representations abstract the concept of objects, performing visual scene analysis and understanding based on these representations can be easier and more interpretable. Moreover, learning via reconstruction greatly reduces the need for annotated training data. Therefore, reconstruction-based compositional scene representation learning has important research significance. In this survey, we first outline the current progress on this research topic, including its development history and categorizations of existing methods from the perspectives of modeling visual scenes and inferring scene representations; we then provide benchmarks, including an open-source toolbox for reproducing the benchmark experiments, of representative methods that consider the most extensively studied problem setting and form the foundation for other methods; and we finally discuss future directions of this research topic.