Visual scene representation learning is an important research problem in computer vision. Performance on vision tasks can improve when more suitable representations are learned for visual scenes. Complex visual scenes are composed of relatively simple visual concepts, so the number of possible scenes grows combinatorially with the number of concepts. Compared with directly representing the entire visual scene, extracting compositional scene representations copes better with the diverse combinations of backgrounds and objects. Because compositional scene representations abstract the concept of objects, visual scene analysis and understanding based on these representations can be easier and more interpretable. Moreover, learning compositional scene representations via reconstruction can greatly reduce the need for annotated training data. Compositional scene representation learning via reconstruction is therefore a significant research topic. In this survey, we first discuss representative methods that learn from either a single viewpoint or multiple viewpoints without object-level supervision, then the applications of compositional scene representations, and finally future directions on this topic.