Visual scenes are extremely diverse, not only because there are infinite combinations of objects and backgrounds, but also because observations of the same scene may vary greatly as the viewpoint changes. When observing a visual scene that contains multiple objects from multiple viewpoints, humans are able to perceive the scene compositionally from each viewpoint while achieving so-called "object constancy" across viewpoints, even though the exact viewpoints are not given. This ability is essential for humans to identify the same object while moving and to learn from vision efficiently. It is intriguing to design models with a similar ability. In this paper, we consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without any supervision, and propose a deep generative model that separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem. To infer latent representations, the information contained in different viewpoints is iteratively integrated by neural networks. Experiments on several specifically designed synthetic datasets show that the proposed method is able to learn effectively from multiple unspecified viewpoints.
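To make the latent-separation idea concrete, the following is a minimal toy sketch, not the authors' implementation: viewpoint-independent object latents are shared across all views of a scene, each view receives its own viewpoint-dependent latent, and inference integrates the views one at a time. The class names, the GRU-based refinement step, and all dimensions are illustrative assumptions.

```python
# Toy sketch (assumption, not the paper's code) of separating latent representations
# into a viewpoint-independent part (per object) and a viewpoint-dependent part (per view).
import torch
import torch.nn as nn

class ViewpointDecoder(nn.Module):
    """Renders one view from object latents shared across views and a view-specific latent."""
    def __init__(self, obj_dim=32, view_dim=8, hidden=128, out_dim=3 * 16 * 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(obj_dim + view_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim + 1),  # per-object appearance plus a mask logit
        )

    def forward(self, obj_latents, view_latent):
        # obj_latents: [K, obj_dim] (viewpoint-independent, one per object)
        # view_latent: [view_dim]   (viewpoint-dependent, one per view)
        K = obj_latents.shape[0]
        inp = torch.cat([obj_latents, view_latent.expand(K, -1)], dim=-1)
        out = self.mlp(inp)
        appearance, mask_logit = out[:, :-1], out[:, -1:]
        weights = torch.softmax(mask_logit, dim=0)   # soft assignment of pixels to objects
        return (weights * appearance).sum(dim=0)     # composed view image (flattened)

class IterativeViewEncoder(nn.Module):
    """Iteratively refines the shared object latents by integrating one view at a time."""
    def __init__(self, obj_dim=32, in_dim=3 * 16 * 16):
        super().__init__()
        self.step = nn.GRUCell(in_dim, obj_dim)

    def forward(self, views, K=4):
        # views: [V, in_dim]; returns refined viewpoint-independent latents [K, obj_dim]
        obj_latents = torch.zeros(K, self.step.hidden_size)
        for v in views:                              # integrate viewpoints iteratively
            obj_latents = self.step(v.expand(K, -1), obj_latents)
        return obj_latents

# Usage: encode three views of a scene, then decode one view with its own view latent.
views = torch.rand(3, 3 * 16 * 16)
encoder, decoder = IterativeViewEncoder(), ViewpointDecoder()
obj_latents = encoder(views)
recon = decoder(obj_latents, torch.randn(8))
print(recon.shape)  # torch.Size([768])
```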