The appearance of the same object may vary across scene images due to differences in perspective and occlusions between objects. Humans can easily identify the same object, even when occlusions exist, by completing the occluded parts based on its canonical image in memory. Achieving this ability remains a challenge for machine learning, especially in the unsupervised setting. Inspired by this human ability, this paper proposes a compositional scene modeling method to infer global representations of canonical images of objects without any supervision. The representation of each object is divided into an intrinsic part, which characterizes globally invariant information (i.e., the canonical representation of an object), and an extrinsic part, which characterizes scene-dependent information (e.g., position and size). To infer the intrinsic representation of each object, we employ a patch-matching strategy to align the representation of a potentially occluded object with the canonical representations of objects, and sample the most probable canonical representation based on the category of the object determined by amortized variational inference. Extensive experiments are conducted on four object-centric learning benchmarks, and the results demonstrate that the proposed method not only outperforms state-of-the-art methods in terms of segmentation and reconstruction, but also achieves good global object identification performance.