Images of general objects exhibit two common structures: (1) each object of a given shape can be rendered in many different views, and (2) object shapes can be categorized such that the diversity of shapes is much larger across categories than within a category. Existing deep generative models can typically capture one of these structures, but not both. In this work, we introduce a novel deep generative model, called CIGMO, that can learn to represent category, shape, and view factors from image data. The model consists of multiple modules of shape representations, each specialized to a particular category and disentangled from the view representation, and can be trained with a group-based weakly supervised learning method. Through empirical investigation, we show that our model can effectively discover categories of object shapes despite large view variation and quantitatively outperforms various previous methods, including a state-of-the-art invariant clustering algorithm. Further, we show that our category-specialization approach enhances the learned shape representation, improving performance on downstream tasks such as one-shot object identification as well as shape-view disentanglement.
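The architecture described above can be illustrated with a minimal sketch. This is not the paper's implementation: all dimensions, encoder forms, and names are hypothetical, and the encoders are plain random linear maps standing in for learned networks. The sketch shows the two key structural ideas: category-specialized shape modules alongside a single shared view encoder, and group-based weak supervision, where images known to show the same object share one shape code (averaged across the group) while each image keeps its own view code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not taken from the paper)
D_IMG, D_SHAPE, D_VIEW, N_CAT = 64, 8, 2, 3

# One shape encoder per category (category-specialized modules),
# plus a single view encoder shared across all categories.
shape_encoders = [rng.normal(size=(D_SHAPE, D_IMG)) / np.sqrt(D_IMG)
                  for _ in range(N_CAT)]
view_encoder = rng.normal(size=(D_VIEW, D_IMG)) / np.sqrt(D_IMG)

def encode_group(views, category):
    """Encode a group of images known to depict the same object.

    Group-based weak supervision: the shape is shared within the group,
    so per-view shape codes are averaged into a single group-level shape
    code, while view codes remain per-image -- keeping the view factor
    disentangled from shape.
    """
    per_view_shape = np.stack([shape_encoders[category] @ v for v in views])
    shape_code = per_view_shape.mean(axis=0)                  # one code per group
    view_codes = np.stack([view_encoder @ v for v in views])  # one code per view
    return shape_code, view_codes

# Usage: a group of 4 views of one object; in the full model the
# category would be inferred, here it is simply given.
group = rng.normal(size=(4, D_IMG))
shape_code, view_codes = encode_group(group, category=1)
```

In the full model, category assignment would itself be inferred (e.g., via mixture responsibilities over the modules) rather than supplied, which is what allows category discovery without labels.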