Sensory data often comprise independent content and transformation factors. For example, face images may have face shape as the content and pose as the transformation. To infer these factors separately from given data, various ``disentangling'' models have been proposed. However, many of these are supervised or semi-supervised, either requiring attribute labels that are often unavailable or precluding generalization to new contents. In this study, we introduce a novel deep generative model called the group-based variational autoencoder. Here, we assume no explicit labels but a weaker form of structure that groups together data instances having the same content but transformed differently; we thereby separately estimate a group-common factor as the content and an instance-specific factor as the transformation. This approach allows the model to learn a general continuous space of contents, which can accommodate unseen contents. Despite its simplicity, our model succeeded in learning, from five datasets, content representations that are highly separated from the transformation representations and generalizable to data with novel contents. We further provide a detailed analysis of the latent content code and offer insight into how our model attains its notable transformation invariance and content generalizability.
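The following is a minimal, illustrative PyTorch-style sketch of the grouping idea described above: each instance in a group is encoded into an instance-specific transformation posterior, while the content posterior is aggregated over the group into a single group-common latent. The layer sizes, the mean-based aggregation, and all names here are assumptions made for illustration, not the paper's exact architecture or objective.

```python
import torch
import torch.nn as nn

class GroupVAE(nn.Module):
    """Illustrative sketch (not the paper's exact model): a group-common
    'content' latent shared by all instances in a group plus an
    instance-specific 'transformation' latent."""

    def __init__(self, x_dim=784, c_dim=16, t_dim=8, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        # Gaussian posterior heads for each latent factor (assumed sizes)
        self.c_mu, self.c_logvar = nn.Linear(h_dim, c_dim), nn.Linear(h_dim, c_dim)
        self.t_mu, self.t_logvar = nn.Linear(h_dim, t_dim), nn.Linear(h_dim, t_dim)
        self.dec = nn.Sequential(nn.Linear(c_dim + t_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    @staticmethod
    def reparam(mu, logvar):
        # standard reparameterization trick
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, x_group):
        # x_group: (K, x_dim) -- K differently transformed instances of one content
        h = self.enc(x_group)
        # group-common content: aggregate per-instance statistics by averaging
        # (averaging is an assumption; the paper may aggregate differently)
        c_mu = self.c_mu(h).mean(0, keepdim=True)
        c_logvar = self.c_logvar(h).mean(0, keepdim=True)
        c = self.reparam(c_mu, c_logvar).expand(x_group.size(0), -1)
        # instance-specific transformation latent
        t_mu, t_logvar = self.t_mu(h), self.t_logvar(h)
        t = self.reparam(t_mu, t_logvar)
        recon = self.dec(torch.cat([c, t], dim=1))
        return recon, (c_mu, c_logvar), (t_mu, t_logvar)

if __name__ == "__main__":
    model = GroupVAE()
    recon, content_post, transform_post = model(torch.randn(5, 784))  # a group of 5 instances
    print(recon.shape)  # torch.Size([5, 784])
```

In such a sketch, the ELBO would combine per-instance reconstruction terms with KL penalties on the shared content posterior and each transformation posterior; those loss terms are omitted here for brevity.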