Self-supervised representation learning has shown remarkable success in a number of domains. A common practice is to perform data augmentation via hand-crafted transformations intended to leave the semantics of the data invariant. We seek to understand the empirical success of this approach from a theoretical perspective. We formulate the augmentation process as a latent variable model by postulating a partition of the latent representation into a content component, which is assumed invariant to augmentation, and a style component, which is allowed to change. Unlike prior work on disentanglement and independent component analysis, we allow for both nontrivial statistical and causal dependencies in the latent space. We study the identifiability of the latent representation based on pairs of views of the observations and prove sufficient conditions that allow us to identify the invariant content partition up to an invertible mapping in both generative and discriminative settings. We find numerical simulations with dependent latent variables are consistent with our theory. Lastly, we introduce Causal3DIdent, a dataset of high-dimensional, visually complex images with rich causal dependencies, which we use to study the effect of data augmentations performed in practice.
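To make the latent variable formulation concrete, the following is a minimal sketch of the generative process described above; the symbols z, c, s, f, and the style-perturbation conditional p(s̃ | s) are our own illustrative notation, not taken verbatim from the abstract, which only states that the latent representation is partitioned into an augmentation-invariant content component and a mutable style component.

\begin{align}
  z = (c, s) \sim p_z, \qquad x = f(z), \qquad
  \tilde{s} \sim p(\tilde{s} \mid s), \qquad \tilde{x} = f(c, \tilde{s}).
\end{align}

Under this reading, the pair (x, x̃) constitutes the two augmented views sharing the same content c, and identifying the invariant content partition means recovering c from the observations up to an invertible mapping.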