A good visual representation is an inference map from observations (images) to features (vectors) that faithfully reflects the hidden modularized generative factors (semantics). In this paper, we formulate the notion of a "good" representation from a group-theoretic view using Higgins' definition of disentangled representation, and show that existing Self-Supervised Learning (SSL) only disentangles simple augmentation features such as rotation and colorization, and is thus unable to modularize the remaining semantics. To break this limitation, we propose an iterative SSL algorithm: Iterative Partition-based Invariant Risk Minimization (IP-IRM), which successfully grounds the abstract semantics and the group acting on them in concrete contrastive learning. At each iteration, IP-IRM first partitions the training samples into two subsets that correspond to an entangled group element. Then, it minimizes a subset-invariant contrastive loss, where the invariance guarantees the disentanglement of that group element. We prove that IP-IRM converges to a fully disentangled representation and show its effectiveness on various benchmarks. Code is available at https://github.com/Wangt-CN/IP-IRM.
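To make the two-step iteration concrete, the sketch below shows what a single encoder update under a set of already-discovered partitions could look like, assuming a SimCLR-style InfoNCE contrastive loss and an IRM-v1-style gradient penalty as the subset-invariance term. All names here (`contrastive_loss`, `ip_irm_step`, `encoder`, `partitions`) are illustrative assumptions, not the authors' API; see the linked repository for the reference implementation.

```python
# Minimal sketch of one IP-IRM encoder update (assumptions: SimCLR-style
# InfoNCE loss, IRM-v1 gradient penalty; helper names are hypothetical).
import torch
import torch.nn.functional as F


def contrastive_loss(z1, z2, dummy_scale=1.0, temperature=0.5):
    """InfoNCE between two augmented views; `dummy_scale` plays the role of
    the IRM dummy classifier multiplying the logits."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = (z1 @ z2.t()) * dummy_scale / temperature   # (N, N) similarities
    labels = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)


def ip_irm_step(encoder, view1, view2, sample_idx, partitions, lam=1.0):
    """Subset-invariant contrastive loss: treat the two subsets of every
    discovered partition as IRM 'environments' and penalize the gradient
    w.r.t. a dummy scale, so the contrastive loss is simultaneously
    optimal on both subsets."""
    z1, z2 = encoder(view1), encoder(view2)
    total = z1.new_zeros(())
    for part in partitions:              # each `part`: a {0,1} label per training sample
        subset_of = part[sample_idx]     # subset membership of each batch item
        for k in (0, 1):
            sel = subset_of == k
            if sel.sum() < 2:            # need at least one negative pair
                continue
            scale = torch.ones((), device=z1.device, requires_grad=True)
            loss_k = contrastive_loss(z1[sel], z2[sel], dummy_scale=scale)
            grad = torch.autograd.grad(loss_k, scale, create_graph=True)[0]
            total = total + loss_k + lam * grad.pow(2)   # loss + invariance penalty
    return total
```

The partition-search step that opens each outer iteration (choosing a new two-subset split corresponding to an entangled group element) is left out of this sketch; in the full algorithm it is obtained by optimizing the subset assignment against the same kind of objective before the encoder is updated again.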