We propose a novel theoretical framework to understand self-supervised learning methods that employ dual pairs of deep ReLU networks (e.g., SimCLR, BYOL). First, we prove that in each SGD update of SimCLR with various loss functions (the simple contrastive loss, the soft triplet loss, and the InfoNCE loss), the weights at each layer are updated by a \emph{covariance operator} that specifically amplifies initial random selectivities that vary across data samples but survive averages over data augmentations. We show that this leads to the emergence of hierarchical features if the input data are generated from a hierarchical latent tree model. With the same framework, we also show analytically that in BYOL, the combination of BatchNorm and a predictor network creates an implicit contrastive term that acts as an approximate covariance operator. Additionally, for linear architectures we derive exact solutions for BYOL, providing conceptual insight into how it can learn useful, non-collapsed representations without any contrastive term that separates negative pairs. Extensive ablation studies corroborate our theoretical findings.
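As a toy illustration of the covariance-operator view (not the paper's general multi-layer ReLU result), the following numpy sketch considers a single linear layer $f(x) = Wx$, an additive-noise augmentation model $x = \bar{x}(z) + \varepsilon$, and a simple contrastive loss of the form $\|f(x_1)-f(x_2)\|^2 - \|f(x_1)-f(x_3)\|^2$, where $(x_1, x_2)$ are augmentations of one sample and $x_3$ of another. Under these assumptions (all dimensions, noise scales, and variable names are illustrative), the expected SGD update reduces to $W$ times the covariance, over samples, of the augmentation-averaged inputs, so only variation that survives augmentation averaging drives the update.

\begin{verbatim}
# Minimal numpy sketch (toy, single linear layer) of the covariance-operator
# effect: -E[dL/dW] ~= 4 * W @ Cov_z[x_bar] for the simple contrastive loss
#   L = ||W x1 - W x2||^2 - ||W x1 - W x3||^2.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
n_samples, n_trials = 64, 100_000

W = rng.normal(size=(d_out, d_in))          # current weights
x_bar = rng.normal(size=(n_samples, d_in))  # per-sample "clean" inputs
sigma_aug = 0.3                             # augmentation noise scale

def augment(z):
    """One random augmentation of sample z (additive-noise toy model)."""
    return x_bar[z] + sigma_aug * rng.normal(size=d_in)

# Monte-Carlo estimate of the expected gradient E[dL/dW].
grad = np.zeros_like(W)
for _ in range(n_trials):
    z_pos, z_neg = rng.choice(n_samples, size=2, replace=False)
    x1, x2, x3 = augment(z_pos), augment(z_pos), augment(z_neg)
    dpos, dneg = x1 - x2, x1 - x3
    # dL/dW = 2 W [(x1-x2)(x1-x2)^T - (x1-x3)(x1-x3)^T]
    grad += 2.0 * W @ (np.outer(dpos, dpos) - np.outer(dneg, dneg))
grad /= n_trials

# Covariance (over samples) of the augmentation-averaged inputs.
cov_data = np.cov(x_bar, rowvar=False)

# Augmentation noise cancels between positive and negative pairs, leaving
# only the cross-sample covariance: -E[dL/dW] ~= 4 W Cov_z[x_bar].
target = 4.0 * W @ cov_data
err = np.linalg.norm(-grad - target) / np.linalg.norm(target)
print(f"relative error: {err:.3f}")  # small => update ~ covariance operator
\end{verbatim}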