Contrastive approaches to self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point (positive pairs) and maximizing the distance between views of different data points (negative pairs). However, recent approaches like BYOL and SimSiam show remarkable performance {\it without} negative pairs, raising a fundamental theoretical question: how can SSL with only positive pairs avoid representational collapse? We study the nonlinear learning dynamics of non-contrastive SSL in simple linear networks. Our analysis yields conceptual insights into how non-contrastive SSL methods learn, how they avoid representational collapse, and how multiple factors, like predictor networks, stop-gradients, exponential moving averages, and weight decay, all come into play. Our simple theory recapitulates the results of real-world ablation studies on both STL-10 and ImageNet. Furthermore, motivated by our theory, we propose a novel approach that \emph{directly} sets the predictor based on the statistics of its inputs. In the case of linear predictors, our approach outperforms gradient training of the predictor by $5\%$, and on ImageNet it performs comparably with more complex two-layer non-linear predictors that employ BatchNorm. Code is released at https://github.com/facebookresearch/luckmatters/tree/master/ssl.
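To make the last idea concrete, the sketch below illustrates one plausible way a linear predictor could be set \emph{directly} from the statistics of its inputs rather than trained by gradient descent: the predictor weight is taken to be a regularized matrix square root of an exponential-moving-average estimate of the input correlation matrix, so that it shares the correlation matrix's eigenbasis. The function name \texttt{direct\_set\_predictor} and the hyperparameters \texttt{rho} and \texttt{eps} are illustrative assumptions, not the released implementation.

\begin{verbatim}
import torch

def direct_set_predictor(h, corr_ema, rho=0.3, eps=0.1):
    """
    Hypothetical sketch: set a linear predictor directly from the
    statistics of its inputs instead of training it by gradient descent.

    h        : (batch, d) online-network outputs feeding the predictor
    corr_ema : (d, d) running estimate of the input correlation matrix
    rho      : EMA rate for updating corr_ema (assumed hyperparameter)
    eps      : eigenvalue regularization (assumed hyperparameter)
    """
    # Update the EMA estimate of the input correlation matrix F = E[h h^T].
    corr_batch = h.t() @ h / h.shape[0]
    corr_ema = rho * corr_batch + (1 - rho) * corr_ema

    # Eigen-decompose F and set the predictor to a symmetric function of it
    # (here a regularized matrix square root), so W_p aligns with F's eigenbasis.
    s, U = torch.linalg.eigh(corr_ema)
    s = s.clamp(min=0.0)
    p = s.sqrt() + eps * s.max().sqrt()
    W_p = U @ torch.diag(p) @ U.t()
    return W_p, corr_ema
\end{verbatim}

In such a scheme, \texttt{W\_p} would be recomputed at each training step and used as the (frozen) predictor weight for that step, while the rest of the network is trained as usual.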