While contrastive approaches to self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point (positive pairs) and maximizing the distance between views from different data points (negative pairs), recent \emph{non-contrastive} SSL methods (e.g., BYOL and SimSiam) show remarkable performance {\it without} negative pairs, relying instead on an extra learnable predictor and a stop-gradient operation. A fundamental question arises: why do these methods not collapse into trivial representations? We answer this question via a simple theoretical study and propose a novel approach, DirectPred, that \emph{directly} sets the linear predictor based on the statistics of its inputs, without gradient training. On ImageNet, it performs comparably to more complex two-layer non-linear predictors that employ BatchNorm and outperforms a linear predictor by $2.5\%$ in 300-epoch training (and by $5\%$ in 60-epoch training). DirectPred is motivated by our theoretical study of the nonlinear learning dynamics of non-contrastive SSL in simple linear networks. Our study yields conceptual insights into how non-contrastive SSL methods learn, how they avoid representational collapse, and how multiple factors, such as predictor networks, stop-gradients, exponential moving averages, and weight decay, all come into play. Our simple theory recapitulates the results of real-world ablation studies on both STL-10 and ImageNet. Code is released\footnote{\url{https://github.com/facebookresearch/luckmatters/tree/master/ssl}}.
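As a rough illustration of the idea (not the released implementation), the following PyTorch-style sketch sets a linear predictor directly from an exponential moving average of the correlation of its inputs, via an eigendecomposition. The class name \texttt{DirectPredictor}, the EMA rate \texttt{rho}, the eigenvalue boost \texttt{eps}, and the exact eigenvalue transform (roughly a normalized matrix square root) are illustrative assumptions; the released code linked above is authoritative.
\begin{verbatim}
# Minimal sketch of a DirectPred-style predictor update (assumptions noted
# in the text above). The predictor weight is set directly from the input
# statistics; it is never trained by gradient descent.
import torch

class DirectPredictor:
    def __init__(self, dim, rho=0.3, eps=0.1):
        self.corr = torch.eye(dim)  # running correlation of predictor inputs
        self.rho = rho              # EMA rate for the correlation estimate
        self.eps = eps              # small boost to avoid vanishing eigenvalues
        self.w = torch.eye(dim)     # linear predictor weight (set, not trained)

    @torch.no_grad()
    def update(self, f):
        # f: [batch, dim] outputs of the online projector (predictor inputs)
        batch_corr = f.t() @ f / f.shape[0]
        self.corr = (1 - self.rho) * self.corr + self.rho * batch_corr
        # Eigendecompose the symmetric correlation and set W_p = U diag(p) U^T,
        # with p_i proportional to sqrt(eigenvalue), normalized by the largest.
        s, u = torch.linalg.eigh(self.corr)
        s = s.clamp(min=0.0)
        p = (s / s.max()).sqrt() + self.eps
        self.w = u @ torch.diag(p) @ u.t()

    def forward(self, f):
        # Apply the directly-set linear predictor to the online features.
        return f @ self.w.t()
\end{verbatim}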