While the empirical success of self-supervised learning (SSL) heavily relies on the usage of deep nonlinear models, existing theoretical works on SSL understanding still focus on linear ones. In this paper, we study the role of nonlinearity in the training dynamics of contrastive learning (CL) on one- and two-layer nonlinear networks with homogeneous activation $h(x) = h'(x)x$. We have two major theoretical discoveries. First, the presence of nonlinearity can lead to many local optima even in the 1-layer setting, each corresponding to certain patterns from the data distribution, while with linear activation, only one major pattern can be learned. This suggests that models with many parameters can be regarded as a \emph{brute-force} way to find these local optima induced by nonlinearity. Second, in the 2-layer case, linear activation is proven incapable of learning weights specialized to diverse patterns, demonstrating the importance of nonlinearity. In addition, for the 2-layer setting, we also discover \emph{global modulation}: local patterns that are discriminative from the perspective of global-level patterns are prioritized for learning, further characterizing the learning process. Simulations verify our theoretical findings.
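The homogeneity condition $h(x) = h'(x)x$ covers activations such as ReLU and leaky ReLU but excludes, e.g., the sigmoid. A minimal sketch (not from the paper; the function names are illustrative) checking this identity numerically:

```python
import numpy as np

# Check the homogeneity property h(x) = h'(x) * x on a grid of inputs.
# ReLU and leaky ReLU satisfy it; sigmoid does not.

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient convention: h'(0) = 0 (the identity still holds at x = 0).
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.1):
    return np.where(x > 0, 1.0, alpha)

x = np.linspace(-3.0, 3.0, 101)
assert np.allclose(relu(x), relu_grad(x) * x)              # homogeneous
assert np.allclose(leaky_relu(x), leaky_relu_grad(x) * x)  # homogeneous

sigmoid = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sigmoid * (1.0 - sigmoid)
assert not np.allclose(sigmoid, sigmoid_grad * x)          # not homogeneous
```

Homogeneity lets the activation be absorbed into an input-dependent gating term, which is what makes the training dynamics in the paper tractable.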