While the empirical success of self-supervised learning (SSL) relies heavily on deep nonlinear models, many theoretical works proposed to understand SSL still focus on linear ones. In this paper, we study the role of nonlinearity in the training dynamics of contrastive learning (CL) on one- and two-layer nonlinear networks with homogeneous activation $h(x) = h'(x)x$. We theoretically demonstrate that (1) the presence of nonlinearity leads to many local optima even in the 1-layer setting, each corresponding to certain patterns in the data distribution, whereas with linear activation only one major pattern can be learned; and (2) nonlinearity leads weights to specialize into diverse patterns, a behavior that linear activation provably cannot achieve. These findings suggest that models with a large number of parameters can be regarded as a \emph{brute-force} way to find the local optima induced by nonlinearity, a possible underlying reason why empirical observations such as the lottery ticket hypothesis hold. In addition, in the 2-layer setting, we also discover \emph{global modulation}: local patterns that are discriminative from the perspective of global-level patterns are prioritized during learning, further characterizing the learning process. Simulations verify our theoretical findings.
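As a minimal worked example of the homogeneity condition (our illustration; the abstract itself does not single out a specific activation), the ReLU satisfies $h(x) = h'(x)x$, and the leaky ReLU does so for the same reason:
\[
  h(x) = \max(x, 0), \qquad
  h'(x) = \mathbb{1}[x > 0] \ \ (\text{with the convention } h'(0) = 0), \qquad
  h'(x)\,x = \mathbb{1}[x > 0]\,x = \max(x, 0) = h(x).
\]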