Implicit deep learning has recently become popular in the machine learning community because implicit models can achieve performance competitive with state-of-the-art deep networks while using significantly less memory and computation. However, our theoretical understanding of when and how first-order methods such as gradient descent (GD) converge on \textit{nonlinear} implicit networks is limited. Although this type of problem has been studied for standard feed-forward networks, the implicit setting remains intriguing because implicit networks have \textit{infinitely} many layers: the corresponding equilibrium equation may admit no solution, or multiple solutions, during training. This paper studies the convergence of both gradient flow (GF) and gradient descent for nonlinear ReLU-activated implicit networks. To address the well-posedness issue, we introduce a fixed scalar that scales the weight matrix of the implicit layer and show that a sufficiently small scaling constant keeps the equilibrium equation well-posed throughout training. As a result, we prove that both GF and GD converge to a global minimum at a linear rate, provided the width $m$ of the implicit network is \textit{linear} in the sample size $N$, i.e., $m=\Omega(N)$.
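As a concrete illustration (a minimal sketch whose notation, $z$, $A$, $U$, $x$, $\gamma$, is assumed here rather than taken from the paper), a scaled ReLU equilibrium equation can be written as a fixed-point problem, and a standard contraction argument indicates why a sufficiently small scaling constant keeps it well-posed:
% Sketch only: symbols are illustrative and need not match the paper's notation.
\begin{equation*}
  z \;=\; \sigma\bigl(\gamma A z + U x\bigr), \qquad \sigma(\cdot) = \max(\cdot, 0).
\end{equation*}
% Since ReLU is 1-Lipschitz, the map z -> sigma(gamma A z + U x) is a contraction
% whenever gamma * ||A||_2 < 1, so Banach's fixed-point theorem yields a unique
% equilibrium z for every input x, i.e., the equation is well-posed.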