In this paper, we consider a block coordinate descent (BCD) algorithm for training deep neural networks and provide a new global convergence guarantee under strictly monotonically increasing activation functions. While existing works establish convergence of BCD to stationary points for neural network training, our contribution is the first to prove convergence to global minima, ensuring that the loss can be made arbitrarily small. We show that the loss with respect to the output layer decreases exponentially, while the loss with respect to the hidden layers remains well controlled. In addition, we derive generalization bounds via the Rademacher complexity framework, demonstrating that BCD not only enjoys strong optimization guarantees but also achieves favorable generalization performance. Moreover, we propose a modified BCD algorithm with skip connections and non-negative projection, extending our convergence guarantees to the ReLU activation, which is not strictly monotonic. Empirical experiments confirm our theoretical findings, showing that the BCD algorithm attains small loss under both strictly monotonic and ReLU activations.
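For intuition, the following is a minimal sketch of how BCD is commonly formulated for deep networks, written under assumed notation ($W_\ell$ for weights, auxiliary variables $V_\ell$ for layer outputs, penalty parameter $\gamma$); it is an illustrative splitting, not necessarily the exact formulation analyzed in this paper:
\[
\min_{\{W_\ell\},\,\{V_\ell\}} \;\; \mathcal{L}(V_L, Y) \;+\; \frac{\gamma}{2}\sum_{\ell=1}^{L} \bigl\| V_\ell - \sigma\!\bigl(W_\ell V_{\ell-1}\bigr) \bigr\|_F^2, \qquad V_0 = X,
\]
where $X$ denotes the inputs, $Y$ the targets, and $\sigma$ the activation function. A BCD iteration then cycles through the blocks, minimizing over a single $W_\ell$ or $V_\ell$ while all other blocks are held fixed, so that each subproblem is low-dimensional and often admits a simple (or closed-form) update.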