We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies to smoothed approximations of the ReLU, such as Swish and the Huberized ReLU, which have been proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses.
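For concreteness, one standard parameterization of the two smoothed activations named above is the following sketch; the scale parameters $\beta$ and $\delta$ are illustrative, and the exact definitions used in the analysis may differ.
\[
  \mathrm{swish}_\beta(x) = \frac{x}{1 + e^{-\beta x}},
  \qquad
  \sigma_\delta(x) =
  \begin{cases}
    0 & x \le 0,\\
    x^2/(2\delta) & 0 < x \le \delta,\\
    x - \delta/2 & x > \delta.
  \end{cases}
\]
Both are differentiable approximations of $\max(0,x)$ that recover the ReLU in the limits $\beta \to \infty$ and $\delta \to 0$, respectively.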