Using an extended and formalized version of the Q/C map analysis of Poole et al. (2016), along with Neural Tangent Kernel theory, we identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data, and show how these can be avoided by carefully controlling the "shape" of the network's initialization-time kernel function. We then develop a method called Deep Kernel Shaping (DKS), which accomplishes this using a combination of precise parameter initialization, activation function transformations, and small architectural tweaks, all of which preserve the model class. In our experiments we show that DKS enables SGD training of residual networks without normalization layers on Imagenet and CIFAR-10 classification tasks at speeds comparable to standard ResNetV2 and Wide-ResNet models, with only a small decrease in generalization performance. And when using K-FAC as the optimizer, we achieve similar results for networks without skip connections. Our results apply for a large variety of activation functions, including those which traditionally perform very badly, such as the logistic sigmoid. In addition to DKS, we contribute a detailed analysis of skip connections, normalization layers, special activation functions like RELU and SELU, and various initialization schemes, explaining their effectiveness as alternative (and ultimately incomplete) ways of "shaping" the network's initialization-time kernel.
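As a rough illustration of the kind of activation function transformation the abstract refers to, the sketch below applies a parameterized rescaling of a base activation, phi_hat(x) = gamma * (phi(alpha * x + beta) + delta), and Monte Carlo estimates a simple initialization-time kernel statistic (a local "Q map" value). The functional form, the target value Q(1) = 1, and the estimation procedure are assumptions made for illustration only, not the paper's reference implementation.

```python
# Minimal sketch (assumed form, not the paper's implementation) of shaping an
# activation function so that a simple initialization-time kernel statistic
# hits a prescribed target.
import numpy as np


def transformed_activation(phi, alpha, beta, gamma, delta):
    """Return phi_hat(x) = gamma * (phi(alpha * x + beta) + delta)."""
    return lambda x: gamma * (phi(alpha * x + beta) + delta)


def estimate_q(phi_hat, q_in=1.0, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of a local Q map value: E[phi_hat(z)^2], z ~ N(0, q_in)."""
    z = np.random.default_rng(seed).normal(0.0, np.sqrt(q_in), n_samples)
    return np.mean(phi_hat(z) ** 2)


# Example: the untransformed logistic sigmoid is far from the (assumed) target
# Q(1) = 1, which is one symptom of the kernel degenerating with depth.
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
plain = transformed_activation(sigmoid, 1.0, 0.0, 1.0, 0.0)
print("Q(1) for plain sigmoid:", estimate_q(plain))
```

In practice the constants alpha, beta, gamma, delta would be solved for numerically so that several such kernel statistics simultaneously match their targets; the single statistic shown here is only meant to convey the flavor of the approach.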