A wide variety of deep learning techniques, from style transfer to multitask learning, rely on training affine transformations of features. Most prominent among these is the popular feature normalization technique BatchNorm, which normalizes activations and then subsequently applies a learned affine transform. In this paper, we aim to understand the role and expressive power of affine parameters used to transform features in this way. To isolate the contribution of these parameters from that of the learned features they transform, we investigate the performance achieved when training only these parameters in BatchNorm and freezing all weights at their random initializations. Doing so leads to surprisingly high performance considering the significant limitations that this style of training imposes. For example, sufficiently deep ResNets reach 82% (CIFAR-10) and 32% (ImageNet, top-5) accuracy in this configuration, far higher than when training an equivalent number of randomly chosen parameters elsewhere in the network. BatchNorm achieves this performance in part by naturally learning to disable around a third of the random features. Not only do these results highlight the expressive power of affine parameters in deep learning, but, in a broader sense, they characterize the expressive power of neural networks constructed simply by shifting and rescaling random features.
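To make the training configuration concrete, the following is a minimal PyTorch sketch of the setup described above, not the authors' released code: every weight is frozen at its random initialization, and only the BatchNorm affine parameters, the per-channel scale gamma and shift beta, receive gradients. The choice of resnet50, optimizer, and hyperparameters here is purely illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Randomly initialized network; the depth and architecture are illustrative.
model = resnet50()

# Freeze every parameter at its random initialization ...
for param in model.parameters():
    param.requires_grad = False

# ... then re-enable gradients only for the BatchNorm affine parameters.
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        module.weight.requires_grad = True  # gamma: per-channel scale
        module.bias.requires_grad = True    # beta: per-channel shift

# Optimize only the trainable (BatchNorm affine) parameters;
# the learning rate and momentum are assumed values, not the paper's.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9)
```

Note that BatchNorm's running mean and variance are buffers rather than parameters, so the normalization statistics still update as usual during training; only gamma and beta are learned.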