Batch normalization (BN) is a ubiquitous technique for training deep neural networks that accelerates their convergence and helps them reach higher accuracy. However, we demonstrate that BN comes with a fundamental drawback: it incentivizes the model to rely on low-variance features that are highly specific to the training (in-domain) data, hurting generalization performance on out-of-domain examples. In this work, we investigate this phenomenon by first showing that removing BN layers across a wide range of architectures leads to lower out-of-domain and corruption errors at the cost of higher in-domain errors. We then propose Counterbalancing Teacher (CT), a method that leverages a frozen copy of the same model without BN as a teacher, enforcing the student network's learning of robust representations through a consistency loss that substantially adapts the student's weights. This regularization signal helps CT perform well under unforeseen data shifts, even without access to target-domain information, which prior works require. We theoretically show, in an overparameterized linear regression setting, why normalization leads to a model's reliance on such in-domain features, and we empirically demonstrate the efficacy of CT, which outperforms several baselines on robustness benchmarks such as CIFAR-10-C, CIFAR-100-C, and VLCS.
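The abstract leaves the training procedure at a high level; the following is a minimal PyTorch-style sketch of how such a teacher-student setup might look, not the paper's actual implementation. The factory `make_model` and the `consistency_weight` hyperparameter are hypothetical names, and MSE between logits is an illustrative stand-in for whatever consistency loss the paper uses.

```python
import torch
import torch.nn.functional as F

def build_ct_pair(make_model):
    """Build the (student, teacher) pair for a Counterbalancing-Teacher-style setup.

    `make_model(use_bn)` is a hypothetical factory that returns the same
    architecture with or without BN layers.
    """
    student = make_model(use_bn=True)    # standard BN model, trained as usual
    teacher = make_model(use_bn=False)   # same architecture with BN removed
    # Copy the shared weights into the teacher; strict=False skips the
    # BN-specific parameters that the teacher does not have.
    teacher.load_state_dict(student.state_dict(), strict=False)
    for p in teacher.parameters():       # freeze the teacher
        p.requires_grad_(False)
    teacher.eval()
    return student, teacher

def ct_loss(student, teacher, x, y, consistency_weight=1.0):
    """Supervised loss plus a consistency term against the frozen teacher."""
    student_logits = student(x)
    with torch.no_grad():                # no gradients flow to the teacher
        teacher_logits = teacher(x)
    ce = F.cross_entropy(student_logits, y)
    consistency = F.mse_loss(student_logits, teacher_logits)
    return ce + consistency_weight * consistency
```

Because the teacher is frozen, gradients flow only to the student; the consistency term pulls the BN student toward the predictions of its BN-free counterpart, discouraging it from drifting onto the BN-specific low-variance features the abstract warns about.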
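The theoretical claim about overparameterized linear regression is proved in the paper itself; the sketch below is only an illustrative version of the standard setup, assumed here to convey the intuition, not the paper's exact theorem.

```latex
% Illustrative (assumed) setup: overparameterized linear regression
% with per-feature normalization, X in R^{n x d}, d > n, full row rank.
\[
  \tilde{X} = X \,\mathrm{diag}(\sigma_1,\dots,\sigma_d)^{-1}, \qquad
  \hat{w} \;=\; \arg\min_{w}\bigl\{\, \|w\|_2 \;:\; \tilde{X} w = y \,\bigr\}
  \;=\; \tilde{X}^{\top}\bigl(\tilde{X}\tilde{X}^{\top}\bigr)^{-1} y .
\]
% Gradient descent from w_0 = 0 converges to this minimum-norm interpolant.
% In raw coordinates the effective coefficient on feature j is
% \hat{w}_j / \sigma_j: a low-variance feature (small \sigma_j) is amplified
% by 1/\sigma_j, so interpolating through it is "cheap" in norm -- which
% encourages reliance on features that barely vary in-domain but can shift
% arbitrarily out-of-domain.
\]
```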