Training neural networks with batch normalization and weight decay has become a common practice in recent years. In this work, we show that their combined use may result in surprising periodic behavior of the optimization dynamics: the training process regularly exhibits destabilizations that, however, do not lead to complete divergence but instead initiate a new period of training. We rigorously investigate the mechanism underlying the discovered periodic behavior from both empirical and theoretical points of view and analyze the conditions under which it occurs in practice. We also demonstrate that the periodic behavior can be regarded as a generalization of two previously opposing perspectives on training with batch normalization and weight decay, namely the equilibrium presumption and the instability presumption.
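The following is a minimal sketch, not the authors' experimental setup, of the kind of training run in which the described phenomenon could be looked for: a small convolutional network with batch normalization, trained by SGD with weight decay applied to all parameters, while the training loss is logged so that recurring destabilizations (sudden loss spikes followed by recovery) would be visible. The dataset, architecture, and hyperparameters below are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic classification data standing in for a real dataset (assumption).
x = torch.randn(2048, 3, 32, 32)
y = torch.randint(0, 10, (2048,))

# Small conv net with batch normalization layers.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
criterion = nn.CrossEntropyLoss()

# Weight decay is applied to all parameters, including those feeding BN layers,
# which is the setting in which the periodic behavior is reported to arise.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

batch_size = 128
for epoch in range(200):
    perm = torch.randperm(x.size(0))
    epoch_loss = 0.0
    for i in range(0, x.size(0), batch_size):
        idx = perm[i:i + batch_size]
        optimizer.zero_grad()
        loss = criterion(model(x[idx]), y[idx])
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * idx.size(0)
    # Periodic behavior would show up as recurring spikes in this curve.
    print(f"epoch {epoch:3d}  train loss {epoch_loss / x.size(0):.4f}")
```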