Typical neural network trainings have substantial variance in test-set performance between repeated runs, impeding hyperparameter comparison and training reproducibility. We present the following results towards understanding this variation. (1) Despite having significant variance on their test sets, we demonstrate that standard CIFAR-10 and ImageNet trainings have very little variance in performance on the test distributions from which those test sets are sampled, suggesting that variance is less of a practical issue than previously thought. (2) We present a simplifying statistical assumption that closely approximates the structure of the test-set accuracy distribution. (3) We argue that test-set variance is inevitable in two senses. First, we show that variance is largely caused by the high sensitivity of the training process to initial conditions, rather than by specific sources of randomness such as data order and augmentation. Second, we prove that variance is unavoidable given the observation that ensembles of trained networks are well-calibrated. (4) We conduct preliminary studies of distribution shift, fine-tuning, data augmentation, and learning rate through the lens of variance between runs.
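As a rough illustration of point (2), one simple candidate assumption is that each trained network's test-set accuracy behaves like a binomial draw around a shared distribution-level accuracy. The abstract does not state the paper's actual assumption, so the sketch below is a hedged toy model only: the values `p = 0.94`, `n_test = 10_000`, and `n_runs = 1_000` are hypothetical, chosen to resemble CIFAR-10 scale.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.94        # hypothetical accuracy on the underlying test distribution
n_test = 10_000  # test-set size (CIFAR-10 scale)
n_runs = 1_000   # number of repeated training runs being simulated

# Under the toy binomial assumption, each run's test-set accuracy is the
# fraction of n_test independent Bernoulli(p) examples it gets correct.
correct = rng.binomial(n_test, p, size=n_runs)
accs = correct / n_test

# Binomial theory predicts std ≈ sqrt(p * (1 - p) / n_test) ≈ 0.0024 here,
# i.e. run-to-run test-set variance can arise purely from finite test-set
# sampling even when distribution-level performance is identical.
print(f"mean accuracy: {accs.mean():.4f}")
print(f"std across runs: {accs.std():.4f}")
```

Under this toy model, the spread across runs shrinks as `1 / sqrt(n_test)`, which is one way to reconcile sizeable test-set variance with the near-constant test-distribution performance described in point (1).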