The superior performance of modern deep networks usually comes at the price of a costly training procedure. In this paper, we present a novel curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers). The proposed method is inspired by the phenomenon that deep networks mainly learn to recognize some 'easier-to-learn' discriminative patterns within each example at earlier stages of training, e.g., the lower-frequency components of images and the original information before data augmentation. Driven by this observation, we propose a curriculum where the model always leverages all the training data at each epoch, while the curriculum starts with only exposing the 'easier-to-learn' patterns of each example, and introduces gradually more difficult patterns. To implement this idea, we 1) introduce a cropping operation in the Fourier spectrum of the inputs, which enables the model to learn from only the lower-frequency components efficiently, and 2) demonstrate that exposing the features of original images amounts to adopting weaker data augmentation. Our resulting algorithm, EfficientTrain, is simple, general, yet surprisingly effective. For example, it reduces the training time of a wide variety of popular models (e.g., ConvNeXts, DeiT, PVT, and Swin/CSWin Transformers) by more than ${1.5\times}$ on ImageNet-1K/22K without sacrificing the accuracy. It is effective for self-supervised learning (i.e., MAE) as well. Code is available at https://github.com/LeapLabTHU/EfficientTrain.
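The low-frequency cropping operation described above can be sketched with NumPy. This is a minimal illustration under our own assumptions (function name, bandwidth parameter, and rescaling choice are ours, not from the paper's released code): we take the 2-D FFT of an image, crop a centered `band_width`-sized window from the shifted spectrum so only the lowest frequencies remain, and invert the FFT to obtain a smaller image that preserves the smooth, "easier-to-learn" content.

```python
import numpy as np

def low_freq_crop(img, band_width):
    """Keep only the lowest frequencies of a (H, W) image by cropping a
    centered band_width x band_width window from its shifted 2-D Fourier
    spectrum, then inverting the FFT. Returns a smaller real-valued image.
    Note: a sketch for illustration; the paper's actual implementation
    may differ in normalization and windowing details."""
    spec = np.fft.fftshift(np.fft.fft2(img))      # low frequencies at center
    h, w = spec.shape
    ch, cw = h // 2, w // 2
    b = band_width // 2
    cropped = spec[ch - b:ch + b, cw - b:cw + b]  # central low-freq window
    # Rescale so the output intensity range matches the original image
    cropped = cropped * (band_width * band_width) / (h * w)
    return np.real(np.fft.ifft2(np.fft.ifftshift(cropped)))

# Toy example: a smooth gradient is dominated by low frequencies,
# so it survives the cropping nearly intact at a smaller resolution.
img = np.outer(np.linspace(0, 1, 224), np.linspace(0, 1, 224))
small = low_freq_crop(img, 96)
print(small.shape)  # (96, 96)
```

Because the cropped spectrum is smaller, the inverse FFT directly yields a lower-resolution input, which is what makes early-epoch training on low-frequency content computationally cheaper rather than merely a filtering step.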