The superior performance of modern deep networks usually comes with a costly training procedure. This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers). Our work is inspired by the inherent learning dynamics of deep networks: we experimentally show that at an earlier training stage, the model mainly learns to recognize certain 'easier-to-learn' discriminative patterns within each example, e.g., the lower-frequency components of images and the original information preserved before data augmentation. Driven by this phenomenon, we propose a curriculum in which the model always leverages all the training data at each epoch, but starts by exposing only the 'easier-to-learn' patterns of each example and gradually introduces more difficult patterns. To implement this idea, we 1) introduce a cropping operation in the Fourier spectrum of the inputs, which enables the model to learn efficiently from only the lower-frequency components, 2) demonstrate that exposing the features of original images amounts to adopting weaker data augmentation, and 3) integrate 1) and 2) to design a curriculum learning schedule with a greedy-search algorithm. The resulting approach, EfficientTrain, is simple, general, yet surprisingly effective. Without hyper-parameter tuning, it reduces the training wall-time of a wide variety of popular models (e.g., ResNet, ConvNeXt, DeiT, PVT, Swin, and CSWin) by >1.5x on ImageNet-1K/22K without sacrificing accuracy. It is also effective for self-supervised learning (e.g., MAE). Code is available at https://github.com/LeapLabTHU/EfficientTrain.
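As a concrete illustration of step 1), cropping in the Fourier spectrum can be realized with a 2D FFT. The following is a minimal PyTorch sketch under stated assumptions, not the paper's exact implementation (see the linked repository for that): the function name `low_freq_crop`, the `bandwidth` parameter, and the magnitude rescaling are illustrative choices.

```python
import torch

def low_freq_crop(images: torch.Tensor, bandwidth: int) -> torch.Tensor:
    """Keep only the central (low-frequency) bandwidth x bandwidth window
    of the Fourier spectrum, then map back to pixel space.

    images:    (N, C, H, W) float tensor
    bandwidth: side length B of the retained window, with B <= min(H, W)
    """
    _, _, h, w = images.shape
    # 2D FFT with the zero-frequency component shifted to the center
    spectrum = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    # Crop the central B x B window, i.e., the lower-frequency components
    top, left = (h - bandwidth) // 2, (w - bandwidth) // 2
    cropped = spectrum[..., top:top + bandwidth, left:left + bandwidth]
    # Inverse FFT back to a B x B image in pixel space
    images_lf = torch.fft.ifft2(torch.fft.ifftshift(cropped, dim=(-2, -1)))
    # Rescaling assumption: compensate for the FFT normalization so the
    # mean intensity of the output matches that of the input
    return images_lf.real * (bandwidth ** 2 / (h * w))
```

Besides discarding high-frequency information, this operation shrinks the spatial size of the inputs from H x W to B x B, which is where the training-cost savings in the earlier epochs would come from.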