Effective Vision Transformer Training: A Data-Centric Perspective

Vision Transformers (ViTs) have shown promising performance compared with Convolutional Neural Networks (CNNs), but training ViTs is much harder than training CNNs. In this paper, we define several metrics, including Dynamic Data Proportion (DDP) and Knowledge Assimilation Rate (KAR), to investigate the training process, and accordingly divide it into three periods: formation, growth, and exploration. In particular, at the last stage of training, we observe that only a tiny portion of the training examples is used to optimize the model. Given the data-hungry nature of ViTs, we ask a simple but important question: is it possible to provide abundant "effective" training examples at every stage of training? Answering it requires addressing two critical questions, i.e., how to measure the "effectiveness" of individual training examples, and how to systematically generate enough "effective" examples when they are running out. For the first question, we find that the "difficulty" of a training sample can serve as an indicator of its "effectiveness". For the second question, we propose to dynamically adjust the "difficulty" distribution of the training data across these evolution stages. To achieve both purposes, we propose a novel data-centric ViT training framework that dynamically measures the "difficulty" of training samples and generates "effective" samples for models at different training stages. In addition, to further enlarge the number of "effective" samples and alleviate overfitting in the late training stage of ViTs, we propose a patch-level erasing strategy dubbed PatchErasing. Extensive experiments demonstrate the effectiveness of the proposed data-centric ViT training framework and techniques.
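
The abstract does not say how PatchErasing chooses which patches to erase or what it writes in their place, so the following is only a generic illustration of patch-level erasing, not the authors' method. It assumes a single-channel image stored as a nested list, a non-overlapping square patch grid, uniformly random patch selection, and a constant fill value. The names `erase_patches`, `patch_size`, `erase_ratio`, and `fill` are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Image = List[List[float]]


def erase_patches(
    image: Image,
    patch_size: int,
    erase_ratio: float,
    rng: Optional[random.Random] = None,
    fill: float = 0.0,
) -> Tuple[Image, List[Tuple[int, int]]]:
    """Return a copy of `image` with a random subset of patches set to `fill`.

    Assumption: patches are chosen uniformly at random. The source abstract
    does not specify the selection rule used by PatchErasing.
    """
    if not 0.0 <= erase_ratio <= 1.0:
        raise ValueError("erase_ratio must be in [0, 1]")
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    if rng is None:
        rng = random.Random()

    # Enumerate the non-overlapping patch grid as (row, col) indices.
    grid = [
        (r, c)
        for r in range(height // patch_size)
        for c in range(width // patch_size)
    ]
    n_erase = round(erase_ratio * len(grid))
    erased = rng.sample(grid, n_erase)

    # Copy the rows so the input image is left untouched.
    out = [row[:] for row in image]
    for pr, pc in erased:
        for y in range(pr * patch_size, (pr + 1) * patch_size):
            for x in range(pc * patch_size, (pc + 1) * patch_size):
                out[y][x] = fill
    return out, erased


if __name__ == "__main__":
    img = [[1.0] * 8 for _ in range(8)]
    augmented, erased = erase_patches(
        img, patch_size=4, erase_ratio=0.5, rng=random.Random(0)
    )
    print(len(erased), sum(map(sum, augmented)))
```

In a training loop, `erase_ratio` could in principle be raised in later epochs so that already-learned images become harder again, which matches the abstract's stated goal of supplying more "effective" samples late in training. Such a schedule is an assumption here, not something the abstract describes.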
