CutMix is a popular augmentation technique commonly used for training modern convolutional and transformer vision networks. It was originally designed to encourage Convolutional Neural Networks (CNNs) to focus more on an image's global context rather than local information, which greatly improves the performance of CNNs. However, we find it to have limited benefits for transformer-based architectures that naturally have a global receptive field. In this paper, we propose a novel data augmentation technique, TokenMix, to improve the performance of vision transformers. TokenMix mixes two images at the token level by partitioning the mixing region into multiple separated parts. In addition, we show that the mixed learning target in CutMix, a linear combination of a pair of ground-truth labels, may be inaccurate and sometimes counter-intuitive. To obtain a more suitable target, we propose to assign the target score according to the content-based neural activation maps of the two images from a pre-trained teacher model, which does not need to be high-performing. Through extensive experiments on various vision transformer architectures, we show that TokenMix helps vision transformers focus on the foreground area to infer the classes and enhances their robustness to occlusion, with consistent performance gains. Notably, we improve DeiT-T/S/B by +1% ImageNet top-1 accuracy. Moreover, TokenMix benefits from longer training, achieving 81.2% top-1 accuracy on ImageNet with DeiT-S trained for 400 epochs. Code is available at https://github.com/Sense-X/TokenMix.
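As a rough illustration of the idea, the sketch below mixes two images on a patch-aligned mask made of scattered token positions and builds the soft target from a teacher's per-token activations rather than the area ratio. The `token_scores_fn` interface, the patch size, and the softmax normalization are assumptions for illustration only, not the released implementation (see the repository above).

```python
import torch

def token_mix(img_a, img_b, label_a, label_b, token_scores_fn,
              mix_ratio=0.5, patch_size=16, num_classes=1000):
    """Token-level mixing sketch (not the official code).

    `token_scores_fn(img)` is an assumed teacher interface: it returns
    per-token class scores with shape (grid_h * grid_w, num_classes).
    """
    _, H, W = img_a.shape
    gh, gw = H // patch_size, W // patch_size          # token grid, e.g. 14x14
    num_tokens = gh * gw
    num_mixed = int(num_tokens * mix_ratio)

    # Scatter the mixed region over several separated token positions,
    # instead of a single rectangle as in CutMix.
    token_mask = torch.zeros(num_tokens)
    token_mask[torch.randperm(num_tokens)[:num_mixed]] = 1.0

    # Expand the token mask to pixel resolution and paste patches of img_b.
    pixel_mask = (token_mask.view(gh, gw)
                  .repeat_interleave(patch_size, 0)
                  .repeat_interleave(patch_size, 1))
    mixed = img_a * (1 - pixel_mask) + img_b * pixel_mask

    # Content-based target: each label is weighted by how much of the
    # teacher's activation for that class survives in the mixed image.
    with torch.no_grad():
        act_a = token_scores_fn(img_a)[:, label_a].softmax(dim=0)  # (num_tokens,)
        act_b = token_scores_fn(img_b)[:, label_b].softmax(dim=0)
    score_a = (act_a * (1 - token_mask)).sum()
    score_b = (act_b * token_mask).sum()
    target = torch.zeros(num_classes)
    target[label_a] = score_a / (score_a + score_b)
    target[label_b] = score_b / (score_a + score_b)
    return mixed, target
```

In this sketch the mask is sampled uniformly at random over tokens; any scheme that yields multiple separated token-aligned parts fits the same structure.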