CutMix is a popular augmentation technique for training modern convolutional and transformer vision networks. It was originally designed to encourage Convolutional Neural Networks (CNNs) to focus more on an image's global context than on local information, which greatly improves their performance. However, we found that it offers limited benefit for transformer-based architectures, which naturally have a global receptive field. In this paper, we propose TokenMix, a novel data augmentation technique that improves the performance of vision transformers. TokenMix mixes two images at the token level by partitioning the mixing region into multiple separated parts. We also show that the mixed learning target in CutMix, a linear combination of the pair of ground-truth labels, can be inaccurate and sometimes counter-intuitive. To obtain a more suitable target, we propose assigning the target score according to content-based neural activation maps of the two images from a pre-trained teacher model, which does not need to have high performance. Through extensive experiments on various vision transformer architectures, we show that TokenMix helps vision transformers focus on the foreground area when inferring classes and enhances their robustness to occlusion, with consistent performance gains. Notably, we improve DeiT-T/S/B by +1% ImageNet top-1 accuracy. TokenMix also benefits from longer training, achieving 81.2% top-1 accuracy on ImageNet with DeiT-S trained for 400 epochs. Code is available at https://github.com/Sense-X/TokenMix.
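The two ideas in the abstract, mixing at the token level and weighting the target by activation rather than by area, can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes that tokens to swap are sampled independently and uniformly, whereas the abstract only says the mixing region is split into multiple separated parts, and that the teacher provides one non-negative activation value per token. All function names are hypothetical.

```python
import random


def token_mix_mask(num_tokens, mask_ratio, rng):
    """Return a 0/1 list over tokens; 1 marks a token taken from image B.

    Assumption: tokens are picked independently at random, so the mixed
    region is scattered instead of being one rectangle as in CutMix.
    """
    num_masked = round(num_tokens * mask_ratio)
    chosen = set(rng.sample(range(num_tokens), num_masked))
    return [1 if i in chosen else 0 for i in range(num_tokens)]


def mix_tokens(tokens_a, tokens_b, mask):
    """Build the mixed token sequence: B where mask is 1, A elsewhere."""
    return [b if m else a for a, b, m in zip(tokens_a, tokens_b, mask)]


def area_weights(mask):
    """CutMix-style label weights: proportional to the area of each image."""
    lam_b = sum(mask) / len(mask)
    return 1.0 - lam_b, lam_b


def activation_weights(act_a, act_b, mask):
    """Content-based label weights from per-token teacher activations.

    Each image's weight is the share of its own activation that survives
    mixing. Assumes non-negative activations with a positive sum. The two
    weights need not sum to 1.
    """
    kept_a = sum(a for a, m in zip(act_a, mask) if not m)
    kept_b = sum(b for b, m in zip(act_b, mask) if m)
    return kept_a / sum(act_a), kept_b / sum(act_b)


if __name__ == "__main__":
    # Toy example with 4 tokens: image A's object sits mostly in token 0,
    # which is exactly the token replaced by image B.
    mask = [1, 0, 0, 0]
    act_a = [0.5, 0.25, 0.125, 0.125]
    act_b = [0.25, 0.25, 0.25, 0.25]
    print(mix_tokens(["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"], mask))
    print("area-based weights:      ", area_weights(mask))
    print("activation-based weights:", activation_weights(act_a, act_b, mask))
    print(token_mix_mask(16, 0.5, random.Random(0)))
```

In the toy example, image A keeps three quarters of the area but only half of its activation, so an area-based target would overstate how much of A's object remains visible. This is the kind of mismatch the abstract describes as inaccurate or counter-intuitive.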