Pure transformers have recently shown great potential for vision tasks. However, their accuracy on small and medium-sized datasets is not satisfactory. Although some existing methods introduce a CNN as a teacher to guide training through distillation, the gap between the teacher and student networks leads to sub-optimal performance. In this work, we propose OVO, a new One-shot Vision transformer search framework with Online distillation. OVO samples sub-nets for both the teacher and the student networks to obtain better distillation results. Thanks to online distillation, thousands of sub-nets in the supernet are well trained without extra fine-tuning or retraining. In experiments, OVO-Ti achieves 73.32% top-1 accuracy on ImageNet and 75.2% on CIFAR-100.
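The abstract does not specify the search space, the loss, or how the teacher and student sub-nets are chosen. As a rough illustration of the idea of sampling two sub-nets and distilling one into the other, the following is a minimal sketch under stated assumptions: the `SEARCH_SPACE` dimensions, the uniform sampler `sample_subnet`, and the temperature-softened KL loss in `distillation_loss` are all hypothetical stand-ins (the KL form is the common knowledge-distillation formulation, not necessarily the one OVO uses), and the logits are toy values rather than outputs of a real network.

```python
import math
import random

# Hypothetical search space: the abstract does not list the real dimensions.
SEARCH_SPACE = {
    "depth": [12, 13, 14],
    "embed_dim": [192, 216, 240],
    "num_heads": [3, 4],
}


def sample_subnet(rng):
    """Draw one sub-net configuration uniformly from the search space (assumed sampler)."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


rng = random.Random(0)

# In each training step, one sub-net plays the teacher and another the student.
teacher_cfg = sample_subnet(rng)
student_cfg = sample_subnet(rng)

# Toy logits standing in for the two sub-nets' predictions on one image.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.2, 0.9, -0.3]

loss = distillation_loss(student_logits, teacher_logits)
print("teacher:", teacher_cfg)
print("student:", student_cfg)
print("distillation loss:", loss)
```

In a real supernet, both sets of logits would come from forward passes through weight-sharing sub-nets, and the distillation term would typically be combined with a supervised loss on the ground-truth labels; how OVO weights or schedules these terms is not stated in the abstract.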