The transformer is a potentially powerful architecture for vision tasks. Although it has more parameters and an attention mechanism, it is not yet as dominant as the CNN. CNNs are usually computationally cheaper and remain the leading competitor in various vision tasks. One research direction is to adopt the successful ideas of CNNs to improve transformers, but this often relies on elaborate, heuristic network design. Observing that transformers and CNNs are complementary in representation learning and convergence speed, we propose an efficient training framework called Vision Pair Learning (VPL) for the image classification task. VPL builds a network composed of a transformer branch, a CNN branch, and a pair learning module. With a multi-stage training strategy, VPL enables the branches to learn from their partners during the appropriate stage of training, so that both achieve better performance at lower time cost. Without external data, VPL raises the top-1 accuracy of ViT-Base and ResNet-50 on the ImageNet-1k validation set to 83.47% and 79.61%, respectively. Experiments on other datasets from various domains demonstrate the efficacy of VPL and suggest that the transformer performs better when paired with a differently structured CNN in VPL. We also analyze the importance of each component through an ablation study.
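To make the idea of two branches learning from their partners during a chosen training stage more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the pair learning module or the stage schedule, so everything here is an assumption made for illustration: the symmetric KL-divergence term (in the style of mutual distillation), the hard on/off stage window, and all names (pair_weight, pair_losses, start_epoch, end_epoch).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; q is clamped to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def pair_weight(epoch, start_epoch, end_epoch):
    """Hypothetical stage schedule (an assumption, not from the paper):
    partners learn from each other only inside [start_epoch, end_epoch)."""
    return 1.0 if start_epoch <= epoch < end_epoch else 0.0


def pair_losses(vit_logits, cnn_logits, label, epoch, start_epoch=10, end_epoch=90):
    """Return (transformer_loss, cnn_loss) for one sample.

    Each branch gets its own supervised loss plus, during the pairing
    stage, an assumed KL term pulling it toward its partner's prediction.
    """
    p_vit = softmax(vit_logits)
    p_cnn = softmax(cnn_logits)
    w = pair_weight(epoch, start_epoch, end_epoch)
    # The partner's distribution is treated as a fixed target for each branch.
    loss_vit = cross_entropy(p_vit, label) + w * kl_divergence(p_cnn, p_vit)
    loss_cnn = cross_entropy(p_cnn, label) + w * kl_divergence(p_vit, p_cnn)
    return loss_vit, loss_cnn
```

Outside the assumed pairing window, each branch is trained with its supervised loss only. Inside the window, a disagreement between the branches adds a penalty to both losses. In a real training loop the per-sample logits would come from the transformer and CNN branches of the same network, and the partner's prediction would typically be detached from the gradient graph.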