In this paper, we study the vision transformer structure at the mobile level and find a dramatic performance drop. We analyze the reason behind this phenomenon and propose a novel irregular patch embedding module and an adaptive patch fusion module to improve performance. We conjecture that vision transformer blocks (which consist of multi-head attention and a feed-forward network) are better suited to handling high-level information than low-level features. The irregular patch embedding module extracts patches with different receptive fields that contain rich high-level information, from which the transformer blocks can obtain the most useful information. The processed patches then pass through the adaptive patch fusion module to produce the final features for the classifier. With our proposed improvements, the traditional uniform vision transformer structure achieves state-of-the-art results at the mobile level. We improve the DeiT baseline by more than 9\% under mobile-level settings and surpass other transformer architectures such as Swin and CoaT by a large margin.
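The abstract does not specify the internals of the two proposed modules, so the following is only a minimal PyTorch sketch of one plausible instantiation of the described pipeline: multi-receptive-field patch extraction, standard transformer blocks, and weighted token fusion before the classifier. All module names, dimensions, the choice of convolutional stems, and the attention-pooling fusion are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class IrregularPatchEmbedding(nn.Module):
    """Hypothetical sketch: extract patch tokens at several receptive
    fields via convolutional stems with different kernels/strides, then
    concatenate the resulting token sequences."""
    def __init__(self, in_chans=3, embed_dim=192, configs=((4, 4), (8, 8), (16, 16))):
        super().__init__()
        self.stems = nn.ModuleList([
            nn.Conv2d(in_chans, embed_dim, kernel_size=k, stride=s)
            for k, s in configs
        ])

    def forward(self, x):
        # Each stem covers the image at a different granularity:
        # (B, C, H, W) -> (B, H'*W', embed_dim) per stem.
        tokens = [stem(x).flatten(2).transpose(1, 2) for stem in self.stems]
        return torch.cat(tokens, dim=1)  # (B, N_total, embed_dim)

class AdaptivePatchFusion(nn.Module):
    """Hypothetical sketch: fuse processed tokens into one feature vector
    with learned per-token weights (a simple attention-pooling stand-in)."""
    def __init__(self, embed_dim=192):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, tokens):
        w = torch.softmax(self.score(tokens), dim=1)  # (B, N, 1)
        return (w * tokens).sum(dim=1)                # (B, embed_dim)

# Pipeline: irregular embedding -> transformer blocks -> adaptive fusion -> classifier.
embed = IrregularPatchEmbedding()
blocks = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=3, dim_feedforward=768, batch_first=True),
    num_layers=4,
)
fuse = AdaptivePatchFusion()
head = nn.Linear(192, 1000)  # e.g. ImageNet-1k classifier

x = torch.randn(2, 3, 224, 224)
logits = head(fuse(blocks(embed(x))))
print(logits.shape)  # torch.Size([2, 1000])
```

Under these assumptions, the 4/8/16-stride stems yield 3136, 784, and 196 tokens respectively on a 224x224 input, so the transformer blocks attend jointly across granularities before the fusion step pools them for classification.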