Prior works have proposed several strategies to reduce the computational cost of the self-attention mechanism. Many of them decompose the self-attention procedure into regional and local feature extraction procedures, each of which incurs a much smaller computational complexity. However, regional information typically comes at the expense of undesirable information loss caused by down-sampling. In this paper, we propose a novel Transformer architecture, named Dual Vision Transformer (Dual-ViT), that aims to mitigate this cost issue. The new architecture incorporates a critical semantic pathway that efficiently compresses token vectors into global semantics at a reduced order of complexity. The compressed global semantics then serve as useful prior information for learning finer pixel-level details through a second, pixel pathway. The semantic and pixel pathways are integrated and jointly trained, spreading the enhanced self-attention information in parallel through both pathways. Dual-ViT is hence able to reduce the computational complexity without compromising much accuracy. We empirically demonstrate that Dual-ViT achieves higher accuracy than state-of-the-art Transformer architectures at reduced training complexity. Source code is available at \url{https://github.com/YehLi/ImageNetModel}.
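To make the dual-pathway idea concrete, below is a minimal sketch of one such block, assuming hypothetical layer names, token counts, and dimensions (the actual implementation is in the repository above). The semantic pathway compresses the $N$ pixel tokens into a small number $M$ of global semantic tokens, so its self-attention costs $O(M^2)$ rather than $O(N^2)$; the pixel pathway then treats these semantics as a prior by cross-attending from pixel tokens to them at $O(NM)$ cost.
\begin{verbatim}
import torch
import torch.nn as nn

class DualPathwayBlock(nn.Module):
    """Minimal sketch of a dual-pathway attention block.

    Hypothetical layer names and shapes; not the authors'
    exact implementation. The semantic pathway compresses N
    pixel tokens into M << N global semantic tokens, so its
    self-attention costs O(M^2) instead of O(N^2); the pixel
    pathway cross-attends to the semantics at O(N*M) cost.
    """

    def __init__(self, dim: int, num_semantic_tokens: int = 16,
                 num_heads: int = 8):
        super().__init__()
        # Learnable queries that pool pixel tokens into global semantics.
        self.semantic_queries = nn.Parameter(
            torch.randn(1, num_semantic_tokens, dim))
        self.pool_attn = nn.MultiheadAttention(dim, num_heads,
                                               batch_first=True)
        # Self-attention among the few semantic tokens: cheap O(M^2).
        self.semantic_attn = nn.MultiheadAttention(dim, num_heads,
                                                   batch_first=True)
        # Pixel pathway: pixel tokens attend to semantic tokens: O(N*M).
        self.pixel_attn = nn.MultiheadAttention(dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) pixel tokens.
        b = x.shape[0]
        q = self.semantic_queries.expand(b, -1, -1)
        # Compress N pixel tokens into M semantic tokens.
        semantics, _ = self.pool_attn(q, x, x)
        # Refine the global semantics with cheap self-attention.
        semantics, _ = self.semantic_attn(semantics, semantics, semantics)
        # Use the compressed semantics as a prior via cross-attention.
        out, _ = self.pixel_attn(x, semantics, semantics)
        return self.norm(x + out)

# Usage: 196 pixel tokens (14x14), compressed to 16 semantic tokens.
tokens = torch.randn(2, 196, 256)
block = DualPathwayBlock(dim=256)
print(block(tokens).shape)  # torch.Size([2, 196, 256])
\end{verbatim}
Note how both pathways update in the same forward pass, mirroring the joint training of the semantic and pixel pathways described above; the overall cost is $O(M^2 + NM)$ per block instead of the $O(N^2)$ of full self-attention.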