Inspired by biological evolution, we explain the rationality of the Vision Transformer by analogy with the proven, practical Evolutionary Algorithm (EA) and show that the two share a consistent mathematical formulation. Analogous to the dynamic local population in EA, we improve the existing transformer structure into a more efficient EAT model and design task-related heads to handle different tasks more flexibly. Moreover, we introduce the space-filling curve into the current vision transformer to serialize image data into a uniform sequential format. Thus we can design a unified EAT framework to address multi-modal tasks, separating the network architecture from the data format adaptation. Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works, while using fewer parameters and achieving greater throughput. We further conduct multi-modal experiments, e.g., text-based image retrieval, to demonstrate the superiority of the unified EAT; our approach improves rank-1 accuracy by +3.7 points over the baseline on the CSS dataset.
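As a concrete illustration of the serialization step, the sketch below flattens a 2-D grid of pixels or patch embeddings into a 1-D token sequence along a Hilbert curve. This is a minimal sketch under our own assumptions: the abstract does not name the specific space-filling curve used, and `hilbert_d2xy` / `serialize_image` are hypothetical helper names rather than functions from the paper's code.

```python
import numpy as np

def hilbert_d2xy(n, d):
    """Map a 1-D Hilbert index d to (x, y) on an n x n grid (n a power of two)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate the quadrant so the curve stays continuous
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def serialize_image(img):
    """Flatten an (n, n, c) image into an (n*n, c) sequence along the Hilbert curve."""
    n = img.shape[0]
    coords = [hilbert_d2xy(n, d) for d in range(n * n)]
    # (x, y) from the curve is (column, row), hence img[y, x]
    return np.stack([img[y, x] for x, y in coords])

# Example: serialize a 4x4 grid of 3-channel "patches" into a 16-token sequence.
tokens = serialize_image(np.random.rand(4, 4, 3))
print(tokens.shape)  # (16, 3)
```

Unlike a plain raster scan, a space-filling curve keeps 2-D neighbors close together in the 1-D sequence, which is presumably what lets the same sequential format serve image and non-image modalities alike.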