The vision transformer (ViT) has achieved state-of-the-art results on a variety of vision tasks. It uses a learnable position embedding (PE) to encode the location of each image patch. However, it remains unclear whether this learnable PE is truly necessary and what benefits it provides. This paper explores two alternative ways of encoding the locations of individual patches that exploit prior knowledge about their spatial arrangement. One is the sequence relationship embedding (SRE), and the other is the circle relationship embedding (CRE). The SRE treats all patches as an ordered sequence in which adjacent patches are separated by the same interval. The CRE treats the central patch as the center of a circle and measures the distance of every remaining patch from that center according to the four-neighborhood principle; multiple concentric circles with different radii then group the patches. Finally, we implemented these two relationship embeddings in three classic ViTs and evaluated them on four popular datasets. Experiments show that SRE and CRE can replace PE, reducing the number of randomly initialized learnable parameters while achieving the same performance, and that combining SRE or CRE with PE outperforms using PE alone.
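The abstract does not give an implementation, but the two index schemes it describes can be sketched as follows, assuming a square patch grid and interpreting the four-neighborhood rule as Manhattan distance from the central patch (function names are hypothetical, not from the paper):

```python
import numpy as np

def sre_indices(grid: int) -> np.ndarray:
    # SRE: patches form an ordered sequence with equal spacing between
    # neighbors, so each patch receives its raster-scan position index.
    return np.arange(grid * grid).reshape(grid, grid)

def cre_indices(grid: int) -> np.ndarray:
    # CRE: the central patch is the circle's center; each remaining patch
    # is assigned the ring given by its four-neighborhood (Manhattan)
    # distance from that center, so patches on the same concentric
    # circle share one index.
    c = (grid - 1) / 2  # geometric center (between patches for even grids)
    ys, xs = np.mgrid[0:grid, 0:grid]
    return (np.abs(ys - c) + np.abs(xs - c)).astype(int)
```

In use, each integer index would be mapped to a (shared) embedding vector and added to the corresponding patch token, in place of, or alongside, the learnable PE.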