Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection. However, in instance-level image retrieval, transformers have not yet shown performance competitive with convolutional networks. We propose a number of improvements that make transformers outperform the state of the art for the first time. (1) We show that a hybrid architecture is more effective than plain transformers, by a large margin. (2) We introduce two branches collecting global (classification token) and local (patch tokens) information, from which we form a global image representation. (3) In each branch, we collect multi-layer features from the transformer encoder, corresponding to skip connections across distant layers. (4) We enhance locality of interactions at the deeper layers of the encoder, which is a relative weakness of vision transformers. We train our model on all commonly used training sets and, for the first time, we make fair comparisons separately per training set. In all cases, we outperform previous models based on global representation. Public code is available at https://github.com/dealicious-inc/DToP.
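For intuition, below is a minimal PyTorch sketch of the two-branch idea described in points (2) and (3): the classification token drives a global branch and the patch tokens a local branch, each built from features of several encoder layers before being fused into a single image descriptor. All names and design choices here (softmax layer weighting, mean pooling over patches, summed fusion) are illustrative assumptions for exposition, not the authors' actual DToP implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoBranchDescriptor(nn.Module):
    def __init__(self, dim, num_layers, out_dim=512):
        super().__init__()
        # Learnable weights combining features from multiple encoder layers,
        # a simple stand-in for skip connections across distant layers.
        self.layer_weights = nn.Parameter(torch.ones(num_layers))
        self.global_proj = nn.Linear(dim, out_dim)  # classification-token branch
        self.local_proj = nn.Linear(dim, out_dim)   # patch-token branch

    def forward(self, layer_outputs):
        # layer_outputs: list of tensors of shape (B, 1 + N, D), one per
        # selected transformer encoder layer, with the CLS token first.
        w = torch.softmax(self.layer_weights, dim=0)
        stacked = torch.stack(layer_outputs, dim=0)            # (L, B, 1+N, D)
        fused = (w[:, None, None, None] * stacked).sum(dim=0)  # (B, 1+N, D)

        cls_tok = fused[:, 0]         # global branch: classification token
        patches = fused[:, 1:]        # local branch: patch tokens
        local = patches.mean(dim=1)   # simple average pooling over patches

        desc = self.global_proj(cls_tok) + self.local_proj(local)
        return F.normalize(desc, dim=-1)  # L2-normalized global descriptor


# Usage with dummy features from 4 encoder layers of a ViT-like backbone.
layers = [torch.randn(2, 1 + 196, 768) for _ in range(4)]
model = TwoBranchDescriptor(dim=768, num_layers=4)
print(model(layers).shape)  # torch.Size([2, 512])
```

In this sketch the resulting L2-normalized vector serves as the global image representation used for retrieval by nearest-neighbor search; the actual pooling, fusion, and locality-enhancement mechanisms are those described in the paper and released code.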