Transformers have recently gained increasing attention in computer vision. However, existing studies mostly use Transformers for feature representation learning, e.g., for image classification and dense prediction. In this work, we further investigate the possibility of applying Transformers to image matching and metric learning given pairs of images. We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention. Thus, we design two naive solutions: query-gallery concatenation in ViT, and query-gallery cross-attention in the vanilla Transformer. The latter improves performance, but it is still limited. This implies that the attention mechanism in Transformers is primarily designed for global feature aggregation, which is not naturally suitable for image matching. Accordingly, we propose a new simplified decoder, which drops the full attention implementation with its softmax weighting and keeps only the query-key similarity computation. Additionally, global max pooling and a multilayer perceptron (MLP) head are applied to decode the matching result. This way, the simplified decoder is both computationally more efficient and more effective for image matching. The proposed method, called TransMatcher, achieves state-of-the-art performance in generalizable person re-identification, with up to 6.1% and 5.7% performance gains in Rank-1 and mAP, respectively, on several popular datasets. The source code of this study will be made publicly available.
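To make the simplified-decoder idea concrete, below is a minimal PyTorch sketch. It only illustrates the core mechanism described above: raw query-key similarity without softmax weighting or value aggregation, followed by global max pooling and an MLP head. The class name `SimplifiedMatchingDecoder`, the shared projection layer, the pooling direction, and the sizes (`num_tokens=192`, `hidden=512`) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class SimplifiedMatchingDecoder(nn.Module):
    """Sketch of a simplified decoder for image matching: keep only the
    query-key similarity computation (no softmax weighting, no value
    aggregation), then apply global max pooling and an MLP head to
    decode a scalar matching score."""

    def __init__(self, dim: int = 512, num_tokens: int = 192, hidden: int = 512):
        super().__init__()
        # Shared linear projection of both images' tokens (an assumption;
        # the exact projection scheme may differ in the paper).
        self.proj = nn.Linear(dim, dim)
        # MLP head over the pooled per-token similarities.
        self.mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, query_tokens: torch.Tensor,
                gallery_tokens: torch.Tensor) -> torch.Tensor:
        # query_tokens, gallery_tokens: [B, N, D] encoder outputs for a
        # batch of B query-gallery image pairs, N = h*w tokens each.
        q = self.proj(query_tokens)            # [B, N, D]
        k = self.proj(gallery_tokens)          # [B, N, D]
        sim = torch.bmm(q, k.transpose(1, 2))  # [B, N, N] query-key similarities
        pooled, _ = sim.max(dim=2)             # global max pooling over gallery tokens -> [B, N]
        return self.mlp(pooled).squeeze(-1)    # [B] matching scores


# Usage: score a batch of 4 query-gallery pairs, 192 tokens of dim 512 each.
decoder = SimplifiedMatchingDecoder(dim=512, num_tokens=192)
q = torch.randn(4, 192, 512)
g = torch.randn(4, 192, 512)
print(decoder(q, g).shape)  # torch.Size([4])
```

Note the contrast with standard cross-attention: there is no `softmax(sim)` and no weighted sum over value vectors, so the similarity scores are consumed directly as the matching evidence, which is what makes this decoder cheaper than a full attention layer.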