Deep learning has driven tremendous growth in hashing techniques for image retrieval. Recently, the Transformer has emerged as a new architecture that relies on self-attention without convolution. The Transformer has also been extended to the Vision Transformer (ViT) for visual recognition, with promising performance on ImageNet. In this paper, we propose a Vision Transformer based Hashing (VTS) for image retrieval. We use a ViT pre-trained on ImageNet as the backbone network and add a hashing head. The proposed VTS model is fine-tuned for hashing under six different image retrieval frameworks, including Deep Supervised Hashing (DSH), HashNet, GreedyHash, Improved Deep Hashing Network (IDHN), Deep Polarized Network (DPN), and Central Similarity Quantization (CSQ), using their respective objective functions. We perform extensive experiments on the CIFAR10, ImageNet, NUS-Wide, and COCO datasets. The proposed VTS based image retrieval outperforms recent state-of-the-art hashing techniques by a significant margin. We also find that the proposed VTS model is better as a backbone network than existing networks such as AlexNet and ResNet.
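To make the architecture concrete, below is a minimal sketch of the VTS idea: a pre-trained ViT backbone with its classifier removed, followed by a hashing head that maps the ViT feature to a fixed number of bits. This is an illustrative assumption, not the authors' released code; it assumes the `timm` library for the pre-trained ViT, and the `VTSHashing` class name and `hash_bits` parameter are hypothetical. The hashing head here is a single linear layer with a tanh activation, a common choice in deep hashing; the actual head and loss depend on which of the six frameworks (DSH, HashNet, GreedyHash, IDHN, DPN, CSQ) is used for fine-tuning.

```python
import torch
import torch.nn as nn
import timm  # assumed dependency; any ImageNet pre-trained ViT would serve


class VTSHashing(nn.Module):
    """Sketch of VTS: a pre-trained ViT backbone plus a hashing head (names hypothetical)."""

    def __init__(self, hash_bits: int = 64):
        super().__init__()
        # Pre-trained ViT-B/16 with the classification head removed (num_classes=0),
        # so the backbone returns the pooled feature vector for each image.
        self.backbone = timm.create_model(
            "vit_base_patch16_224", pretrained=True, num_classes=0
        )
        # Hashing head: project the ViT feature to `hash_bits` real values;
        # tanh pushes outputs toward +/-1 so that sign() yields binary codes.
        self.hash_head = nn.Sequential(
            nn.Linear(self.backbone.num_features, hash_bits),
            nn.Tanh(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.hash_head(self.backbone(x))


model = VTSHashing(hash_bits=64)
images = torch.randn(2, 3, 224, 224)          # dummy batch of two 224x224 RGB images
continuous_codes = model(images)              # fed to the hashing objective during fine-tuning
binary_codes = torch.sign(continuous_codes)   # binary codes used for retrieval at test time
```

During fine-tuning, the continuous outputs are trained with the chosen framework's objective (e.g., pairwise similarity loss for DSH or central similarity loss for CSQ); at retrieval time, the signs of the outputs give compact binary codes compared via Hamming distance.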