Deep Hamming hashing has gained growing popularity in approximate nearest neighbour search for large-scale image retrieval. Until now, deep hashing for image retrieval has been dominated by convolutional neural network architectures, e.g. \texttt{Resnet}\cite{he2016deep}. In this paper, inspired by recent advances in vision transformers, we present \textbf{Transhash}, a pure transformer-based framework for deep hashing learning. Concretely, our framework is composed of two major modules: (1) Based on the \textit{Vision Transformer} (ViT), we design a siamese vision transformer backbone for image feature extraction. To learn fine-grained features, we introduce a dual-stream feature learning module on top of the transformer that captures discriminative global and local features. (2) In addition, we adopt a Bayesian learning scheme with a dynamically constructed similarity matrix to learn compact binary hash codes. The entire framework is jointly trained in an end-to-end manner.~To the best of our knowledge, this is the first work to tackle deep hashing learning without convolutional neural networks (\textit{CNNs}). We perform comprehensive experiments on three widely-studied datasets: \textbf{CIFAR-10}, \textbf{NUSWIDE} and \textbf{IMAGENET}. The experiments demonstrate the superiority of our method over existing state-of-the-art deep hashing methods. Specifically, we achieve 8.2\%, 2.6\% and 12.7\% performance gains in terms of average \textit{mAP} across different hash bit lengths on the three public datasets, respectively.