This paper introduces a plug-and-play descriptor that can be adopted for image retrieval tasks without prior initialization or preparation. The description method builds on the recently proposed Vision Transformer and requires no training data to adjust parameters. In image retrieval, handcrafted global and local descriptors have, over recent years, been successfully replaced by Convolutional Neural Network (CNN)-based methods. However, the experimental evaluation conducted in this paper on several benchmark datasets, against 36 state-of-the-art descriptors from the literature, demonstrates that a network containing no convolutional layers, such as the Vision Transformer, can form a global descriptor and achieve competitive results. As fine-tuning is not required, the low complexity of the presented methodology encourages adoption of the architecture as an image retrieval baseline, replacing the traditional and widely adopted CNN-based approaches and inaugurating a new era in image retrieval.
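Since the abstract describes the approach only at a high level, the following minimal sketch illustrates one way a training-free ViT global descriptor of this kind could be formed. The timm library, the ViT-B/16 checkpoint, and the use of the [CLS] token embedding as the descriptor are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' exact pipeline): extracting a global image
# descriptor from a pretrained Vision Transformer without any fine-tuning.
# Assumptions: timm, the ViT-B/16 checkpoint, and the [CLS] token choice are
# illustrative, not taken from the paper.
import timm
import torch
from PIL import Image
from timm.data import resolve_data_config, create_transform

# Load a ViT pretrained on ImageNet; no parameter adjustment is performed.
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()

# Build the preprocessing pipeline matching the checkpoint.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

@torch.no_grad()
def global_descriptor(image_path: str) -> torch.Tensor:
    """Return an L2-normalized global descriptor for one image."""
    image = Image.open(image_path).convert("RGB")
    batch = transform(image).unsqueeze(0)   # (1, 3, 224, 224)
    tokens = model.forward_features(batch)  # (1, 197, 768) token embeddings
    descriptor = tokens[:, 0]               # [CLS] token as the global summary
    return torch.nn.functional.normalize(descriptor, dim=-1).squeeze(0)

# Retrieval then reduces to ranking database images by cosine similarity,
# i.e. the dot product of L2-normalized query and database descriptors.
```

Because the descriptors are L2-normalized, a query against a database of N images is a single matrix-vector product, which is what makes such a training-free descriptor attractive as a baseline.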