Recent studies on the image retrieval task have shown that ensembling different models and combining multiple global descriptors lead to performance improvements. However, training multiple models for an ensemble is not only difficult but also inefficient in terms of time and memory. In this paper, we propose a novel framework that exploits multiple global descriptors to obtain an ensemble-like effect while remaining trainable in an end-to-end manner. The proposed framework is flexible and extensible with respect to the global descriptor, CNN backbone, loss, and dataset. Moreover, we investigate the effectiveness of combining multiple global descriptors through quantitative and qualitative analysis. Our extensive experiments show that the combined descriptor outperforms a single global descriptor, as it can exploit different types of feature properties. In the benchmark evaluation, the proposed framework achieves state-of-the-art performance on the CARS196, CUB200-2011, In-shop Clothes, and Stanford Online Products image retrieval benchmarks by a large margin compared to competing approaches. Our model implementations and pretrained models are publicly available.
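The core idea of combining multiple global descriptors can be illustrated with a minimal sketch. The abstract does not name the specific descriptors, so the choice of SPoC (average pooling), MAC (max pooling), and GeM (generalized-mean pooling) below is an assumption for illustration; each descriptor is L2-normalized and the results are concatenated into a single combined descriptor:

```python
import numpy as np

def spoc(x):
    # SPoC: sum/average pooling over the spatial dims of a C x H x W feature map
    return x.mean(axis=(1, 2))

def mac(x):
    # MAC: maximum activation over the spatial dims
    return x.max(axis=(1, 2))

def gem(x, p=3.0, eps=1e-6):
    # GeM: generalized-mean pooling; p=1 recovers SPoC, p -> inf approaches MAC
    return (np.clip(x, eps, None) ** p).mean(axis=(1, 2)) ** (1.0 / p)

def l2n(v, eps=1e-12):
    # L2-normalize a descriptor vector
    return v / (np.linalg.norm(v) + eps)

def combined_descriptor(x):
    # Concatenate the normalized descriptors, then normalize the combination
    # (hypothetical composition for illustration, not the paper's exact recipe)
    return l2n(np.concatenate([l2n(spoc(x)), l2n(mac(x)), l2n(gem(x))]))

# Example: a 512-channel 7x7 feature map yields a 3 * 512 = 1536-dim descriptor
fmap = np.random.default_rng(0).random((512, 7, 7)).astype(np.float32)
desc = combined_descriptor(fmap)
print(desc.shape)  # (1536,)
```

In a trained model, each pooled descriptor would typically pass through its own learned projection before concatenation, which is what allows the branches to specialize and produce the ensemble-like effect described above.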