Generative modeling is ubiquitous in natural language processing and computer vision, yet its application to image retrieval remains unexplored. In this paper, we recast image retrieval as a form of generative modeling using a sequence-to-sequence model, contributing to the current trend toward unified models. Our framework, IRGen, is a unified model that enables end-to-end differentiable search and achieves superior performance through direct optimization. In developing IRGen, we tackle the key technical challenge of converting an image into a short sequence of semantic units to enable efficient and effective retrieval. Experiments show that our model yields significant improvements on three commonly used benchmarks; for example, on the In-shop dataset its precision@10 is 22.9% higher than that of the best baseline method, with a comparable recall@10 score.
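To make the two reported metrics concrete, the following is a minimal sketch of how precision@K and recall@K might be computed for a single query. It assumes the convention common in image retrieval benchmarks, where recall@K records whether at least one relevant item appears in the top K, while precision@K is the fraction of the top K that is relevant; the abstract does not spell out the definitions, so this is an assumption. The function names and the toy identifiers are hypothetical and used only for illustration.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """1.0 if at least one relevant item is in the top k, else 0.0.

    Assumed 'hit' convention; other definitions of recall@k exist.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 if any(item in relevant_ids for item in ranked_ids[:k]) else 0.0


# Toy example (hypothetical ids): two rankings for the same query.
relevant = {"r1", "r2", "r3"}
ranking_a = ["r1", "n1", "n2", "n3"]  # one relevant item in the top 4
ranking_b = ["r1", "r2", "r3", "n1"]  # three relevant items in the top 4

print(precision_at_k(ranking_a, relevant, 4), recall_at_k(ranking_a, relevant, 4))  # 0.25 1.0
print(precision_at_k(ranking_b, relevant, 4), recall_at_k(ranking_b, relevant, 4))  # 0.75 1.0
```

Under this convention, both rankings receive the same recall@4, but the second has three times the precision@4, which illustrates how a method can improve precision@10 substantially while its recall@10 stays comparable.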