Despite flourishing progress in performance, current image-text retrieval methods suffer from querying time complexity that grows with the gallery size $N$, which hinders their application in practice. Targeting efficiency, this paper presents a simple and effective keyword-guided pre-screening framework for image-text retrieval. Specifically, we convert the image and text data into keywords and perform keyword matching across modalities to exclude a large number of irrelevant gallery samples before they reach the retrieval network. For keyword prediction, we cast it as a multi-label classification problem and propose a multi-task learning scheme that appends multi-label classifiers to the image-text retrieval network, yielding lightweight yet accurate keyword prediction. For keyword matching, we introduce the inverted index from search engines, which benefits both the time and space complexity of the pre-screening. Extensive experiments on two widely-used datasets, i.e., Flickr30K and MS-COCO, verify the effectiveness of the proposed framework. Equipped with only two embedding layers, the proposed framework achieves $O(1)$ querying time complexity; when applied before common image-text retrieval methods, it improves retrieval efficiency while preserving their performance. Our code will be released.
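To make the pre-screening idea concrete, the following is a minimal sketch of keyword matching with an inverted index, assuming the multi-label classifiers have already produced a keyword set per sample; the function names (`build_inverted_index`, `pre_screen`) are illustrative and not part of the paper's released code.

```python
# Sketch of keyword-guided pre-screening via an inverted index.
# Assumption: keywords per sample come from the multi-label classifiers
# described in the paper; all names here are hypothetical.
from collections import defaultdict


def build_inverted_index(gallery_keywords):
    """Map each keyword to the set of gallery sample IDs containing it.

    gallery_keywords: dict[sample_id, set[str]] of predicted keywords.
    """
    index = defaultdict(set)
    for sample_id, keywords in gallery_keywords.items():
        for kw in keywords:
            index[kw].add(sample_id)
    return index


def pre_screen(query_keywords, index):
    """Return gallery IDs sharing at least one keyword with the query.

    The lookup cost depends on the number of query keywords, not on the
    gallery size N, so per-query screening is effectively O(1) in N.
    """
    candidates = set()
    for kw in query_keywords:
        candidates |= index.get(kw, set())
    return candidates


# Usage: only the surviving candidates are scored by the retrieval network.
gallery = {0: {"dog", "grass"}, 1: {"car", "street"}, 2: {"dog", "ball"}}
index = build_inverted_index(gallery)
print(pre_screen({"dog"}, index))  # -> {0, 2}
```

In this sketch, the index stores only keyword-to-ID postings rather than dense cross-modal similarities, which is where the favorable trade-off in both time and space comes from.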