Image-text retrieval (ITR) is the task of retrieving relevant images or texts given a query from the other modality. The conventional dense retrieval paradigm encodes images and texts into dense representations with dual-stream encoders, but it suffers from low retrieval speed in large-scale retrieval scenarios. In this work, we propose a lexicon-weighting paradigm, which learns sparse representations in the vocabulary space for images and texts, so that retrieval can exploit bag-of-words models and efficient inverted indexes, significantly reducing retrieval latency. A crucial gap arises between the continuous nature of image data and the requirement of a sparse vocabulary-space representation. To bridge this gap, we introduce a novel pre-training framework, Lexicon-Bottlenecked Language-Image Pre-training (LexLIP), which learns importance-aware lexicon representations. The framework places lexicon-bottlenecked modules between the dual-stream encoders and weakened text decoders, constructing continuous bag-of-words bottlenecks that learn lexicon-importance distributions. Pre-trained on same-scale data, LexLIP achieves state-of-the-art performance on two benchmark ITR datasets, MSCOCO and Flickr30k. Furthermore, in large-scale retrieval scenarios, LexLIP outperforms CLIP with 5.5x to 221.3x faster retrieval speed and 13.2x to 48.8x less index storage memory.
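To make the retrieval-speed claim concrete, the following is a minimal sketch of how sparse lexicon-weighted representations enable inverted-index retrieval. All document IDs, terms, and weights below are toy values for illustration; in LexLIP such weights would be produced by the pre-trained image/text encoders.

```python
from collections import defaultdict

# Hypothetical sparse lexicon weights: doc_id -> {vocabulary term: weight}.
# Only non-zero entries exist, which is what makes the representation sparse.
doc_weights = {
    "img_0": {"dog": 1.8, "grass": 0.9},
    "img_1": {"cat": 2.1, "sofa": 1.2},
    "img_2": {"dog": 0.7, "beach": 1.5},
}

# Inverted index: term -> list of (doc_id, weight). A query only touches
# the posting lists of its own non-zero terms, instead of scoring every
# document against a dense vector, hence the large speedup at scale.
index = defaultdict(list)
for doc_id, weights in doc_weights.items():
    for term, w in weights.items():
        index[term].append((doc_id, w))

def search(query_weights, top_k=2):
    """Score documents by the dot product of sparse query/doc weights."""
    scores = defaultdict(float)
    for term, qw in query_weights.items():
        for doc_id, dw in index.get(term, []):
            scores[doc_id] += qw * dw
    return sorted(scores.items(), key=lambda item: -item[1])[:top_k]

# A sparse text query touching two posting lists ("dog" and "beach").
results = search({"dog": 1.0, "beach": 0.5})
```

Here `search` visits only the posting lists of the query's non-zero terms, so its cost scales with the number of matching postings rather than the corpus size times the embedding dimension, illustrating why sparse lexicon representations pair naturally with inverted indexes.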