In the past few years, the emergence of vision-language pre-training (VLP) has brought cross-modal retrieval into a new era. However, due to its latency and computation demands, VLP is challenging to apply in a real-time online retrieval system. To alleviate this defect, this paper proposes \textbf{Hi}erarchical \textbf{V}ision-\textbf{L}anguage \textbf{P}re-Training (\textbf{HiVLP}) for fast Image-Text Retrieval (ITR). Specifically, we design a novel hierarchical retrieval objective that uses representations of different dimensions for coarse-to-fine ITR, i.e., low-dimensional representations for large-scale coarse retrieval and high-dimensional representations for small-scale fine retrieval. We evaluate the proposed HiVLP on two popular image-text retrieval benchmarks, i.e., Flickr30K and COCO. Extensive experiments demonstrate that HiVLP not only offers fast inference but also scales easily to large-scale ITR scenarios. The detailed results show that HiVLP is $1,427$$\sim$$120,649\times$ faster than the fusion-based model UNITER and $2$$\sim$$5\times$ faster than the fastest embedding-based model LightningDOT under different candidate-set sizes. It also achieves about +4.9 AR on COCO and +3.8 AR on Flickr30K over LightningDOT, and reaches comparable performance with the state-of-the-art (SOTA) fusion-based model METER.
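The coarse-to-fine idea above can be illustrated with a minimal sketch. This is not the authors' implementation; the gallery sizes, dimensions, and shortlist size ($k$) are hypothetical, and random vectors stand in for learned HiVLP embeddings. Stage one scores the whole gallery with cheap low-dimensional vectors to build a shortlist; stage two re-ranks only the shortlist with high-dimensional vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical gallery of 10,000 items: each has a low-dimensional
# "coarse" embedding and a high-dimensional "fine" embedding.
N, D_COARSE, D_FINE = 10_000, 32, 256
coarse_gallery = rng.standard_normal((N, D_COARSE)).astype(np.float32)
fine_gallery = rng.standard_normal((N, D_FINE)).astype(np.float32)


def l2_normalize(x, axis=-1):
    """Normalize vectors so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


coarse_gallery = l2_normalize(coarse_gallery)
fine_gallery = l2_normalize(fine_gallery)


def hierarchical_retrieve(q_coarse, q_fine, k_coarse=100, k_fine=10):
    """Coarse-to-fine retrieval: shortlist with cheap low-dim scores,
    then re-rank the shortlist with the expensive high-dim scores."""
    # Stage 1: large-scale coarse retrieval over the entire gallery.
    coarse_scores = coarse_gallery @ l2_normalize(q_coarse)
    shortlist = np.argpartition(-coarse_scores, k_coarse)[:k_coarse]
    # Stage 2: small-scale fine retrieval over the shortlist only.
    fine_scores = fine_gallery[shortlist] @ l2_normalize(q_fine)
    order = np.argsort(-fine_scores)[:k_fine]
    return shortlist[order], fine_scores[order]


query_coarse = rng.standard_normal(D_COARSE).astype(np.float32)
query_fine = rng.standard_normal(D_FINE).astype(np.float32)
top_ids, top_scores = hierarchical_retrieve(query_coarse, query_fine)
```

The speedup comes from the asymmetry: the full gallery is scored only in the cheap low-dimensional space, while the costly high-dimensional comparison touches a fixed-size shortlist regardless of gallery size.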