Multimodal pre-training has propelled great advancement in vision-and-language research. These large-scale pre-trained models, although successful, suffer from slow inference speed due to the enormous computation cost, mainly from cross-modal attention in the Transformer architecture. When applied to real-life applications, such latency and computation demand severely deter the practical use of pre-trained models. In this paper, we study image-text retrieval (ITR), the most mature scenario of V+L applications, which had been widely studied even prior to the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT, that accelerates the inference time of ITR by thousands of times without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by pre-training on three novel learning objectives, extracting feature indexes offline, and employing instant dot-product matching with further re-ranking, which significantly speeds up the retrieval process. LightningDOT achieves new state of the art across multiple ITR benchmarks such as Flickr30k, COCO and Multi30K, outperforming existing pre-trained models that consume 1000× more computation time. Code and pre-training checkpoints are available at https://github.com/intersun/LightningDOT.
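The two-stage retrieval scheme described above (offline feature indexing, instant dot-product matching, then re-ranking a small candidate set) can be illustrated with a minimal sketch. This is not the authors' implementation: the random unit vectors stand in for embeddings produced by the pre-trained image and text encoders, and the dimensions and top-k value are arbitrary.

```python
import numpy as np

# Minimal sketch of dot-product retrieval over a precomputed index.
# Assumes embeddings are L2-normalized, so dot product = cosine similarity.
rng = np.random.default_rng(0)
d, num_images = 256, 10_000

# Offline stage: encode every image once and cache the embeddings.
# Random unit vectors stand in for the pre-trained image encoder here.
image_index = rng.standard_normal((num_images, d)).astype(np.float32)
image_index /= np.linalg.norm(image_index, axis=1, keepdims=True)

def retrieve(query_emb: np.ndarray, top_k: int = 20) -> np.ndarray:
    """First-stage retrieval: a single matrix-vector dot product scores
    the query against all cached image embeddings, with no cross-modal
    attention at query time."""
    scores = image_index @ query_emb                 # (num_images,)
    # Partial sort keeps the top-k candidates without ordering the rest.
    cand = np.argpartition(-scores, top_k)[:top_k]
    return cand[np.argsort(-scores[cand])]           # order the candidates

# Online stage: embed the text query (random stand-in here) and retrieve.
query = rng.standard_normal(d).astype(np.float32)
query /= np.linalg.norm(query)
candidates = retrieve(query)
# A heavier cross-attention model could now re-rank just these top-k
# candidates, recovering accuracy at a fraction of the full cost.
print(candidates[:5])
```

Because the expensive encoders run only offline, the online cost per query is one matrix-vector product over the index, which is what enables the thousands-fold speedup over full cross-attention scoring.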