In cross-modal retrieval, single-encoder models tend to outperform dual-encoder models, but they suffer from high latency and low throughput. In this paper, we present a dual-encoder model called BagFormer that uses a cross-modal interaction mechanism to improve recall without sacrificing latency or throughput. BagFormer achieves this through bag-wise interactions, which transform text into a more appropriate granularity and incorporate entity knowledge into the model. Our experiments demonstrate that BagFormer achieves results comparable to state-of-the-art single-encoder models on cross-modal retrieval tasks, while also offering efficient training and inference with 20.72 times lower latency and 25.74 times higher throughput.
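Since the abstract only names the mechanism, the following is a minimal sketch of what a bag-wise late interaction could look like: text tokens are pooled into entity-level "bags," and each bag is matched against image patch embeddings. The mean pooling, MaxSim aggregation, and all function names here are illustrative assumptions, not BagFormer's exact formulation.

```python
import torch
import torch.nn.functional as F


def bag_wise_score(text_tok_emb, bag_ids, image_patch_emb):
    """Hypothetical bag-wise interaction score for one text-image pair.

    text_tok_emb:    (T, D) token embeddings from the text encoder
    bag_ids:         (T,)   bag index per token (tokens of one entity
                            n-gram share an id) -- an assumed input
    image_patch_emb: (P, D) patch embeddings from the image encoder
    """
    # Pool tokens into bag embeddings (mean pooling is an assumption).
    num_bags = int(bag_ids.max().item()) + 1
    bags = torch.zeros(num_bags, text_tok_emb.size(1))
    bags.index_add_(0, bag_ids, text_tok_emb)
    counts = torch.bincount(bag_ids, minlength=num_bags).clamp(min=1)
    bags = bags / counts.unsqueeze(1)

    # Late interaction: each bag matches its best-aligned image patch
    # (ColBERT-style MaxSim), summed over bags.
    bags = F.normalize(bags, dim=-1)
    patches = F.normalize(image_patch_emb, dim=-1)
    sim = bags @ patches.T                # (num_bags, P)
    return sim.max(dim=1).values.sum()


# Usage with random stand-in embeddings:
score = bag_wise_score(
    torch.randn(12, 256),                     # 12 text tokens
    torch.tensor([0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4]),
    torch.randn(49, 256),                     # 7x7 image patches
)
```

Because the two encoders stay independent until this cheap dot-product stage, image and bag embeddings can be precomputed and indexed offline, which is what preserves dual-encoder latency and throughput.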