In this paper, we consider the problem of improving the inference latency of language-model-based dense retrieval systems by introducing structural compression and model size asymmetry between the context and query encoders. First, we investigate the impact of pre- and post-training compression on the MSMARCO, Natural Questions, TriviaQA, SQuAD, and SciFact datasets, finding that asymmetry between the dual encoders of a dense retriever can improve inference efficiency. Building on this finding, we introduce Kullback-Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference speed of dense retrieval by pruning and aligning the query encoder after training. Specifically, KALE extends traditional knowledge distillation to the period after bi-encoder training, allowing effective query encoder compression without full retraining or regeneration of the document index. Using KALE and asymmetric training, we can generate models that exceed the performance of DistilBERT despite having 3x faster inference.
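To make the alignment step concrete, below is a minimal PyTorch sketch of a KL-based embedding alignment loss: a pruned (student) query encoder is trained to match the frozen, fully trained (teacher) query encoder, leaving the context encoder and document index untouched. The softmax-over-embedding-dimensions normalization and the temperature scaling are illustrative assumptions, not necessarily the paper's exact formulation; `student` and `teacher` are hypothetical encoder modules.

```python
import torch
import torch.nn.functional as F

def kale_alignment_loss(student_emb: torch.Tensor,
                        teacher_emb: torch.Tensor,
                        temperature: float = 1.0) -> torch.Tensor:
    """KL alignment between a pruned (student) query encoder's embeddings
    and the frozen original (teacher) query encoder's embeddings.

    Both tensors have shape (batch_size, embed_dim). Normalizing each
    embedding with a softmax over its dimensions is one plausible way to
    turn embeddings into distributions for the KL term; it is an
    assumption for illustration only.
    """
    student_logp = F.log_softmax(student_emb / temperature, dim=-1)
    teacher_p = F.softmax(teacher_emb / temperature, dim=-1)
    # batchmean KL, scaled by T^2 as in standard distillation practice
    return F.kl_div(student_logp, teacher_p,
                    reduction="batchmean") * temperature ** 2

# Usage sketch (hypothetical encoders): freeze the teacher, then train
# only the pruned student on query text, so the document index built with
# the original context encoder can be reused as-is.
#   teacher.eval()
#   for p in teacher.parameters():
#       p.requires_grad_(False)
#   loss = kale_alignment_loss(student(queries), teacher(queries).detach())
```

Because only the query encoder is updated, no new passage embeddings need to be computed, which is what lets KALE avoid full retraining and index regeneration.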