In this paper, we consider the problem of improving the inference latency of language model-based dense retrieval systems by introducing structural compression and model size asymmetry between the context and query encoders. First, we investigate the impact of pre- and post-training compression on MSMARCO, Natural Questions, TriviaQA, SQuAD, and SciFact, finding that asymmetry between the dual encoders in dense retrieval can lead to improved inference efficiency. Building on this finding, we introduce Kullback-Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference efficiency of dense retrieval methods by pruning and aligning the query encoder after training. Specifically, KALE extends traditional knowledge distillation to the post-training stage of bi-encoders, allowing for effective query encoder compression without full retraining or index generation. Using KALE and asymmetric training, we can generate models that exceed the performance of DistilBERT despite having 3x faster inference.
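To make the post-training alignment idea concrete, the following is a minimal sketch of the kind of procedure the abstract describes: structurally prune a trained query encoder, then train the pruned copy to match the original encoder's query embeddings with a KL-based objective. The encoder architecture, pooling choice, loss form, and all names (prune_layers, kl_alignment_loss) are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of KALE-style post-training query-encoder compression.
# Assumptions: a trained "teacher" query encoder, a pruned "student" copy,
# and a stream of query representations to align on. Toy data is used here.

import copy
import torch
import torch.nn.functional as F
from torch import nn


def prune_layers(encoder: nn.TransformerEncoder, keep: int) -> nn.TransformerEncoder:
    """Structurally compress the encoder by keeping only its first `keep` layers."""
    pruned = copy.deepcopy(encoder)
    pruned.layers = nn.ModuleList(list(pruned.layers)[:keep])
    pruned.num_layers = keep
    return pruned


def kl_alignment_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """One possible KL-based alignment objective (an assumption about the exact
    form): treat each embedding as a softmax-normalized distribution and push
    the pruned encoder's output toward the original encoder's output."""
    log_p = F.log_softmax(student_emb, dim=-1)
    q = F.softmax(teacher_emb, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")


# Toy setup: a 12-layer "teacher" query encoder, pruned to 3 layers.
d_model = 128
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
teacher = nn.TransformerEncoder(layer, num_layers=12)
student = prune_layers(teacher, keep=3)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
teacher.eval()

for step in range(100):
    # Stand-in for a batch of embedded query token sequences (batch, seq, dim).
    queries = torch.randn(32, 16, d_model)
    with torch.no_grad():
        t_emb = teacher(queries)[:, 0]  # first-token pooled query embedding
    s_emb = student(queries)[:, 0]
    loss = kl_alignment_loss(s_emb, t_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because only the query encoder is replaced, the context encoder and the document index built from it stay untouched, which is what lets this alignment step avoid full retraining or index generation.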