Retrieval and ranking models are the backbone of many applications such as web search, open-domain QA, and text-based recommender systems. The query-time latency of neural ranking models depends largely on their architecture and on deliberate design choices that trade off effectiveness for higher efficiency. The focus of a growing number of efficient ranking architectures on low query latency makes them feasible for production deployment. In machine learning, an increasingly common approach to closing the effectiveness gap of more efficient models is knowledge distillation from a large teacher model to a smaller student model. We find that different ranking architectures tend to produce output scores of different magnitudes. Based on this finding, we propose a cross-architecture training procedure with a margin-focused loss (Margin-MSE) that adapts knowledge distillation to the varying score output distributions of different BERT and non-BERT passage ranking architectures. We apply the teacher scores as additional fine-grained labels to existing training triples of the MSMARCO-Passage collection. We evaluate our procedure by distilling knowledge from state-of-the-art concatenated BERT models to four different efficient architectures (TK, ColBERT, PreTT, and a BERT CLS dot product model). We show that, across all evaluated architectures, our Margin-MSE knowledge distillation significantly improves re-ranking effectiveness without compromising efficiency. Additionally, we show that our general distillation method improves nearest-neighbor index retrieval with the BERT dot product model, offering results competitive with specialized and much more costly training methods. To benefit the community, we publish the teacher-score training files in a ready-to-use package.
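To make the Margin-MSE objective concrete, the following is a minimal PyTorch sketch of the loss as described above: instead of matching raw student scores to raw teacher scores, it matches the score margin between a relevant and a non-relevant passage, which makes the supervision signal invariant to the differing output scales of the student and teacher architectures. The tensor names and the batch layout are illustrative assumptions, not the authors' reference implementation.

    import torch
    import torch.nn as nn

    def margin_mse_loss(student_pos: torch.Tensor,
                        student_neg: torch.Tensor,
                        teacher_pos: torch.Tensor,
                        teacher_neg: torch.Tensor) -> torch.Tensor:
        """Margin-MSE knowledge distillation loss.

        student_pos / student_neg: student scores for the relevant and
            non-relevant passage of each training triple (shape: [batch]).
        teacher_pos / teacher_neg: pre-computed teacher scores for the
            same triples (shape: [batch]).

        The MSE is applied to the *margins* (pos - neg), so students whose
        scores live on a different scale than the teacher's can still be
        supervised by the same teacher-score file.
        """
        student_margin = student_pos - student_neg
        teacher_margin = teacher_pos - teacher_neg
        return nn.functional.mse_loss(student_margin, teacher_margin)

    # Hypothetical usage inside a training step, assuming `student_model`
    # returns a relevance score per (query, passage) pair and the teacher
    # scores were loaded from the published training files:
    # loss = margin_mse_loss(student_model(q, p_pos), student_model(q, p_neg),
    #                        t_pos_scores, t_neg_scores)
    # loss.backward()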