Although BERT-based ranking models have been widely adopted in commercial search engines, they are typically too time-consuming for online ranking. Knowledge distillation, which aims to learn a smaller model with performance comparable to a larger one, is a common strategy for reducing online inference latency. In this paper, we investigate the effect of different loss functions on the uniform-architecture distillation of BERT-based ranking models. Here, "uniform-architecture" means that both the teacher and the student models adopt the cross-encoder architecture, where the student models are built on small-scale pre-trained language models. Our experimental results reveal that the optimal distillation configuration for ranking tasks differs substantially from that for general natural language processing tasks. Specifically, when the student models use the cross-encoder architecture, a pairwise loss on hard labels is critical for training them, whereas distillation objectives on intermediate Transformer layers may hurt performance. These findings emphasize the necessity of carefully designing a distillation strategy for cross-encoder student models that is tailored to document ranking with pairwise training samples.
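To make the key finding concrete, the following is a minimal sketch (not the paper's exact objective) of how a pairwise loss on hard labels can be combined with a soft-label distillation term when training a cross-encoder student. The hinge formulation, the use of score differences for the soft term, and the hyperparameters `margin` and `alpha` are illustrative assumptions rather than the authors' reported configuration.

```python
# Sketch: pairwise hard-label loss + pointwise teacher-score distillation
# for a cross-encoder student. Hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F


def distill_ranking_loss(student_pos, student_neg, teacher_pos, teacher_neg,
                         margin=1.0, alpha=0.5):
    """student_*/teacher_*: relevance scores of shape (batch,) for the
    positive and negative document of each query in a pairwise sample."""
    # Pairwise hinge loss on hard labels: the relevant document should
    # outscore the non-relevant one by at least `margin`.
    pairwise = F.relu(margin - (student_pos - student_neg)).mean()

    # Soft-label distillation: match the student's pairwise score gap
    # to the teacher's (MSE on score differences, teacher detached).
    soft = F.mse_loss(student_pos - student_neg,
                      (teacher_pos - teacher_neg).detach())

    return alpha * pairwise + (1.0 - alpha) * soft
```

In this sketch, both loss terms operate only on the final relevance scores; consistent with the abstract's finding, no objective is placed on intermediate Transformer layers.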