Deep learning for Information Retrieval (IR) requires large amounts of high-quality query-document relevance labels, but such labels are inherently sparse. Label smoothing redistributes some of the observed probability mass over unobserved instances, often uniformly and uninformed of the true distribution. In contrast, we propose knowledge distillation for informed labeling, without incurring high computation overhead at evaluation time. Our contribution is a simple but efficient teacher model that leverages collective knowledge to outperform state-of-the-art models distilled from a more complex teacher. Specifically, we train up to 8× faster than the state-of-the-art teacher while distilling rankings better. Our code is publicly available at https://github.com/jihyukkim-nlp/CollectiveKD.
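To make the contrast concrete, below is a minimal sketch, not the paper's implementation, of uniform label smoothing versus teacher-informed soft labels for a set of candidate documents. The function names, the mixing weight `epsilon`, and the `temperature` are illustrative assumptions, not values or APIs from the released code.

```python
import torch

def smooth_labels(relevance: torch.Tensor, epsilon: float = 0.1) -> torch.Tensor:
    """Uniform label smoothing: move a fraction epsilon of the probability
    mass onto all candidates uniformly, uninformed of which unobserved
    documents are actually relevant. (Illustrative sketch.)"""
    uniform = torch.full_like(relevance, 1.0 / relevance.numel())
    return (1.0 - epsilon) * relevance + epsilon * uniform

def distill_labels(relevance: torch.Tensor, teacher_scores: torch.Tensor,
                   epsilon: float = 0.1, temperature: float = 1.0) -> torch.Tensor:
    """Informed labeling via distillation: the redistributed mass follows a
    teacher's score distribution instead of being uniform. (Illustrative sketch;
    the actual teacher and mixing scheme are defined in the paper/repo.)"""
    teacher_probs = torch.softmax(teacher_scores / temperature, dim=-1)
    return (1.0 - epsilon) * relevance + epsilon * teacher_probs

# Example: 4 candidate documents, one observed as relevant.
relevance = torch.tensor([1.0, 0.0, 0.0, 0.0])
teacher_scores = torch.tensor([2.0, 1.5, -1.0, -2.0])  # hypothetical teacher scores
print(smooth_labels(relevance))                  # mass spread uniformly
print(distill_labels(relevance, teacher_scores)) # mass follows the teacher
```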