Deep learning for Information Retrieval (IR) requires large amounts of high-quality query-document relevance labels, but such labels are inherently sparse. Label smoothing redistributes some of the observed probability mass over unobserved instances, often uniformly and uninformed of the true distribution. In contrast, we propose knowledge distillation for informed labeling, without incurring high computational overhead at evaluation time. Our contribution is a simple but efficient teacher model that leverages collective knowledge, outperforming state-of-the-art models distilled from a more complex teacher. Specifically, we train up to 8x faster than the state-of-the-art teacher while distilling rankings better. Our code is publicly available at https://github.com/jihyukkim-nlp/CollectiveKD
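A minimal sketch of the contrast the abstract draws, assuming a PyTorch setup with per-query candidate lists; the function names, scores, and hyperparameters below are hypothetical illustrations, not the authors' implementation. Uniform label smoothing spreads mass over unobserved documents indiscriminately, whereas distillation targets from a teacher ranker allocate mass according to estimated relevance.

```python
# Hypothetical sketch, not the authors' code: uniform label smoothing vs.
# teacher-informed soft labels used as knowledge-distillation targets.
import torch
import torch.nn.functional as F

def uniform_label_smoothing(one_hot: torch.Tensor, epsilon: float = 0.1) -> torch.Tensor:
    """Redistribute epsilon of the observed probability mass uniformly
    over all candidates, uninformed of the true relevance distribution."""
    num_candidates = one_hot.size(-1)
    return (1.0 - epsilon) * one_hot + epsilon / num_candidates

def teacher_informed_targets(teacher_scores: torch.Tensor, temperature: float = 2.0) -> torch.Tensor:
    """Soft targets from a teacher ranker's relevance scores: unobserved
    documents receive mass in proportion to estimated relevance."""
    return F.softmax(teacher_scores / temperature, dim=-1)

# Toy example: one query with 4 candidate documents, only the first labeled relevant.
one_hot = torch.tensor([[1.0, 0.0, 0.0, 0.0]])
teacher_scores = torch.tensor([[4.0, 2.5, 0.5, -1.0]])  # hypothetical teacher scores

smoothed = uniform_label_smoothing(one_hot)           # [0.925, 0.025, 0.025, 0.025]
distilled = teacher_informed_targets(teacher_scores)  # mass follows estimated relevance

# The student ranker is trained against the informed targets, e.g. with KL divergence.
student_log_probs = F.log_softmax(torch.randn(1, 4), dim=-1)
kd_loss = F.kl_div(student_log_probs, distilled, reduction="batchmean")
```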