Knowledge distillation is an effective way to transfer knowledge from a strong teacher to an efficient student model. Ideally, we would expect that the better the teacher, the better the student. However, this expectation does not always hold: a stronger teacher often produces a worse student after distillation, owing to the non-negligible capacity gap between teacher and student. To bridge this gap, we propose PROD, a PROgressive Distillation method for dense retrieval. PROD combines a teacher progressive distillation with a data progressive distillation to gradually improve the student. We conduct extensive experiments on five widely used benchmarks, MS MARCO Passage, TREC Passage 19, TREC Document 19, MS MARCO Document, and Natural Questions, where PROD achieves state-of-the-art results among distillation methods for dense retrieval. The code and models will be released.
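The progressive teacher and data schedules are what PROD contributes; the underlying per-step objective in score-based distillation for dense retrieval is typically a KL divergence between the teacher's and student's score distributions over a query's candidate passages. The following is a minimal sketch of that standard objective only, not PROD's full procedure; the function and variable names (`kd_loss`, `student_scores`, `teacher_scores`) are illustrative, assuming a PyTorch setting.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_scores: torch.Tensor,
            teacher_scores: torch.Tensor,
            temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student score distributions
    over the candidate passages of each query.

    student_scores, teacher_scores: [batch, num_candidates] relevance scores.
    """
    # Soften both distributions with the same temperature.
    log_p_student = F.log_softmax(student_scores / temperature, dim=-1)
    p_teacher = F.softmax(teacher_scores / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```

In a progressive setup, such a loss would be applied repeatedly as the teacher (or the difficulty of the training data) is gradually strengthened, rather than distilling from the strongest teacher in a single step.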