As a promising solution for model compression, knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency. Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge (\ie \textit{soft labels}) to supervise the learning of a compact student model. However, we find that this standard distillation paradigm incurs a serious bias issue -- popular items are recommended even more heavily after distillation. This effect prevents the student model from making accurate and fair recommendations, decreasing the effectiveness of the RS. In this work, we identify the origin of the bias in KD -- it is rooted in the biased soft labels produced by the teacher, and is further propagated and intensified during distillation. To rectify this, we propose a new KD method with a stratified distillation strategy. It first partitions items into multiple groups according to their popularity, and then extracts the ranking knowledge within each group to supervise the learning of the student. Our method is simple and teacher-agnostic -- it operates at the distillation stage without affecting the training of the teacher model. We conduct extensive theoretical and empirical studies to validate the effectiveness of our proposal. We release our code at: https://github.com/chengang95/UnKD.
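To make the stratified strategy concrete, below is a minimal sketch of popularity-partitioned, within-group ranking distillation. All names (\eg \texttt{partition\_by\_popularity}, \texttt{group\_ranking\_distillation\_loss}) are hypothetical illustrations rather than the released UnKD code, and the pairwise BPR-style distillation loss is an assumption used only to convey the idea of extracting ranking knowledge inside each popularity group.

```python
# Hypothetical sketch of popularity-stratified ranking distillation.
# Not the authors' implementation; names and the pairwise loss are assumptions.
import torch
import torch.nn.functional as F

def partition_by_popularity(item_popularity: torch.Tensor, num_groups: int):
    """Split item ids into num_groups buckets of (roughly) equal size,
    ordered by descending popularity."""
    order = torch.argsort(item_popularity, descending=True)
    return list(torch.chunk(order, num_groups))

def group_ranking_distillation_loss(teacher_scores, student_scores,
                                    groups, pairs_per_group=32):
    """For each popularity group, sample item pairs, use the teacher's
    within-group ranking as the soft label, and penalize the student
    when its pairwise order disagrees (BPR-style surrogate)."""
    loss = teacher_scores.new_zeros(())
    for g in groups:
        if g.numel() < 2:
            continue
        # sample random item pairs from within this popularity group only
        idx = g[torch.randint(0, g.numel(), (pairs_per_group, 2))]
        i, j = idx[:, 0], idx[:, 1]
        # the teacher decides which item of each pair should rank higher
        sign = torch.sign(teacher_scores[i] - teacher_scores[j])
        margin = student_scores[i] - student_scores[j]
        loss = loss - F.logsigmoid(sign * margin).mean()
    return loss / len(groups)

# toy usage with random popularities and scores for 1000 items
pop = torch.randint(1, 500, (1000,)).float()
teacher = torch.randn(1000)                      # frozen teacher scores
student = torch.randn(1000, requires_grad=True)  # student scores being learned
groups = partition_by_popularity(pop, num_groups=4)
loss = group_ranking_distillation_loss(teacher, student, groups)
loss.backward()
```

Because pairs are drawn only from within a single popularity stratum, the distilled ranking signal never asks the student to prefer a popular item over an unpopular one merely because of popularity, which is the intuition behind the debiasing effect described above.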