Knowledge distillation is an efficient strategy for using data generated by large "teacher" language models to train smaller, capable "student" models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student's gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to the leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student with the GRACE-selected teacher can improve performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE provides guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.
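To make the idea of "distributional properties of the student's gradients" concrete, the minimal sketch below computes per-example student gradients on teacher-generated data and summarizes them with two illustrative statistics (mean pairwise cosine similarity and the spread of gradient norms). This is not the paper's actual GRACE formula; the toy model, synthetic data, and chosen statistics are assumptions for illustration only.

```python
# Illustrative sketch (NOT the exact GRACE score): compute simple distributional
# statistics of per-example student gradients on data from a candidate teacher.
# The student model, data, and statistics below are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for a student model; in practice this would be an LLM.
student = nn.Linear(16, 4)

# Toy stand-in for teacher-generated training examples (inputs, pseudo-labels).
xs = torch.randn(32, 16)
ys = torch.randint(0, 4, (32,))

def per_example_gradient(model, x, y):
    """Flattened gradient of the loss on a single example w.r.t. model parameters."""
    loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.flatten() for g in grads])

grads = torch.stack([per_example_gradient(student, x, y) for x, y in zip(xs, ys)])

# Example distributional statistics of the gradient cloud: average pairwise
# cosine similarity (alignment across examples) and spread of gradient norms.
normed = F.normalize(grads, dim=1)
cos = normed @ normed.T
n = cos.shape[0]
mean_pairwise_cos = (cos.sum() - n) / (n * (n - 1))  # exclude the diagonal
norm_std = grads.norm(dim=1).std()

print(f"mean pairwise cosine similarity: {mean_pairwise_cos:.4f}")
print(f"std of per-example gradient norms: {norm_std:.4f}")
```

A score of this kind could be computed once per candidate teacher's generated dataset and used to rank teachers for a given student, mirroring the selection workflow described in the abstract.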