Self-distillation (SD) is the process of first training a \enquote{teacher} model and then using its predictions to train a \enquote{student} model with the \textit{same} architecture. Specifically, the student's objective function is $\big(\xi*\ell(\text{teacher's predictions}, \text{ student's predictions}) + (1-\xi)*\ell(\text{given labels}, \text{ student's predictions})\big)$, where $\ell$ is some loss function and $\xi$ is some parameter $\in [0,1]$. Empirically, SD has been observed to provide performance gains in several settings. In this paper, we theoretically characterize the effect of SD in two supervised learning problems with \textit{noisy labels}. We first analyze SD for regularized linear regression and show that in the high label noise regime, the optimal value of $\xi$ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than 1. Empirically, we show that $\xi > 1$ works better than $\xi \leq 1$ even with the cross-entropy loss for several classification datasets when 50\% or 30\% of the labels are corrupted. Further, we quantify when optimal SD is better than optimal regularization. Next, we analyze SD in the case of logistic regression for binary classification with random label corruption and quantify the range of label corruption in which the student outperforms the teacher in terms of accuracy. To our knowledge, this is the first result of its kind for the cross-entropy loss.
翻译:自我蒸馏( SD) 是第一次训练 \ enquote{ 教师} 模式的过程, 然后用它的预测来用\ textit{ same} 架构来训练 entral quarte{ student} 模型。 具体地说, 学生的目标函数是 $\ big (\ xxit{ 教师的预测},\ text{ 学生的预测} + (1-\xi) + (\ text{ 标签} )\\ text{ renter relider labild}\ big) $, 美元是某种损失函数, 美元值是某种损失值, 美元=xxxxxxy 模型显示SDI 有两个监管的学习问题。 我们首先分析SDD( ) 是否正常的线性回归, 并在高标签的噪音制度中显示, $xxxxx 的分类值是最佳值, 在估算地面真相参数的预期错误值中, $=xxxxxxxxxxxxxx 的数值值值值值值值比 。