In real-world systems, models are frequently updated as more data becomes available, and in addition to achieving high accuracy, the goal is also to maintain a low difference in predictions relative to the base model (i.e., low predictive ``churn''). If model retraining results in vastly different behavior, it can cause negative effects in downstream systems, especially if this churn could have been avoided with limited impact on model accuracy. In this paper, we show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn. We then show that distillation performs strongly for low-churn training against a number of recent baselines on a wide range of datasets and model architectures, including fully-connected networks, convolutional networks, and transformers.
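As a minimal sketch of the kind of distillation objective described above: the retrained (student) model is fit against a mixture of the true labels and the base (teacher) model's soft predictions, where the cross-entropy against the teacher implicitly penalizes predictive churn. The function names and the mixing weight `lam` here are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, lam=0.5):
    """lam * cross-entropy on true labels
    + (1 - lam) * cross-entropy against the base (teacher) model's
    soft predictions, which discourages churn from the base model.
    `lam` is an illustrative mixing weight, not a value from the paper."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    n = len(labels)
    # Standard cross-entropy against the hard ground-truth labels.
    ce = -np.log(p_student[np.arange(n), labels] + 1e-12).mean()
    # Cross-entropy against the teacher's soft distribution (distillation term).
    kd = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean()
    return lam * ce + (1 - lam) * kd
```

With `lam = 1` this reduces to ordinary supervised training; decreasing `lam` trades accuracy on the new labels for agreement with the base model.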