Ensembles of models have been empirically shown to improve predictive performance and to yield robust measures of uncertainty. However, they incur substantial computational and memory costs. Therefore, recent research has focused on distilling ensembles into a single compact model, reducing the computational and memory burden of the ensemble while trying to preserve its predictive behavior. Most existing distillation formulations summarize the ensemble by capturing only its average predictions. As a result, the diversity of the predictions contributed by the individual members is lost, and the distilled model cannot provide a measure of uncertainty comparable to that of the original ensemble. To retain the diversity of the ensemble more faithfully, we propose a distillation method based on a single multi-headed neural network, which we refer to as Hydra. The shared body network learns a joint feature representation that enables each head to capture the predictive behavior of its corresponding ensemble member. We demonstrate that, with a slight increase in parameter count, Hydra improves distillation performance in classification and regression settings while capturing the uncertainty behavior of the original ensemble on both in-domain and out-of-distribution tasks.
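The following is a minimal sketch of the multi-headed architecture described above, written in PyTorch purely for illustration: the layer sizes, module names, and the per-head distillation loss (KL divergence between each head and the soft predictions of its assigned ensemble member) are assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Hydra(nn.Module):
    """Illustrative multi-headed distillation network: a shared body
    feeds one lightweight head per ensemble member."""

    def __init__(self, in_dim, hidden_dim, num_classes, num_heads):
        super().__init__()
        # Shared body: learns the joint feature representation.
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # One small head per ensemble member.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_classes) for _ in range(num_heads)]
        )

    def forward(self, x):
        z = self.body(x)
        # Per-head logits, stacked as (num_heads, batch, num_classes).
        return torch.stack([head(z) for head in self.heads])


def distillation_loss(head_logits, member_probs, temperature=1.0):
    """Hypothetical loss form: average KL divergence between each head's
    softened prediction and the soft prediction of the ensemble member
    it is assigned to (member_probs has the same shape as head_logits)."""
    log_q = F.log_softmax(head_logits / temperature, dim=-1)
    k, b, c = log_q.shape
    return F.kl_div(
        log_q.reshape(k * b, c),
        member_probs.reshape(k * b, c),
        reduction="batchmean",
    )
```

Because the body is shared and only the heads are duplicated, the parameter count grows only slightly with the number of ensemble members, while the per-head outputs preserve the spread of the original ensemble's predictions.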