Bayesian models have many desirable properties, most notably their ability to generalize from limited data and to properly estimate the uncertainty in their predictions. However, these benefits come at a steep computational cost, as Bayesian inference is, in most cases, computationally intractable. One popular approach to alleviate this problem is to use Monte Carlo estimation with an ensemble of models sampled from the posterior. However, this approach still carries a significant computational cost, as one needs to store and run multiple models at test time. In this work, we investigate how to best distill an ensemble's predictions using an efficient model. First, we argue that current approaches that simply return a distribution over predictions cannot compute important properties, such as the covariance between predictions, which can be valuable for further processing. Second, in many limited data settings, all ensemble members achieve nearly zero training loss, i.e., they produce near-identical predictions on the training set, which results in sub-optimal distilled models. To address both problems, we propose a novel and general distillation approach, named Functional Ensemble Distillation (FED), and we investigate how to best distill an ensemble in this setting. We find that learning the distilled model via a simple augmentation scheme in the form of mixup augmentation significantly boosts the performance. We evaluate our method on several tasks and show that it achieves superior results in both accuracy and uncertainty estimation compared to current approaches.
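To make the mixup-based distillation idea concrete, below is a minimal sketch, not the paper's reference implementation, of one training step that queries an ensemble on mixup-augmented inputs and fits a single student to the resulting predictions. All names (`make_mixup_batch`, `distill_step`, the toy linear models, and the hyperparameters) are illustrative assumptions, and the objective shown here matches only the mean ensemble predictive in a standard KD style, whereas FED itself distills the ensemble's functional (member-wise/joint) behaviour.

```python
# Hypothetical sketch of mixup-augmented ensemble distillation (assumed names and setup).
import torch
import torch.nn.functional as F

def make_mixup_batch(x, alpha=1.0):
    """Convexly combine a batch with a shuffled copy of itself (mixup)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    return lam * x + (1.0 - lam) * x[perm]

def distill_step(ensemble, student, x, optimizer):
    """One distillation step: fit the student to ensemble predictions on mixup inputs."""
    x_aug = make_mixup_batch(x)  # augmented inputs, where members are more likely to disagree
    with torch.no_grad():
        # Stack member predictions: shape (num_members, batch, num_classes).
        teacher_probs = torch.stack([m(x_aug) for m in ensemble]).softmax(dim=-1)
    student_log_probs = student(x_aug).log_softmax(dim=-1)
    # Simplified KD-style objective against the mean predictive; FED's actual
    # objective targets the distribution over member predictions, not only the mean.
    loss = F.kl_div(student_log_probs, teacher_probs.mean(dim=0), reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    # Toy setup: a 5-member linear ensemble distilled into one linear student.
    torch.manual_seed(0)
    in_dim, num_classes = 20, 3
    ensemble = [torch.nn.Linear(in_dim, num_classes) for _ in range(5)]
    student = torch.nn.Linear(in_dim, num_classes)
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    x = torch.randn(64, in_dim)
    print(distill_step(ensemble, student, x, opt))
```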