In recent years, the transformer has established itself as a workhorse in many applications, ranging from natural language processing to reinforcement learning. Similarly, Bayesian deep learning has become the gold standard for uncertainty estimation in safety-critical applications, where robustness and calibration are crucial. Surprisingly, there have been no successful attempts to use Bayesian inference to improve the predictive uncertainty of transformer models. In this work, we study this curiously underpopulated area of Bayesian transformers. We find that weight-space inference in transformers does not work well, regardless of the approximate posterior. We also find that the prior is at least partially at fault, but that it is very hard to find well-specified weight priors for these models. We hypothesize that these problems stem from the complexity of obtaining a meaningful mapping from weight-space to function-space distributions in the transformer. Therefore, moving closer to function-space, we propose a novel method based on the implicit reparameterization of the Dirichlet distribution to apply variational inference directly to the attention weights. We find that this proposed method performs competitively with our baselines.
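To make the final step of the abstract concrete, the sketch below shows one way attention weights can be treated as Dirichlet random variables with pathwise gradients via implicit reparameterization, using PyTorch's Dirichlet.rsample() (which implements implicit reparameterization gradients in the sense of Figurnov et al., 2018). This is a minimal illustration under stated assumptions, not the paper's exact construction: the module name, the softplus link from attention logits to concentrations, and the uniform Dirichlet prior are all hypothetical choices made here for clarity.

```python
# Minimal sketch (assumed, not the authors' code): single-head attention
# whose rows of attention weights are Dirichlet samples rather than a
# deterministic softmax, enabling variational inference on the weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DirichletAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale

        # Map attention logits to positive Dirichlet concentrations.
        # softplus is one simple choice; the paper's parameterization
        # may differ.
        concentration = F.softplus(scores) + 1e-4

        q_dist = torch.distributions.Dirichlet(concentration)
        # rsample() uses implicit reparameterization, so gradients flow
        # through the sampled attention weights.
        attn = q_dist.rsample()  # (batch, seq_len, seq_len)

        # KL term against a symmetric Dirichlet prior (alpha = 1, i.e.
        # uniform over the simplex), assumed here for illustration.
        prior = torch.distributions.Dirichlet(torch.ones_like(concentration))
        kl = torch.distributions.kl_divergence(q_dist, prior).sum()

        return torch.matmul(attn, v), kl
```

In training, the returned KL term would be added to the negative log-likelihood to form the usual variational (ELBO) objective; at test time, multiple samples of the attention weights yield a predictive distribution rather than a point estimate.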