Transformers are state-of-the-art in a wide range of NLP tasks and have also been adopted in many real-world products. Understanding the reliability and certainty of transformer predictions is crucial for building trustworthy machine learning applications, e.g., medical diagnosis. Although many transformer extensions have been proposed recently, uncertainty estimation for transformer models remains under-explored. In this work, we propose a novel way to equip transformers with the capability of uncertainty estimation while retaining their original predictive performance. This is achieved by learning a hierarchical stochastic self-attention that attends to values and to a set of learnable centroids, respectively. New attention heads are then formed as mixtures of centroids sampled using the Gumbel-Softmax trick. We theoretically show that the error of approximating self-attention by sampling from a Gumbel distribution is upper bounded. We empirically evaluate our model on two text classification tasks with both in-domain (ID) and out-of-domain (OOD) datasets. The experimental results demonstrate that our approach: (1) achieves the best trade-off between predictive performance and uncertainty estimation among the compared methods; (2) exhibits very competitive (in most cases, improved) predictive performance on ID datasets; (3) is on par with Monte Carlo dropout and ensemble methods in uncertainty estimation on OOD datasets.
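To make the mechanism described above more concrete, the following is a minimal sketch of a single stochastic attention layer in PyTorch, assuming nothing beyond the abstract: query-centroid scores are turned into (approximately) one-hot attention weights with the Gumbel-Softmax trick, and each token's output is the resulting mixture of learnable centroids. This is an illustrative simplification, not the authors' hierarchical two-level implementation; names such as `StochasticCentroidAttention` and `num_centroids` are hypothetical.

```python
# Illustrative sketch only: one stochastic attention head over learnable centroids,
# sampled with the Gumbel-Softmax trick (not the paper's full hierarchical model).
import torch
import torch.nn as nn
import torch.nn.functional as F


class StochasticCentroidAttention(nn.Module):
    def __init__(self, d_model: int, num_centroids: int = 16, tau: float = 1.0):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        # Learnable centroids play the role of keys/values at the stochastic level.
        self.centroids = nn.Parameter(torch.randn(num_centroids, d_model))
        self.tau = tau  # Gumbel-Softmax temperature

    def forward(self, x: torch.Tensor, hard: bool = False) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q = self.q_proj(x)                                    # (B, T, D)
        logits = q @ self.centroids.t() / q.size(-1) ** 0.5   # (B, T, K)
        # Sample attention weights from a Gumbel-Softmax distribution;
        # gradients flow through the continuous relaxation.
        attn = F.gumbel_softmax(logits, tau=self.tau, hard=hard, dim=-1)
        # Each token's new representation is a mixture of sampled centroids.
        return attn @ self.centroids                          # (B, T, D)


# Repeated stochastic forward passes yield different samples, which can be
# aggregated to estimate predictive uncertainty, in the spirit of MC dropout.
layer = StochasticCentroidAttention(d_model=64, num_centroids=8)
x = torch.randn(2, 10, 64)
samples = torch.stack([layer(x) for _ in range(5)])           # (5, 2, 10, 64)
```

Because the sampling is part of the forward pass, uncertainty can be read off from the spread of predictions across samples, while a single (or hard, low-temperature) sample recovers deterministic-style inference.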