Being able to reliably assess not only the accuracy but also the uncertainty of models' predictions is an important endeavour in modern machine learning. Even when the model generating the data and labels is known, computing the intrinsic uncertainty after learning the model from a limited number of samples amounts to sampling the corresponding posterior probability measure. Such sampling is computationally challenging in high-dimensional problems, and theoretical results on heuristic uncertainty estimators in high dimensions are thus scarce. In this manuscript, we characterise uncertainty for learning from a limited number of samples of high-dimensional Gaussian input data with labels generated by the probit model. We prove that the Bayesian uncertainty (i.e. the posterior marginals) can be obtained asymptotically by the approximate message passing algorithm, bypassing the canonical but costly Monte Carlo sampling of the posterior. We then provide a closed-form formula for the joint statistics between the logistic classifier, the uncertainty of the statistically optimal Bayesian classifier, and the ground-truth probit uncertainty. This formula allows us to investigate the calibration of the logistic classifier when learning from a limited number of samples. We discuss how over-confidence can be mitigated by appropriate regularisation, and show that cross-validating with respect to the loss leads to better calibration than cross-validating with respect to the 0/1 error.
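The setting can be illustrated with a small finite-size simulation. Below is a minimal sketch, not the manuscript's asymptotic analysis: Gaussian inputs with probit-generated labels, a ridge-regularised logistic classifier, and a crude probe of its over-confidence relative to the ground-truth probit probability. The dimension d, sample ratio alpha, noise level sigma, and regularisation strengths lam are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the probit teacher / logistic student setting.
# All hyperparameter values below are illustrative, not from the manuscript.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, alpha = 200, 2.0                 # input dimension and samples-per-dimension ratio (assumed)
n = int(alpha * d)
w_star = rng.standard_normal(d) / np.sqrt(d)   # ground-truth weights, unit norm on average

# Probit model: P(y = 1 | x) = Phi(x @ w_star / sigma), with noise level sigma (assumed).
sigma = 0.5
X = rng.standard_normal((n, d))
p_true = norm.cdf(X @ w_star / sigma)
y = (rng.random(n) < p_true).astype(int)

# Ridge-regularised logistic classifier; sklearn's C is the inverse of the
# regularisation strength lam, so larger lam means stronger regularisation.
X_test = rng.standard_normal((2000, d))
truth = norm.cdf(X_test @ w_star / sigma)       # ground-truth probit probability
for lam in [1e-4, 1e-1, 1.0]:
    clf = LogisticRegression(C=1.0 / lam, max_iter=5000).fit(X, y)
    conf = clf.predict_proba(X_test)[:, 1]      # classifier confidence for class 1
    # Heuristic over-confidence probe: positive values mean the classifier
    # reports higher certainty than the true posterior probability warrants.
    over = np.mean(np.abs(conf - 0.5) - np.abs(truth - 0.5))
    print(f"lam={lam:g}: mean over-confidence = {over:+.3f}")
```

With weak regularisation the student's confidences concentrate away from the ground-truth probabilities, illustrating the over-confidence that, per the abstract, appropriate regularisation mitigates.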