Bayesian neural networks (BNNs) combine the expressive power of deep learning with the advantages of Bayesian formalism. In recent years, the analysis of wide, deep BNNs has provided theoretical insight into their priors and posteriors. However, we have no analogous insight into their posteriors under approximate inference. In this work, we show that mean-field variational inference entirely fails to model the data when the network width is large and the activation function is odd. Specifically, for fully-connected BNNs with odd activation functions and a homoscedastic Gaussian likelihood, we show that the optimal mean-field variational posterior predictive (i.e., function space) distribution converges to the prior predictive distribution as the width tends to infinity. We generalize aspects of this result to other likelihoods. Our theoretical results are suggestive of underfitting behavior previously observed in BNNs. While our convergence bounds are non-asymptotic and constants in our analysis can be computed, they are currently too loose to be applicable in standard training regimes. Finally, we show that the optimal approximate posterior need not tend to the prior if the activation function is not odd, showing that our statements cannot be generalized arbitrarily.
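As an informal sketch of the central claim (the notation and the form of the bound here are illustrative, not quoted from the paper's formal theorem statement): write $P$ for the prior predictive distribution over functions and $Q^{*}_{W}$ for the posterior predictive distribution induced by the optimal mean-field variational approximation for a fully-connected BNN of width $W$ with an odd activation and homoscedastic Gaussian likelihood. The result states that, for a suitable distance $d$ between predictive distributions,

$$ d\bigl(Q^{*}_{W},\, P\bigr) \;\le\; \varepsilon(W), \qquad \varepsilon(W) \xrightarrow{\;W \to \infty\;} 0, $$

where $\varepsilon(W)$ is a non-asymptotic bound with computable (though currently loose) constants; in other words, in the infinite-width limit the learned predictive distribution ignores the data.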