Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating these divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over the parameter space. Such neural estimators are abundantly used in practice, but the corresponding performance guarantees are partial and call for further exploration. In particular, there is a fundamental tradeoff between the two sources of error involved: approximation and empirical estimation. While the former requires the NN class to be rich and expressive, the latter relies on controlling complexity. We explore this tradeoff for an estimator based on a shallow NN by means of non-asymptotic error bounds, focusing on four popular $\mathsf{f}$-divergences -- Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. The bounds reveal the tension between the NN size and the number of samples, and enable characterizing scaling rates for them that ensure consistency. For compactly supported distributions, we further show that neural estimators with a slightly different NN growth rate are near minimax rate-optimal, achieving the parametric convergence rate up to logarithmic factors.
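To make the variational recipe concrete, the following is a minimal sketch for the Kullback-Leibler case, assuming the standard Donsker-Varadhan representation and a shallow NN class $\mathcal{F}_k$ with $k$ neurons; the notation $\mathcal{F}_k$ and $\widehat{\mathsf{D}}_{\mathsf{KL}}$ is introduced here for illustration and need not match the paper's exact estimator.
\[
\mathsf{D}_{\mathsf{KL}}(P\|Q) \;=\; \sup_{f}\,\Big\{ \mathbb{E}_P[f(X)] \;-\; \log \mathbb{E}_Q\big[e^{f(Y)}\big] \Big\},
\qquad
\widehat{\mathsf{D}}_{\mathsf{KL}}(\mathcal{F}_k) \;=\; \sup_{f\in\mathcal{F}_k}\,\Bigg\{ \frac{1}{n}\sum_{i=1}^{n} f(X_i) \;-\; \log\!\Bigg(\frac{1}{n}\sum_{j=1}^{n} e^{f(Y_j)}\Bigg) \Bigg\},
\]
where $X_1,\dots,X_n \sim P$ and $Y_1,\dots,Y_n \sim Q$ are samples and $\mathcal{F}_k$ is the class of $k$-neuron shallow NNs. Restricting the supremum to $\mathcal{F}_k$ incurs the approximation error, which shrinks as $k$ grows, while replacing expectations by sample means incurs the empirical estimation error, which grows with the complexity of $\mathcal{F}_k$; the tension between these two terms is what the non-asymptotic bounds quantify.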