Data-Based Optimal Bandwidth for Kernel Density Estimation of Statistical Samples

Estimating a probability density function, or the spatial density of matter, from statistical samples is common practice. Kernel density estimation is a frequently used method for this, but selecting an optimal kernel bandwidth based entirely on the data sample is a long-standing problem that has not yet been settled satisfactorily. Analytic formulae for the optimal kernel bandwidth exist, but they cannot be applied directly to data samples because they depend on the unknown underlying density function from which the samples are drawn. In this work, we devise an approach for selecting an optimal bandwidth that is fully data-based. First, we derive correction formulae for the analytic formulae of the optimal bandwidth, which are used to compute the roughness of the sample's density function. We then substitute the correction formulae into the analytic formulae for the optimal bandwidth and, through iteration, obtain the sample's optimal bandwidth. Our approach agrees well with the analytic formulae: the relative differences are only 2%-3% for sample sizes larger than 10^4. The approach can also be generalized easily to variable kernel estimation.
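To make the idea of iterating on a roughness estimate concrete, the following is a minimal sketch of a plain, uncorrected fixed-point iteration. It is not the method of this work: the correction formulae are not given in the abstract and are not reproduced here. The sketch assumes a one-dimensional sample, a Gaussian kernel, and the standard asymptotic formula h = [R(K) / (mu_2(K)^2 R(f'') n)]^(1/5), with the unknown R(f'') replaced by the roughness of the kernel estimate itself. The function names, the starting value (the normal-reference rule of thumb), and the tolerance are all illustrative choices.

```python
import math
import random
import statistics

# Roughness R(K) = integral of K^2 for the standard Gaussian kernel.
R_GAUSS = 1.0 / (2.0 * math.sqrt(math.pi))


def roughness_of_second_derivative(x, h):
    """Return R(f_hat'') = integral of (f_hat'')^2 for a Gaussian KDE with bandwidth h.

    Closed form: (1/n^2) * sum_ij phi4_s(x_i - x_j), where phi4_s is the fourth
    derivative of a Gaussian density with standard deviation s = h * sqrt(2).
    """
    n = len(x)
    s = h * math.sqrt(2.0)
    norm = math.sqrt(2.0 * math.pi) * s ** 5
    total = 3.0 * n  # i == j terms: u = 0 gives u^4 - 6u^2 + 3 = 3
    for i in range(n):
        xi = x[i]
        for j in range(i + 1, n):
            u2 = ((xi - x[j]) / s) ** 2
            # Factor 2 accounts for the symmetric (j, i) term.
            total += 2.0 * (u2 * u2 - 6.0 * u2 + 3.0) * math.exp(-0.5 * u2)
    return total / (norm * n * n)


def amise_bandwidth(n, roughness):
    """Asymptotic optimal bandwidth for a Gaussian kernel (mu_2(K) = 1)."""
    return (R_GAUSS / (n * roughness)) ** 0.2


def iterate_bandwidth(x, tol=1e-6, max_iter=200):
    """Iterate h -> amise_bandwidth(n, R(f_hat''_h)) until h stops changing."""
    n = len(x)
    h = 1.06 * statistics.stdev(x) * n ** -0.2  # rule-of-thumb starting value
    for _ in range(max_iter):
        h_new = amise_bandwidth(n, roughness_of_second_derivative(x, h))
        if abs(h_new - h) <= tol * h:
            return h_new
        h = h_new
    return h


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(400)]
h_fixed = iterate_bandwidth(sample)
print(f"fixed-point bandwidth: {h_fixed:.4f}")
```

In this uncorrected form, the roughness estimate includes the i = j terms and the extra smoothing introduced by the kernel itself, so the fixed point should be read as an illustration of the iteration only, not as the optimal bandwidth.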
