Sufficient digits and density estimation: A Bayesian nonparametric approach using generalized finite Pólya trees

This paper proposes a novel approach for statistical modelling of a continuous random variable $X$ on $[0, 1)$, based on its digit representation $X=.X_1X_2\ldots$. In general, $X$ can be coupled with a latent random variable $N$ so that $(X_1,\ldots,X_N)$ becomes a sufficient statistic and $.X_{N+1}X_{N+2}\ldots$ is uniformly distributed. In line with this fact, and focusing on binary digits for simplicity, we propose a family of generalized finite Pólya trees that induces a random density for a sample and thereby provides a flexible tool for density estimation. Here, the digit system may be random and learned from the data. We provide a detailed Bayesian analysis, including a closed-form expression for the posterior distribution. We analyse the frequentist properties as the sample size increases, and provide sufficient conditions for consistency of the posterior distributions of the random density and $N$. We consider an extension to data spanning multiple orders of magnitude, and propose a prior distribution that encodes the so-called extended Newcomb-Benford law. Such a model shows promising results for density estimation of human-activity data. Our methodology is illustrated on several synthetic and real datasets.
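
The role of the leading digits can be illustrated with a minimal sketch. It assumes a fixed, known $N$ (in the paper $N$ is a latent random variable) and a density that is constant on each of the $2^N$ dyadic intervals of $[0, 1)$. Under that assumption the first $N$ binary digits identify the interval and the remainder $.X_{N+1}X_{N+2}\ldots$ is uniform. The function names `binary_digits` and `sample_piecewise_constant` and the cell probabilities are hypothetical choices made for illustration; this is not the generalized finite Pólya tree construction itself.

```python
import random


def binary_digits(x, n):
    """Split x in [0, 1) into its first n binary digits and the remainder.

    The remainder is the number .X_{n+1}X_{n+2}... rescaled to [0, 1).
    """
    digits = []
    for _ in range(n):
        x *= 2
        d = int(x)      # next binary digit, 0 or 1
        digits.append(d)
        x -= d          # keep the fractional part
    return digits, x


def sample_piecewise_constant(probs, rng):
    """Draw X with a density that is constant on each of len(probs) equal cells.

    len(probs) is assumed to be a power of two, so that each cell is a
    dyadic interval indexed by the leading binary digits of X.
    """
    n_cells = len(probs)
    k = rng.choices(range(n_cells), weights=probs)[0]  # pick a cell
    return (k + rng.random()) / n_cells                # uniform within the cell


# Illustrative setting: N = 2, so four dyadic cells with assumed probabilities.
rng = random.Random(0)
probs = [0.1, 0.2, 0.3, 0.4]
remainders = [binary_digits(sample_piecewise_constant(probs, rng), 2)[1]
              for _ in range(20000)]
mean_remainder = sum(remainders) / len(remainders)  # close to 0.5 if uniform
```

For example, `binary_digits(0.625, 3)` returns the digits `[1, 0, 1]` with remainder `0.0`, since $0.625 = .101$ in binary. In the simulation, the sample mean of the remainders should lie near $1/2$, as expected for a uniform variable on $[0, 1)$.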
