Bayesian data selection

Insights into complex, high-dimensional data can be obtained by discovering features of the data that match or do not match a model of interest. To formalize this task, we introduce the "data selection" problem: finding a lower-dimensional statistic, such as a subset of variables, that is well fit by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining "background" components of the data, and perform standard Bayesian model selection for the choice of statistic. However, fitting a nonparametric model to high-dimensional data tends to be highly inefficient, both statistically and computationally. We propose a novel score for performing both data selection and model selection, the "Stein volume criterion", that takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in place of the Kullback-Leibler divergence. The Stein volume criterion does not require one to fit or even specify a nonparametric background model, making it straightforward to compute; in many cases it is as simple as fitting the parametric model of interest with an alternative objective function. We prove that the Stein volume criterion is consistent for both data selection and model selection, and we establish consistency and asymptotic normality (Bernstein-von Mises) of the corresponding generalized posterior on parameters. We validate our method in simulation and apply it to the analysis of single-cell RNA sequencing datasets using probabilistic principal components analysis and a spin glass model of gene regulation.
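
As a rough illustration of the "alternative objective function" mentioned above, the following minimal sketch estimates a squared kernelized Stein discrepancy between a sample and a parametric model that is known only through its score function (the gradient of its log density). It is not the Stein volume criterion itself, and the abstract does not specify the kernel or the exact form of the discrepancy. The Gaussian (RBF) kernel, the fixed bandwidth, the U-statistic estimator, and the names `ksd_u_statistic`, `score`, and `bandwidth` are all assumptions made for this example.

```python
import numpy as np

def ksd_u_statistic(x, score, bandwidth=1.0):
    """U-statistic estimate of the squared kernelized Stein discrepancy.

    x         : (n, d) array of data points.
    score     : function mapping an (n, d) array to the (n, d) array of
                model scores, i.e. gradients of the model log density.
    bandwidth : RBF kernel bandwidth h (an assumed, fixed choice).
    """
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    s = score(x)                                   # (n, d)
    diff = x[:, None, :] - x[None, :, :]           # diff[i, j] = x_i - x_j
    sq = np.sum(diff ** 2, axis=-1)                # squared distances
    h2 = bandwidth ** 2
    k = np.exp(-sq / (2.0 * h2))                   # RBF kernel matrix

    # Stein kernel: s_i.s_j k + s_i.grad_y k + s_j.grad_x k + tr(grad_x grad_y k)
    term1 = (s @ s.T) * k
    term2 = np.einsum("id,ijd->ij", s, diff) / h2 * k
    term3 = -np.einsum("jd,ijd->ij", s, diff) / h2 * k
    term4 = (d / h2 - sq / h2 ** 2) * k
    u = term1 + term2 + term3 + term4

    np.fill_diagonal(u, 0.0)                       # U-statistic drops i == j
    return u.sum() / (n * (n - 1))

# Model of interest (assumed for the demo): standard normal, score(x) = -x.
rng = np.random.default_rng(0)
sample = rng.normal(size=(300, 2))
ksd_match = ksd_u_statistic(sample, lambda z: -z)           # data fit the model
ksd_shift = ksd_u_statistic(sample + 2.0, lambda z: -z)     # data do not fit
print(ksd_match, ksd_shift)
```

The estimate should be close to zero when the data are drawn from the model and clearly positive when they are not. Because only the score of the parametric model enters, no normalizing constant and no background model for the remaining data is needed, which is the property the abstract highlights.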
