Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in "large $n$ & large $p$" Bayesian sparse regression

From arXiv. 35 pages, 7 figures, plus Supplement (42 pages, 18 figures). Software package available: see documentation at https://bayes-bridge.readthedocs.io and source code at https://github.com/aki-nishimura/bayes-bridge

In a modern observational study based on healthcare databases, the numbers of observations and of predictors are typically on the order of $10^5 \sim 10^6$ and $10^4 \sim 10^5$, respectively. Despite the large sample size, data rarely provide sufficient information to reliably estimate such a large number of parameters. Sparse regression techniques provide potential solutions, one notable approach being Bayesian methods based on shrinkage priors. In the "large $n$ & large $p$" setting, however, posterior computation encounters a major bottleneck at repeated sampling from a high-dimensional Gaussian distribution, whose precision matrix $\Phi$ is expensive to compute and factorize. In this article, we present a novel algorithm to alleviate this bottleneck based on the following observation: we can cheaply generate a random vector $b$ such that the solution to the linear system $\Phi \beta = b$ has the desired Gaussian distribution. We can then solve the linear system by the conjugate gradient (CG) algorithm through matrix-vector multiplications by $\Phi$; this involves no explicit factorization or calculation of $\Phi$ itself. Rapid convergence of CG in this context is guaranteed by the theory of prior-preconditioning we develop. We apply our algorithm to a clinically relevant large-scale observational study with $n$ = 72,489 patients and $p$ = 22,175 clinical covariates, designed to assess the relative risk of adverse events from two alternative blood anticoagulants. Our algorithm demonstrates an order of magnitude speed-up in the posterior computation.
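
The sampling idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the model, so everything specific here is an assumption: a Gaussian linear model $y = X\beta + \epsilon$ with noise standard deviation `sigma`, independent priors $\beta_j \sim N(0, \tau^2 \lambda_j^2)$, and hence $\Phi = \sigma^{-2} X^\top X + \tau^{-2} \Lambda^{-2}$. Under those assumptions, $b = \sigma^{-2} X^\top y + \sigma^{-1} X^\top \eta + \tau^{-1} \Lambda^{-1} \delta$ with standard Gaussian $\eta$ and $\delta$ has covariance $\Phi$, so the solution of $\Phi \beta = b$ has covariance $\Phi^{-1}$ and the desired mean. The function name `sample_beta_cg` and its arguments are hypothetical, and the code uses SciPy's generic `cg` solver rather than the bayes-bridge package's own interface.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg


def sample_beta_cg(X, y, sigma, tau, lam, rng):
    """Draw beta ~ N(Phi^{-1} X'y / sigma^2, Phi^{-1}) without forming Phi.

    Hypothetical helper: assumes a Gaussian likelihood and independent
    Gaussian priors with standard deviations tau * lam.
    """
    n, p = X.shape
    prior_sd = tau * lam  # prior standard deviation of each coefficient

    # Random right-hand side b ~ N(X'y / sigma^2, Phi), from two cheap draws.
    eta = rng.standard_normal(n)
    delta = rng.standard_normal(p)
    b = X.T @ (y / sigma**2) + X.T @ eta / sigma + delta / prior_sd

    # Phi v = X'(X v) / sigma^2 + v / prior_sd^2, via matrix-vector products only.
    def phi_matvec(v):
        v = np.ravel(v)
        return X.T @ (X @ v) / sigma**2 + v / prior_sd**2

    # Prior preconditioner: SciPy expects an approximation of Phi^{-1},
    # so we apply the prior covariance diag(prior_sd^2).
    def prior_precond(v):
        return prior_sd**2 * np.ravel(v)

    Phi = LinearOperator((p, p), matvec=phi_matvec, dtype=np.float64)
    M = LinearOperator((p, p), matvec=prior_precond, dtype=np.float64)

    beta, info = cg(Phi, b, M=M)  # info == 0 signals convergence
    return beta, b, info


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    lam = rng.uniform(0.1, 1.0, size=50)
    y = X[:, 0] + rng.standard_normal(200)
    beta, b, info = sample_beta_cg(X, y, sigma=1.0, tau=1.0, lam=lam, rng=rng)
    print("converged:", info == 0, "| draw shape:", beta.shape)
```

Each CG iteration costs two products with $X$ (one by $X$, one by $X^\top$), so $\Phi$ is never assembled or factorized. The preconditioner is diagonal and therefore essentially free to apply; how quickly CG converges with it in practice depends on the data and the prior scales.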
