Sketching in Bayesian High Dimensional Regression With Big Data Using Gaussian Scale Mixture Priors

Bayesian computation of high dimensional linear regression models with a popular Gaussian scale mixture prior distribution using Markov Chain Monte Carlo (MCMC) or its variants can be extremely slow or completely prohibitive, because the computational cost grows in the cubic order of p, the number of features. Although a few recently developed algorithms make the computation efficient in the presence of a small to moderately large sample size (with the complexity growing in the cubic order of n), the computation becomes intractable when the sample size n is also large. In this article we adopt the data sketching approach: we compress the n original samples by a random linear transformation to m << n samples in p dimensions, and compute Bayesian regression with Gaussian scale mixture prior distributions using the randomly compressed response vector and feature matrix. Our proposed approach yields computational complexity growing in the cubic order of m. Another important motivation for this compression procedure is that it anonymizes the data by revealing little information about the original data in the course of analysis. Our detailed empirical investigation with the Horseshoe prior from the class of Gaussian scale mixture priors shows closely similar inference and a massive reduction in per-iteration computation time for the proposed approach compared to regression with the full sample. One notable contribution of this article is to derive the posterior contraction rate for the high dimensional predictor coefficients with a general class of shrinkage priors on them under data compression/sketching. In particular, we characterize the dimension m of the compressed response vector as a function of the sample size, the number of predictors and the sparsity in the regression to guarantee accurate estimation of the predictor coefficients asymptotically, even after data compression.
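The compression step and the resulting m-by-m cost can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the article's implementation: the function names `sketch_data` and `sample_beta` are hypothetical, the sketching matrix is taken to have i.i.d. N(0, 1/n) entries, and the coefficient draw uses a standard exact sampler for Gaussian conditionals that only solves an m-by-m linear system. The vector `d` stands in for the conditional prior variances that a scale mixture prior such as the Horseshoe would supply at each MCMC iteration.

```python
import numpy as np

def sketch_data(y, X, m, rng):
    """Compress n samples to m rows with a random linear map (assumed Gaussian)."""
    n = X.shape[0]
    # Assumption: i.i.d. N(0, 1/n) entries; other sketching matrices are possible.
    Phi = rng.normal(0.0, 1.0 / np.sqrt(n), size=(m, n))
    return Phi @ y, Phi @ X

def sample_beta(y_s, X_s, d, sigma2, rng):
    """Draw beta ~ N(A^{-1} X_s' y_s / sigma2, A^{-1}),
    where A = X_s' X_s / sigma2 + diag(1 / d).
    Only an m x m system is solved, so the cost is cubic in m, not in p."""
    m, p = X_s.shape
    sigma = np.sqrt(sigma2)
    F = X_s / sigma                      # scaled compressed features, m x p
    alpha = y_s / sigma                  # scaled compressed response, length m
    u = rng.normal(size=p) * np.sqrt(d)  # u ~ N(0, diag(d))
    delta = rng.normal(size=m)           # delta ~ N(0, I_m)
    v = F @ u + delta
    M = (F * d) @ F.T + np.eye(m)        # F diag(d) F' + I_m, an m x m matrix
    w = np.linalg.solve(M, alpha - v)
    return u + d * (F.T @ w)

# Toy usage with simulated data (sizes chosen only for illustration).
rng = np.random.default_rng(0)
n, p, m = 2000, 50, 200
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]         # sparse truth
y = X @ beta_true + rng.normal(size=n)

y_s, X_s = sketch_data(y, X, m, rng)     # shapes (m,) and (m, p)
draw = sample_beta(y_s, X_s, d=np.ones(p), sigma2=1.0, rng=rng)
```

In a full Gibbs sampler, `d` and `sigma2` would be updated from their own conditionals between calls to `sample_beta`; those updates are omitted here because they depend on the specific prior.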
