Existing high-dimensional statistical methods are largely designed for individual-level data. In this work, we study estimation and inference for high-dimensional linear models in which only "proxy data" are observed: marginal statistics and a sample covariance matrix computed from different sets of individuals. We develop a rate-optimal method for estimation and inference for the regression coefficient vector and its linear functionals based on the proxy data. Moreover, we show the intrinsic limitations of proxy-data-based inference: the minimax optimal rate for estimation is slower than in the conventional case where individual-level data are observed, and the power for testing and multiple testing does not go to one as the signal strength goes to infinity. These findings are illustrated through simulation studies and an analysis of a dataset concerning the genetic associations of hindlimb muscle weight in a mouse population.
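To make the proxy-data setting concrete, the following is a minimal sketch and not the method developed in this work. It assumes standard Gaussian covariates and takes the marginal statistics to be X^T y / n from one sample and the covariance estimate to come from an independent reference sample. It then fits an ordinary l1-penalized quadratic objective by proximal gradient descent, a generic technique swapped in for illustration. The function names (`make_proxy_data`, `proxy_lasso`), the dimensions, and the penalty level are all hypothetical choices.

```python
import numpy as np


def make_proxy_data(n=400, n_ref=300, p=50, s=3, seed=0):
    """Simulate proxy data (illustrative assumption, not the paper's design).

    Marginal statistics come from one sample (X, y); the covariance
    estimate comes from a separate, independent reference sample X_ref.
    """
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[:s] = 1.0                                # sparse true coefficients
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    X_ref = rng.standard_normal((n_ref, p))       # different individuals
    marginal = X.T @ y / n                        # marginal statistics
    cov_ref = X_ref.T @ X_ref / n_ref             # reference sample covariance
    return marginal, cov_ref, beta


def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def proxy_lasso(marginal, cov, lam, n_iter=500):
    """Minimize 0.5 * b' cov b - b' marginal + lam * ||b||_1 by ISTA.

    Only the proxy data (marginal, cov) are used; no individual-level
    observations enter the objective.
    """
    b = np.zeros(marginal.shape[0])
    step = 1.0 / np.linalg.eigvalsh(cov)[-1]      # 1 / largest eigenvalue
    for _ in range(n_iter):
        grad = cov @ b - marginal                 # gradient of the smooth part
        b = soft_threshold(b - step * grad, step * lam)
    return b


marginal, cov_ref, beta = make_proxy_data()
b_hat = proxy_lasso(marginal, cov_ref, lam=0.3)
```

In this sketch the mismatch between the reference covariance and the covariance of the sample that produced the marginal statistics acts as an extra noise source, which gives some intuition for why proxy-data estimation can be harder than the individual-level case.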