Hierarchical Bayesian methods enable information sharing across multiple related regression problems. While standard practice is to model regression parameters (effects) as (1) exchangeable across datasets and (2) correlated to differing degrees across covariates, we show that this approach exhibits poor statistical performance when the number of covariates exceeds the number of datasets. For instance, in statistical genetics, we might regress dozens of traits (defining datasets) for thousands of individuals (responses) on up to millions of genetic variants (covariates). When an analyst has more covariates than datasets, we argue that it is often more natural to instead model effects as (1) exchangeable across covariates and (2) correlated to differing degrees across datasets. To this end, we propose a hierarchical model expressing our alternative perspective. We devise an empirical Bayes estimator for learning the degree of correlation between datasets. We develop theory that demonstrates that our method outperforms the classic approach when the number of covariates dominates the number of datasets, and corroborate this result empirically on several high-dimensional multiple regression and classification problems.
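As a rough illustration of the alternative perspective, the sketch below simulates effects that are exchangeable across covariates and correlated across datasets: each covariate's vector of effects (one entry per dataset) is drawn i.i.d. from a zero-mean Gaussian with a shared cross-dataset covariance. The Gaussian form, the names `sample_effects` and `moment_estimate`, and the example covariance are assumptions made for illustration. The moment estimate also treats the effects as directly observed, which is not the case in practice, so it should not be read as the empirical Bayes estimator proposed here.

```python
import numpy as np

def sample_effects(n_covariates, sigma, rng):
    """Draw one effect vector per covariate, i.i.d. from N(0, sigma).

    Rows index covariates (exchangeable); columns index datasets (correlated).
    """
    n_datasets = sigma.shape[0]
    return rng.multivariate_normal(np.zeros(n_datasets), sigma, size=n_covariates)

def moment_estimate(effects):
    """Estimate the cross-dataset covariance as the average outer product of rows."""
    return effects.T @ effects / effects.shape[0]

rng = np.random.default_rng(0)

# Hypothetical covariance: datasets 0 and 1 are strongly related, dataset 2 is not.
sigma_true = np.array([[1.0, 0.8, 0.0],
                       [0.8, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])

# Many more covariates (5000) than datasets (3), the regime discussed above.
effects = sample_effects(5000, sigma_true, rng)
sigma_hat = moment_estimate(effects)
print(np.round(sigma_hat, 2))
```

Because every covariate contributes one draw from the same distribution, the number of covariates acts as the sample size for learning the cross-dataset covariance, which gives some intuition for why this view is attractive when covariates far outnumber datasets.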