Statistical inference for heteroskedasticity in high-dimensional regression is an important but under-explored problem. This paper aims to fill this gap by proposing two tests, the variance difference test and the variance difference Breusch-Pagan test, for assessing heteroskedasticity in high-dimensional regression. The former tests whether an explanatory feature of interest is associated with the conditional variance of the response variable, while the latter tests for heteroskedasticity in the regression, which is known as the Breusch-Pagan testing problem. To formally establish the tests, we derive rigorous P-values and test sizes, and analyze the test power under a nonparametric heteroskedastic data-generating model with high-dimensional input features. This model setting accommodates high-dimensional applications with flexible heteroskedasticity structures and features that have interaction effects on the mean of the response; such applications are common in many fields, such as biology. Our methods leverage machine learning mean prediction methods such as random forests and use knockoff variables as negative controls. In particular, the definition of knockoffs for our test statistics is more flexible than the original definition of knockoffs; we give a detailed comparison of the two definitions and discuss the advantages of ours. The satisfactory empirical performance of the proposed tests is illustrated with simulation results and an HIV (Human Immunodeficiency Virus) case study.
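For orientation, the classical low-dimensional Breusch-Pagan testing problem referred to in the abstract can be sketched as follows. This is not the proposed variance difference test: it is the standard studentized (Koenker) form of the Breusch-Pagan test with a single feature, an ordinary least squares mean fit in place of a random forest, and no knockoffs. The function names and the simulated data are illustrative assumptions, not taken from the paper.

```python
import math
import random


def ols_fit(x, y):
    """Fit y = a + b * x by least squares and return (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def breusch_pagan_1d(x, y):
    """Studentized Breusch-Pagan test with one feature.

    Returns (LM statistic, p-value). Under homoskedasticity the LM
    statistic is asymptotically chi-square with 1 degree of freedom.
    """
    n = len(x)
    # Step 1: fit the mean model and form squared residuals.
    a, b = ols_fit(x, y)
    sq_resid = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]
    # Step 2: auxiliary regression of squared residuals on the feature.
    c, d = ols_fit(x, sq_resid)
    mean_sq = sum(sq_resid) / n
    ss_tot = sum((u - mean_sq) ** 2 for u in sq_resid)
    ss_res = sum((u - c - d * xi) ** 2 for xi, u in zip(x, sq_resid))
    r2 = max(0.0, 1.0 - ss_res / ss_tot)  # guard against rounding below 0
    # Step 3: LM = n * R^2; chi-square(1) tail is erfc(sqrt(LM / 2)).
    lm = n * r2
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


def simulate(n, heteroskedastic, seed):
    """Hypothetical data: linear mean, noise sd constant or growing in x."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 2.0) for _ in range(n)]
    y = []
    for xi in x:
        sd = (0.2 + xi) if heteroskedastic else 1.0
        y.append(1.0 + 2.0 * xi + rng.gauss(0.0, sd))
    return x, y


if __name__ == "__main__":
    for flag in (False, True):
        x, y = simulate(2000, heteroskedastic=flag, seed=0)
        lm, p = breusch_pagan_1d(x, y)
        print(f"heteroskedastic={flag}: LM={lm:.2f}, p-value={p:.3g}")
```

In this sketch a small p-value indicates that the squared residuals vary with the feature. With many features and a nonlinear mean, the linear mean fit and the chi-square reference distribution used here are no longer reliable, which is the setting the abstract's proposed tests are designed for.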