The widespread availability of high-dimensional biological data has made the simultaneous screening of many biological characteristics a central problem in computational biology and allied sciences. As the dimensionality of such datasets continues to grow, so too does the complexity of identifying biomarkers from exposure patterns in health studies that measure baseline confounders; moreover, doing so while avoiding model misspecification remains an issue only partially addressed. Efficient estimators that can incorporate flexible, data-adaptive regression techniques when estimating relevant components of the data-generating distribution provide an avenue for avoiding model misspecification. However, in high-dimensional problems that require the simultaneous estimation of numerous parameters, standard variance estimators have proven unstable, resulting in unreliable Type I error control even under standard multiple testing corrections. We present a general approach for applying empirical Bayes shrinkage to variance estimators of a family of efficient, asymptotically linear estimators of population intervention causal effects. Our generalization of shrinkage-based variance estimators increases inferential stability in high-dimensional settings, facilitating the use of these estimators to derive nonparametric variable importance measures in high-dimensional biological datasets with modest sample sizes. The result is a data-adaptive approach for robustly uncovering stable causal associations in high-dimensional data from studies with limited samples. We evaluate our generalized variance estimator against alternative variance estimators in numerical experiments, and we demonstrate biomarker identification with the proposed methodology in an analysis of high-dimensional DNA methylation data from an observational study on the epigenetic effects of tobacco smoking.
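To make the idea of shrinking variance estimates concrete, the following is a minimal sketch and not the estimator proposed in this work. It assumes that each biomarker's estimator is asymptotically linear, so that its variance can be approximated by the sample variance of estimated influence function values divided by the sample size, and it pulls each per-biomarker variance toward a pooled value. The function names, the fixed `prior_df`, and the use of the average variance as the prior value are all illustrative assumptions; a full empirical Bayes procedure would estimate the prior parameters from the data.

```python
import numpy as np


def shrink_variances(inf_values, prior_df, prior_var=None):
    """Shrink per-biomarker variances toward a common prior value.

    inf_values : array of shape (n_samples, n_biomarkers) holding estimated
        influence function values, one column per biomarker (assumed input).
    prior_df : hypothetical prior degrees of freedom; larger values shrink more.
    prior_var : prior variance; defaults to the mean of the raw variances,
        which is a simplifying assumption made only for this sketch.
    """
    n_samples = inf_values.shape[0]
    resid_df = n_samples - 1
    # Raw variance of the influence function values for each biomarker.
    raw_var = inf_values.var(axis=0, ddof=1)
    if prior_var is None:
        prior_var = raw_var.mean()
    # Weighted average of the prior variance and each raw variance.
    shrunk_var = (prior_df * prior_var + resid_df * raw_var) / (prior_df + resid_df)
    return raw_var, shrunk_var


def moderated_statistics(estimates, inf_values, prior_df, null_value=0.0):
    """Test statistics that use the shrunken variances in the standard error."""
    n_samples = inf_values.shape[0]
    _, shrunk_var = shrink_variances(inf_values, prior_df)
    # The variance of an asymptotically linear estimator scales as var(IF) / n.
    std_err = np.sqrt(shrunk_var / n_samples)
    return (estimates - null_value) / std_err
```

With few samples, the raw variances can be very small for some biomarkers purely by chance, which inflates their test statistics; the weighted average above damps those extremes, and setting `prior_df` to zero recovers the unshrunken variances.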