Genomic data are subject to various sources of confounding, such as demographic variables, biological heterogeneity, and batch effects. To identify genomic features associated with a variable of interest in the presence of confounders, the traditional approach fits a confounder-adjusted regression model to each genomic feature and then applies a multiplicity correction. This study shows that the traditional approach is suboptimal and proposes a new two-dimensional false discovery rate control framework (2dFDR+) that provides a substantial power improvement over the conventional method and applies to a wide range of settings. 2dFDR+ uses marginal independence test statistics as auxiliary information to filter out less promising features; FDR control is then performed on the remaining features using conditional independence test statistics. 2dFDR+ provides (asymptotically) valid inference from samples in settings where the conditional distribution of the genomic variables given the covariate of interest and the confounders is arbitrary and completely unknown. To achieve this goal, our method requires the conditional distribution of the covariate given the confounders to be known or estimable from the data. We develop a new procedure to simultaneously select the two cutoff values for the marginal and conditional independence test statistics. 2dFDR+ is proved to offer asymptotic FDR control and to dominate the traditional procedure in power. Promising finite-sample performance is demonstrated via extensive simulations and real data applications.
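As a rough illustration of the baseline described above, the following sketch fits a confounder-adjusted linear regression to each feature and applies the Benjamini-Hochberg correction to the resulting p-values. It also computes the marginal (unadjusted) statistics that 2dFDR+ is said to use as auxiliary information, but it does not implement the 2dFDR+ cutoff selection itself. The function names, the linear working model, and the normal approximation for p-values are all assumptions made for illustration, not details taken from the paper.

```python
import math
import numpy as np


def ols_z_stat(y, design, col):
    """Studentized OLS coefficient for column `col` of `design`."""
    beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = design.shape[0] - design.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return beta[col] / math.sqrt(cov[col, col])


def two_sided_p(z):
    """Two-sided p-value under a standard normal reference.

    Assumption: a large-sample approximation is used instead of the
    exact t distribution.
    """
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean rejection mask from the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject


def traditional_procedure(Y, x, Z, alpha=0.05):
    """Hypothetical helper: per-feature adjusted regression plus BH.

    Y: (n, m) feature matrix; x: (n,) variable of interest;
    Z: (n,) or (n, q) confounders.
    Returns the BH rejection mask, the conditional (adjusted) statistics,
    and the marginal (unadjusted) statistics.
    """
    ones = np.ones(len(x))
    adjusted = np.column_stack([ones, x, Z])  # intercept, x, confounders
    marginal = np.column_stack([ones, x])     # intercept, x only
    m = Y.shape[1]
    z_cond = np.array([ols_z_stat(Y[:, j], adjusted, 1) for j in range(m)])
    z_marg = np.array([ols_z_stat(Y[:, j], marginal, 1) for j in range(m)])
    pvals = np.array([two_sided_p(z) for z in z_cond])
    return benjamini_hochberg(pvals, alpha), z_cond, z_marg
```

In this sketch only `z_cond` drives the rejections; `z_marg` is returned so that the two statistics can be compared per feature, which is the pair of quantities a two-dimensional procedure would threshold jointly.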