Genomic datasets contain the effects of various unobserved biological variables in addition to the variable of primary interest. These latent variables often affect a large number of features (e.g., genes) and thus give rise to dense latent variation, which presents both challenges and opportunities for classification. Some of these latent variables may be partially correlated with the phenotype of interest and therefore helpful, while others may be uncorrelated and thus merely contribute additional noise. Moreover, whether potentially helpful or not, these latent variables may obscure weaker effects that impact only a small number of features but more directly capture the signal of primary interest. We propose the cross-residualization classifier to better account for the latent variables in genomic data. Through an adjustment and ensemble procedure, the cross-residualization classifier essentially estimates the latent variables and residualizes out their effects, trains a classifier on the residuals, and then re-integrates the latent variables in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information that they may contribute. We apply the method to simulated data as well as a variety of genomic datasets from multiple platforms. In general, we find that the cross-residualization classifier performs well relative to existing classifiers and sometimes offers substantial gains.
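To make the estimate, residualize, classify, and re-integrate idea concrete, the following is a minimal illustrative sketch and not the cross-residualization procedure itself, whose details are not given in this abstract. It assumes that latent variation can be approximated by the top k singular vectors of the centered training data, uses a nearest-centroid rule as a stand-in classifier for binary labels 0/1, and combines the two classifiers by averaging standardized scores. All function names (`fit_latent`, `split`, `fit_centroids`, `score`, `fit_predict`) and the parameter `k` are hypothetical. The sketch also fits everything in-sample and therefore ignores the cross-fitting adjustment that the method's name suggests.

```python
import numpy as np

def fit_latent(X, k):
    """Estimate k latent directions from centered training data via SVD (assumption)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]                      # shapes: (p,), (k, p)

def split(X, mu, V):
    """Return latent scores Z and residuals R after projecting out the rows of V."""
    Xc = X - mu
    Z = Xc @ V.T                           # (n, k) estimated latent variables
    R = Xc - Z @ V                         # (n, p) features with latent effects removed
    return Z, R

def fit_centroids(F, y):
    """Stand-in classifier: per-class means for binary labels 0/1."""
    return np.stack([F[y == c].mean(axis=0) for c in (0, 1)])

def score(F, centroids):
    """Positive values mean the sample is closer to the class-1 centroid."""
    d0 = ((F - centroids[0]) ** 2).sum(axis=1)
    d1 = ((F - centroids[1]) ** 2).sum(axis=1)
    return d0 - d1

def fit_predict(X_train, y_train, X_test, k=2):
    """Train on residuals and on latent scores, then average standardized scores."""
    mu, V = fit_latent(X_train, k)
    Z_tr, R_tr = split(X_train, mu, V)
    Z_te, R_te = split(X_test, mu, V)

    c_res = fit_centroids(R_tr, y_train)   # classifier on residuals
    c_lat = fit_centroids(Z_tr, y_train)   # classifier on latent variables

    # Put both scores on a comparable scale using their training-set spread.
    eps = 1e-12
    s_res = score(R_te, c_res) / (score(R_tr, c_res).std() + eps)
    s_lat = score(Z_te, c_lat) / (score(Z_tr, c_lat).std() + eps)

    # Simple ensemble: equal-weight average of the two standardized scores.
    return ((s_res + s_lat) / 2 > 0).astype(int)
```

Calling `fit_predict(X_train, y_train, X_test, k=2)` with NumPy arrays of shape (n, p), (n,), and (m, p) returns an array of m predicted labels. The equal-weight average is only one possible way to re-integrate the latent variables; a data-driven weighting would be a natural alternative.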