Handling highly correlated genes in prediction analysis of genomic studies

Background: Selecting feature genes to predict phenotypes is a typical task in the analysis of genomics data. Although many general-purpose algorithms have been developed for prediction, handling highly correlated genes in the prediction model remains poorly addressed. High correlation among genes introduces technical problems, such as multicollinearity, that lead to unreliable prediction models. Furthermore, when a causal gene (one whose variants have an actual biological effect on a phenotype) is highly correlated with other genes, most algorithms select the feature gene from the correlated group in a purely data-driven manner, so the selected gene need not be the causal one. Because the correlation structure among genes can change substantially when conditions change, a prediction model built on incorrectly selected feature genes is unreliable. We therefore aim to preserve the causal biological signal in the prediction process and build a more robust prediction model. Method: We propose a grouping algorithm that treats highly correlated genes as a group and uses their common pattern to represent the group's biological signal in feature selection. The grouping algorithm can be integrated into existing prediction algorithms to enhance their prediction performance. It has two advantages. First, using the common pattern of each gene group makes the prediction more robust and reliable when conditions change. Second, it reports whole correlated gene groups as discovered biomarkers for prediction tasks, allowing researchers to conduct follow-up studies to identify the causal genes within the identified groups. Results: Using real benchmark scRNA-seq datasets with simulated cell phenotypes, we demonstrate that our method significantly outperforms standard models in both (1) prediction of cell phenotypes and (2) feature gene selection.
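The abstract does not say how groups are formed or how the "common pattern" is computed, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that groups are the connected components of a thresholded absolute-correlation graph and that each group's common pattern is the mean of its sign-aligned, z-scored member genes. The function names `group_correlated_genes` and `group_features` and the threshold of 0.8 are hypothetical choices made for illustration.

```python
import numpy as np


def group_correlated_genes(X, threshold=0.8):
    """Group genes (columns of X) by absolute Pearson correlation.

    Assumption for this sketch: two genes are linked if |corr| >= threshold,
    and a group is a connected component of the resulting graph.
    Constant genes should be filtered out beforehand (their correlation
    is undefined).
    """
    n_genes = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)
    linked = np.abs(corr) >= threshold

    assigned = np.zeros(n_genes, dtype=bool)
    groups = []
    for start in range(n_genes):
        if assigned[start]:
            continue
        # Depth-first search over the thresholded correlation graph.
        assigned[start] = True
        stack = [start]
        members = []
        while stack:
            gene = stack.pop()
            members.append(int(gene))
            for neighbour in np.flatnonzero(linked[gene]):
                if not assigned[neighbour]:
                    assigned[neighbour] = True
                    stack.append(neighbour)
        groups.append(sorted(members))
    return groups


def group_features(X, groups):
    """Summarise each group by one feature (hypothetical 'common pattern').

    Each gene is z-scored, genes negatively correlated with the first
    member of their group are sign-flipped, and the group feature is the
    mean across members.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    features = []
    for members in groups:
        block = Z[:, members]
        # Align signs with the first member so opposite trends do not cancel.
        signs = np.sign(block.T @ block[:, 0])
        signs[signs == 0] = 1.0
        features.append((block * signs).mean(axis=1))
    return np.column_stack(features)
```

The resulting matrix has one column per gene group and can be passed to an existing predictor in place of the raw gene matrix. A feature selected by that predictor then maps back to a whole group of candidate genes through `groups`, which mirrors the second advantage described in the abstract.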
