In bioinformatics, the rapid development of sequencing technology has made it possible to collect ever larger amounts of omics data. Classification based on omics data is one of the central problems in biomedical research. However, omics data typically combine a small sample size with a high-dimensional feature space, and it is assumed that only a few features (biomarkers) are active, i.e., informative for discriminating between categories (for example, cancer subtypes or responders/non-responders to a treatment). Identifying the active biomarkers for classification has therefore become fundamental to omics data analysis. Focusing on binary classification, we propose a feature selection method designed to handle high correlations between biomarkers. Several studies have shown that correlated biomarkers adversely affect feature selection and make it difficult to identify the active ones accurately. Our method, WLogit, first whitens the design matrix to remove the correlations between biomarkers, then selects features using a penalized criterion adapted to the logistic regression model. The performance of WLogit is assessed on synthetic data in several scenarios and compared with that of other approaches. The results suggest that WLogit can identify almost all active biomarkers even when the biomarkers are highly correlated, a setting in which the other methods fail, and that this in turn leads to higher classification accuracy. The method is also evaluated on the classification of two lymphoma subtypes, where the resulting classifier again outperforms the other methods. Our method is implemented in the \texttt{WLogit} R package available from the Comprehensive R Archive Network (CRAN).
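The two-step idea described above (whiten the design matrix, then fit a penalized logistic regression) can be illustrated with a minimal NumPy sketch. This is not the WLogit algorithm or the API of the \texttt{WLogit} R package: the function names `whiten` and `l1_logistic`, the use of the sample covariance for whitening, the plain lasso penalty solved by proximal gradient descent, the penalty level `lam`, and the toy data are all assumptions made for illustration. The sketch also assumes more samples than features so that the sample covariance is invertible, which does not hold for typical omics data.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Center X and multiply it by the inverse square root of its sample covariance.

    Returns the whitened matrix and the inverse square root (needed to map
    coefficients back to the original feature space).
    """
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    eigval, eigvec = np.linalg.eigh(cov)
    inv_sqrt = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
    return Xc @ inv_sqrt, inv_sqrt

def l1_logistic(X, y, lam=0.05, n_iter=2000):
    """L1-penalized logistic regression (no intercept) via proximal gradient descent."""
    n, p = X.shape
    # Step size 1/L, where L = ||X||_2^2 / (4 n) bounds the curvature of the mean logistic loss.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (prob - y) / n
        z = beta - step * grad
        # Soft-thresholding: proximal operator of the L1 penalty.
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

# Toy data (hypothetical): 8 strongly correlated features, 2 of them active.
rng = np.random.default_rng(0)
n, p, rho = 300, 8, 0.9
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta_true = np.zeros(p)
beta_true[[0, 3]] = [2.0, -2.0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

X_w, inv_sqrt = whiten(X)            # columns of X_w are empirically uncorrelated
beta_tilde = l1_logistic(X_w, y)     # coefficients in the whitened space
beta_hat = inv_sqrt @ beta_tilde     # coefficients mapped back to the original features
```

Because the whitened columns are linear combinations of the original biomarkers, the coefficients fitted in the whitened space have to be mapped back (here by multiplying with the inverse square root of the covariance) before they can be read as per-biomarker effects. How the support is then selected from the back-transformed coefficients is left open in this sketch.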