Advances in data collection technologies in genomics have significantly increased the need for tools designed to study the genetic basis of many diseases. Statistical tools used to discover associations between the expression of certain genes and the presence of disease should ideally perform well in terms of both prediction accuracy and identification of key biomarkers. We propose a new approach to high-dimensional binary classification problems that combines ideas from regularization and ensembling. The ensembles are composed of a relatively small number of highly accurate and interpretable models that are learned directly from the data by minimizing a global objective function. We derive the asymptotic properties of our method and develop an efficient algorithm to compute the ensembles. We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical genomics datasets involving common diseases such as cancer, multiple sclerosis, and psoriasis. In several applications, our method identified key biomarkers that state-of-the-art competitor methods did not select. We also develop a variable importance ranking tool that may help researchers focus on the most promising genes. Finally, based on numerical experiments, we provide guidelines for choosing the number of models in our ensembles.