Despite the availability of numerous statistical and machine learning tools for joint feature modeling, many scientists investigate features marginally, i.e., one feature at a time. This is partly due to training and convention, but it is also rooted in scientists' strong interest in simple visualization and interpretability. As such, marginal feature ranking for predictive tasks, e.g., prediction of cancer driver genes, is widely practiced in the process of scientific discovery. In this work, we focus on marginal ranking for binary prediction, arguably the most common predictive task. We argue that the most widely used marginal ranking criteria, including the Pearson correlation, the two-sample t test, and the two-sample Wilcoxon rank-sum test, do not fully take feature distributions and prediction objectives into account. To address this gap in practice, we propose two ranking criteria corresponding to two prediction objectives: the classical criterion (CC) and the Neyman-Pearson criterion (NPC), both of which use model-free nonparametric implementations to accommodate diverse feature distributions. Theoretically, we show that under regularity conditions both criteria achieve sample-level rankings that are consistent with their population-level counterparts with high probability. Moreover, NPC is robust to sampling bias, i.e., to the two class proportions in a sample deviating from those in the population. This property gives NPC good potential in biomedical research, where sampling bias is common. We demonstrate the use and relative advantages of CC and NPC in simulation and real data studies. Our model-free, objective-based ranking idea is extendable to ranking feature subsets and generalizable to other prediction tasks and learning objectives.
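To make the two objectives concrete, the following is a minimal sketch of objective-based marginal ranking, not the authors' implementation. It assumes a particular nonparametric recipe that the abstract does not specify: class-conditional densities are estimated with a Gaussian kernel on one half of the sample, and errors are measured on the other half. The names `kde_1d` and `marginal_scores`, the rule-of-thumb bandwidth, the in-sample quantile threshold, and the simulated data are all hypothetical choices made for illustration. In particular, the simplified threshold here does not reproduce any high-probability type I error control.

```python
import numpy as np


def kde_1d(train, points):
    """Gaussian kernel density estimate fitted on `train`, evaluated at `points`."""
    train = np.asarray(train, dtype=float)
    points = np.asarray(points, dtype=float)
    n = train.size
    # Normal-reference rule-of-thumb bandwidth (an assumption of this sketch).
    h = max(1.06 * train.std(ddof=1) * n ** (-0.2), 1e-12)
    z = (points[:, None] - train[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))


def marginal_scores(x, y, alpha=0.05, seed=0):
    """Return (cc, npc) for a single feature x with binary labels y in {0, 1}.

    cc  : held-out classification error of a plug-in density-ratio classifier.
    npc : held-out type II error when the threshold targets type I error alpha.
    Lower is better for both.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    fit, ev = idx[: len(y) // 2], idx[len(y) // 2:]
    x_fit, y_fit, x_ev, y_ev = x[fit], y[fit], x[ev], y[ev]
    x0, x1 = x_fit[y_fit == 0], x_fit[y_fit == 1]
    eps = 1e-300  # keeps the logarithm finite where a density estimate underflows

    def log_ratio(pts):
        return np.log(kde_1d(x1, pts) + eps) - np.log(kde_1d(x0, pts) + eps)

    s_ev = log_ratio(x_ev)

    # Classical objective: predict class 1 when pi1 * f1 > (1 - pi1) * f0.
    pi1 = y_fit.mean()
    cc_pred = (s_ev > np.log((1 - pi1) / pi1)).astype(int)
    cc = float(np.mean(cc_pred != y_ev))

    # Neyman-Pearson-style objective: set the threshold at the (1 - alpha)
    # quantile of class-0 scores, then measure the type II error on class 1.
    threshold = np.quantile(log_ratio(x0), 1 - alpha)
    npc = float(np.mean(s_ev[y_ev == 1] <= threshold))
    return cc, npc


# Simulated example: feature 0 separates the classes, feature 1 is pure noise.
rng = np.random.default_rng(1)
n = 600
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(loc=2.0 * y, scale=1.0),  # class means differ by 2
    rng.normal(size=n),                  # unrelated to the label
])

scores = np.array([marginal_scores(X[:, j], y) for j in range(X.shape[1])])
cc_ranking = np.argsort(scores[:, 0])   # best feature first under CC
npc_ranking = np.argsort(scores[:, 1])  # best feature first under NPC
print("CC ranking:", cc_ranking, "NPC ranking:", npc_ranking)
```

Each feature is scored on its own and features are then sorted by the score, so the ranking depends on the stated prediction objective rather than on a test statistic. Because the NPC-style score uses only within-class error rates, it does not depend on the class proportions in the sample, which is the intuition behind the robustness to sampling bias mentioned above.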