Variable selection methods are widely used in molecular biology to detect biomarkers or to infer gene regulatory networks from transcriptomic data. These methods are mainly based on the high-dimensional Gaussian linear regression model, and this review focuses on that framework. We compare variable selection procedures based on regularization paths in three simulation settings. In the first, the variables are independent, which allows the methods to be evaluated in the theoretical framework used to develop them. In the second, two correlation structures between variables are considered to evaluate how the dependencies usually observed in biological data affect estimation. Finally, the third setting mimics the biological complexity of transcription factor regulation; it is the setting farthest from the Gaussian framework. In all settings, predictive performance and the identification of explanatory variables are evaluated for each method. Our results show that variable selection procedures rely on statistical assumptions that should be carefully checked. The Gaussian assumption and the number of explanatory variables are the two key points. As soon as correlation exists, the Elastic-net regularization function provides better results than the Lasso. LinSelect, a non-asymptotic model selection method, should be preferred to the commonly used eBIC criterion. Bolasso is a judicious strategy for limiting the selection of non-explanatory variables.
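For reference, the framework named above can be written out as follows. This is a sketch in standard textbook notation, not taken from the abstract itself: the symbols $n$, $p$, $X$, $\beta$, $\sigma$, $\lambda$ and $\alpha$ are assumptions introduced here, and the scaling of the loss and the parameterization of the Elastic-net penalty vary between papers and software packages.

```latex
% Notation is assumed for illustration and does not come from the abstract.
% y: n-vector of responses; X: n x p design matrix, typically with p > n.

% Gaussian linear regression model
\[
  y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)
\]

% Lasso: l1 penalty only
\[
  \hat{\beta}^{\mathrm{lasso}}(\lambda) \in \arg\min_{\beta \in \mathbb{R}^p}
  \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
\]

% Elastic-net: mixes l1 and squared l2 penalties; alpha = 1 recovers the Lasso
\[
  \hat{\beta}^{\mathrm{enet}}(\lambda, \alpha) \in \arg\min_{\beta \in \mathbb{R}^p}
  \frac{1}{2n} \lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right),
  \qquad \alpha \in [0, 1]
\]
```

Under this notation, letting $\lambda$ decrease from a large value traces a regularization path. Each value of $\lambda$ yields a candidate set of selected variables $\{ j : \hat{\beta}_j \neq 0 \}$, from which a criterion such as eBIC or LinSelect picks one model.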