Methods based on partial least squares (PLS) regression, which has recently gained much attention in the analysis of high-dimensional genomic datasets, have been developed since the early 2000s for performing variable selection. Most of these techniques rely on tuning parameters that are often determined by cross-validation (CV) based methods, which raises important stability issues. To overcome this, we have developed a new dynamic bootstrap-based method for significant predictor selection, suitable for both PLS regression and its incorporation into generalized linear models (GPLS). It relies on the construction of bootstrap confidence intervals, which allow the significance of predictors to be tested at a preset type I risk $\alpha$ and avoid the use of CV. We have also developed adapted versions of sparse PLS (SPLS) and sparse GPLS (SGPLS) regression, using a recently introduced non-parametric bootstrap-based technique to determine the number of components. We compare their variable selection reliability, their stability with respect to tuning parameter determination, and their predictive ability, using simulated data for PLS and real microarray gene expression data for PLS-logistic classification. We observe that our new dynamic bootstrap-based method separates random noise in $y$ from the relevant information better than the other methods, leading to better accuracy and predictive ability, especially for non-negligible noise levels.

Keywords: Variable selection, PLS, GPLS, Bootstrap, Stability
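To make the idea of selecting predictors from bootstrap confidence intervals concrete, the following is a minimal sketch, not the dynamic method described above. It assumes a one-component PLS1 fit, a plain percentile bootstrap over observations, and toy simulated data. The names `pls1_coefs`, `bootstrap_select` and `make_data` are hypothetical and chosen only for illustration. A predictor is kept when its $(1-\alpha)$ percentile interval excludes zero.

```python
import random


def pls1_coefs(X, y):
    """Regression coefficients of a one-component PLS1 fit on centered data."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: proportional to the covariance of each predictor with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores t = Xc w, then least-squares regression of y on t.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    return [q * wj for wj in w]


def bootstrap_select(X, y, alpha=0.05, n_boot=200, seed=0):
    """Keep predictors whose percentile bootstrap interval excludes zero."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    draws = []
    for _ in range(n_boot):
        # Resample observations (rows) with replacement and refit.
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(pls1_coefs([X[i] for i in idx], [y[i] for i in idx]))
    intervals = []
    for j in range(p):
        vals = sorted(d[j] for d in draws)
        lower = vals[int((alpha / 2) * (n_boot - 1))]
        upper = vals[int((1 - alpha / 2) * (n_boot - 1))]
        intervals.append((lower, upper))
    selected = [j for j, (lo, hi) in enumerate(intervals) if lo > 0 or hi < 0]
    return selected, intervals


def make_data(n=100, p=5, seed=1):
    """Toy data (an assumption): only predictor 0 carries signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] + rng.gauss(0, 1) for row in X]
    return X, y


X, y = make_data()
selected, intervals = bootstrap_select(X, y)
print("selected predictors:", selected)
```

Because selection here depends only on the preset risk level `alpha` and the number of bootstrap replicates, no cross-validated tuning parameter is involved. Choosing the number of components and handling the generalized (GPLS) case are outside the scope of this sketch.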