As its name suggests, sufficient dimension reduction (SDR) aims to estimate, from data, a subspace that contains all the information needed to explain a dependent variable. Many approaches to SDR exist, and some of the most recent rely on minimal or no model assumptions. These are defined through an optimization criterion that maximizes a nonparametric measure of association. The original estimators are nonsparse, which means that all variables contribute to the model. However, many practical applications call for an SDR technique that is sparse and, as such, intrinsically performs sufficient variable selection (SVS). This paper examines how such a sparse SDR estimator can be constructed. Three variants are investigated, each based on a different measure of association: distance covariance, martingale difference divergence and ball covariance. A simulation study shows that each of these estimators can achieve correct variable selection in highly nonlinear contexts, yet each is sensitive to outliers and computationally intensive. The study sheds light on the subtle differences between the methods. Two examples illustrate how these new estimators can be applied in practice, with a slight preference for the option based on martingale difference divergence in the bioinformatics example.
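To make the optimization criterion concrete, the following is a minimal sketch of one of the three association measures, the squared sample distance covariance, evaluated between a one-dimensional projection of the predictors and the response. It is not the sparse estimator studied in the paper: there is no sparsity penalty and no optimizer, and all names (`dcov_sq`, `project`, the toy data) are hypothetical choices made for illustration. It only shows the kind of objective that such an estimator would maximize over projection directions.

```python
import math
import random


def pairwise_distances(rows):
    """Euclidean distance matrix between rows (each row is a sequence of floats)."""
    return [[math.dist(a, b) for b in rows] for a in rows]


def double_center(d):
    """Subtract row and column means from a square matrix, add back the grand mean."""
    n = len(d)
    row_means = [sum(row) / n for row in d]
    col_means = [sum(d[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_means) / n
    return [
        [d[i][j] - row_means[i] - col_means[j] + grand for j in range(n)]
        for i in range(n)
    ]


def dcov_sq(x_rows, y_rows):
    """Squared sample distance covariance (V-statistic form)."""
    n = len(x_rows)
    a = double_center(pairwise_distances(x_rows))
    b = double_center(pairwise_distances(y_rows))
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def project(x_rows, beta):
    """Project each row of X onto the direction beta (returns n rows of length 1)."""
    return [[sum(xi * bi for xi, bi in zip(row, beta))] for row in x_rows]


# Toy data (an assumption for illustration): y depends nonlinearly on the
# first predictor only; the second predictor is pure noise.
rng = random.Random(0)
X = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(200)]
y = [[row[0] ** 2] for row in X]

# Evaluate the objective at two candidate directions.
score_signal = dcov_sq(project(X, [1.0, 0.0]), y)  # uses the relevant variable
score_noise = dcov_sq(project(X, [0.0, 1.0]), y)   # uses the irrelevant variable

print("direction (1, 0):", score_signal)
print("direction (0, 1):", score_noise)
```

In this toy setting the direction that loads only on the relevant variable scores higher, even though the dependence is purely nonlinear. A sparse SDR estimator of the kind described above would search over directions for the maximizer while driving the coefficients of irrelevant variables to exactly zero.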