Selecting the active variables that have a significant impact on the event of interest is an important problem in the statistical analysis of ultrahigh-dimensional data. Sure independence screening has been shown to be an effective way to reduce the dimensionality of such data from a large scale to a relatively moderate one. For censored survival data, existing screening methods mainly rely on the Kaplan--Meier estimator to handle censoring, which may not perform well when the censoring rate is heavy. In this article, we propose a model-free screening procedure based on the Hilbert-Schmidt independence criterion (HSIC). The proposed method avoids the complication of specifying an actual model from a large number of covariates. Compared with existing screening procedures, the new approach has several advantages. First, it does not involve the Kaplan--Meier estimator, so its performance is much more robust under heavy censoring. Second, the empirical estimate of HSIC is simple to compute, as it depends only on the trace of a product of Gram matrices. Third, the procedure does not require any complicated numerical optimization, so the computation is simple and fast. Finally, because it employs the kernel method, the procedure is substantially more resistant to outliers. Extensive simulation studies demonstrate that the proposed method performs favorably compared with existing methods. As an illustration, we apply the proposed method to the diffuse large-B-cell lymphoma (DLBCL) data and the ovarian cancer data.
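To make the "trace of a product of Gram matrices" remark concrete, the following is a minimal sketch of marginal screening with a generic empirical HSIC. It is not the procedure proposed in the article: the abstract does not say how censoring enters the criterion, so this sketch treats the response as fully observed and ignores censoring altogether. The Gaussian kernel, the median-heuristic bandwidth, the 1/n^2 normalisation (conventions vary), and the names `gaussian_gram`, `hsic`, and `hsic_screen` are all assumptions made for illustration.

```python
import numpy as np


def gaussian_gram(x, bandwidth=None):
    """Gaussian-kernel Gram matrix of a sample (kernel choice is an assumption)."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(x.shape[0], -1)  # treat a 1-D sample as n points in R^1
    sq = np.sum(x ** 2, axis=1)
    # Pairwise squared distances, clipped at zero against rounding error.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    if bandwidth is None:
        # Median heuristic: a common default, not taken from the article.
        positive = d2[d2 > 0]
        med = np.median(positive) if positive.size else 1.0
        bandwidth = np.sqrt(0.5 * med)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))


def hsic(x, y):
    """Biased empirical HSIC: trace(K H L H) / n^2, with H the centering matrix."""
    n = len(x)
    K = gaussian_gram(x)
    L = gaussian_gram(y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2


def hsic_screen(X, y, d):
    """Rank the columns of X by marginal HSIC with y and keep the top d.

    Censoring is ignored here; y is assumed to be fully observed.
    Returns (indices of the d highest-scoring columns, all scores).
    """
    X = np.asarray(X, dtype=float)
    scores = np.array([hsic(X[:, j], y) for j in range(X.shape[1])])
    top = np.argsort(scores)[::-1][:d]
    return top, scores


if __name__ == "__main__":
    # Synthetic illustration only: columns 0 and 3 drive the response.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 30))
    y = 3.0 * X[:, 0] + 3.0 * X[:, 3] + 0.5 * rng.normal(size=200)
    top, _ = hsic_screen(X, y, d=5)
    print("selected columns:", sorted(top.tolist()))
```

No optimization is involved: each covariate costs one Gram matrix and a few matrix products, which is the computational simplicity the abstract refers to. In practice the centered response matrix `H @ L @ H` could be computed once and reused across covariates.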