High-dimensional data are common in modern statistical applications, and variable selection methods play an indispensable role in identifying the critical features for scientific discovery. Traditional best subset selection is computationally intractable when the number of features is large, while regularization methods such as Lasso, SCAD and their variants perform poorly on ultrahigh-dimensional data because of low computational efficiency and algorithmic instability. Sure screening methods have become popular alternatives: they first rapidly reduce the dimension using a simple measure such as marginal correlation and then apply a regularization method to the retained features. Screening methods have been developed for a number of different models and problems; however, none of them targets heavy-tailed data, and heavy-tailedness is another important characteristic of modern big data. In this paper, we propose a robust distance correlation ("RDC") based sure screening method for ultrahigh-dimensional regression with heavy-tailed data. The proposed method shares the good properties of the original model-free distance correlation based screening, while having the additional merit of estimating the distance correlation robustly when the data are heavy-tailed, which improves model selection performance in screening. We conducted extensive simulations under different heavy-tailed scenarios to demonstrate the advantage of the proposed procedure over existing model-based and model-free screening procedures in terms of feature selection and prediction performance. We also applied the method to high-dimensional heavy-tailed RNA sequencing (RNA-seq) data from The Cancer Genome Atlas (TCGA) pancreatic cancer cohort, where RDC outperformed the other methods in prioritizing the most essential and biologically meaningful genes.
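The screening step described above can be illustrated with a minimal sketch. It implements the classical (non-robust) sample distance correlation and uses it to rank features marginally; it is not the proposed RDC estimator, whose robust construction is not specified in this abstract. The function names `distance_correlation` and `dc_screen`, the cutoff `d`, and the simulated data are assumptions made purely for illustration.

```python
import numpy as np


def _centered_dist(v):
    """Double-centered pairwise Euclidean distance matrix of the sample v."""
    v = np.asarray(v, dtype=float)
    v = v.reshape(v.shape[0], -1)  # treat a 1-D input as n samples of dimension 1
    d = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1))
    return (d
            - d.mean(axis=0, keepdims=True)
            - d.mean(axis=1, keepdims=True)
            + d.mean())


def distance_correlation(x, y):
    """Classical (V-statistic) sample distance correlation; not a robust version."""
    a = _centered_dist(x)
    b = _centered_dist(y)
    dcov2 = (a * b).mean()    # squared sample distance covariance
    dvar_x = (a * a).mean()   # squared sample distance variance of x
    dvar_y = (b * b).mean()   # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0            # convention when one sample is constant
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def dc_screen(X, y, d):
    """Rank the columns of X by distance correlation with y and keep the top d.

    Returns (indices of retained features, all marginal scores).
    """
    scores = np.array([distance_correlation(X[:, j], y)
                       for j in range(X.shape[1])])
    keep = np.argsort(-scores)[:d]
    return keep, scores


if __name__ == "__main__":
    # Hypothetical simulated data: 2 active features, heavy-tailed t(2) noise.
    rng = np.random.default_rng(0)
    n, p = 100, 200
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_t(df=2, size=n)
    keep, scores = dc_screen(X, y, d=10)
    print("retained feature indices:", sorted(keep.tolist()))
```

After screening, a regularization method such as Lasso would be fitted on the retained columns `X[:, keep]` only. Because the classical estimator averages raw pairwise distances, a few extreme observations can dominate the scores, which is the sensitivity that a robust variant is meant to address.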