High-dimensional data sets are common in genome-enabled prediction. Such data sets contain nonlinear relationships with complex dependence structures. In these situations, vine copula based (quantile) regression is an important tool. However, current vine copula based regression approaches do not scale to high and ultra-high dimensions. We propose two methods for high-dimensional sparse vine copula based regression. First, we show that they have lower computational complexity than existing methods. Second, we define relevant, irrelevant, and redundant explanatory variables for quantile regression. We then show, via simulation studies, our methods' power in selecting relevant variables and their prediction accuracy on high-dimensional sparse data sets. Next, we apply the proposed methods to high-dimensional real data for the genomic prediction of maize traits, and we discuss the data-processing and feature extraction steps used for these data. Finally, we show the advantage of our methods over linear models in the real data application.