High-dimensional compositional data are commonplace in the modern omics sciences, among other fields. Because compositional data carry only relative information, standard statistical methods cannot be applied to them directly, and their analysis requires a suitable orthonormal log-ratio coordinate representation. Principal balances, a specific class of log-ratio coordinates, are well suited to this context because they are constructed so that the first few coordinates capture most of the variability in the original data. Focusing on regression and classification problems in high dimensions, we propose a novel Partial Least Squares (PLS) based procedure to construct principal balances that maximize the explained variability of the response variable and that are notably easier to interpret than the ordinary PLS formulation. The proposed PLS principal balance approach can be understood as a generalization of common logcontrast models, since multiple orthonormal logcontrasts (instead of one) are estimated simultaneously. We demonstrate the performance of the method using both simulated and real data sets.
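As a minimal illustration of the objects named above, the following sketch computes a single balance coordinate and shows that it is a logcontrast, that is, a linear combination of log-parts whose coefficients sum to zero. It uses the standard balance definition for two groups of parts; the function names, the example composition, and the grouping are hypothetical choices made for illustration. It is not the PLS principal balance procedure proposed here.

```python
import math

def balance_coefficients(n_parts, numerator, denominator):
    """Logcontrast coefficients of the balance between two groups of parts.

    numerator and denominator are disjoint lists of part indices
    (hypothetical grouping chosen by the user, not estimated from data).
    """
    r, s = len(numerator), len(denominator)
    coefs = [0.0] * n_parts
    for i in numerator:
        coefs[i] = math.sqrt(s / (r * (r + s)))
    for j in denominator:
        coefs[j] = -math.sqrt(r / (s * (r + s)))
    return coefs

def logcontrast(composition, coefs):
    """Evaluate sum_i a_i * ln(x_i) for strictly positive parts x_i."""
    return sum(a * math.log(x) for a, x in zip(coefs, composition))

# Illustrative 4-part composition (made-up values, not from the paper).
x = [0.1, 0.2, 0.3, 0.4]
coefs = balance_coefficients(4, numerator=[0, 1], denominator=[2, 3])
b = logcontrast(x, coefs)

# Rescaling the composition leaves the balance unchanged,
# because the coefficients sum to zero.
scaled = [10 * v for v in x]
b_scaled = logcontrast(scaled, coefs)
```

The coefficients sum to zero and have unit Euclidean norm, so the coordinate depends only on the ratios between parts. Several such coefficient vectors that are mutually orthogonal form an orthonormal set of logcontrasts, which is the sense in which multiple logcontrasts can be estimated together.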