Sequencing-based technologies provide an abundance of high-dimensional biological datasets with skewed and zero-inflated measurements. Classifying such data with linear discriminant analysis performs poorly because the Gaussian distribution assumption is violated. To address this limitation, we propose a new semiparametric discriminant analysis framework based on the truncated latent Gaussian copula model that accommodates both skewness and zero inflation. By applying sparsity regularization, we demonstrate that the proposed method leads to consistent estimation of the classification direction in high-dimensional settings. On simulated data, the proposed method shows superior performance compared to the existing method. We apply the method to discriminate healthy controls from patients with Crohn's disease based on microbiome data and to identify the genera with the most influence on the classification rule.
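To make the data setting concrete, the following minimal sketch simulates measurements that are both skewed and zero-inflated from a truncated latent Gaussian variable. It is an illustration under stated assumptions, not the estimator proposed in the paper: the abstract does not spell out the model, so the sketch assumes a latent vector Z ~ N(0, corr) that is observed as a monotone transform of Z_j when Z_j exceeds a threshold and as zero otherwise. The helper names `sample_truncated_copula` and `skewness`, the choice of `exp` as the transform, and the threshold and correlation values are all hypothetical.

```python
import math
import numpy as np


def sample_truncated_copula(n, corr, thresholds, rng):
    """Hypothetical helper: draw n rows from an assumed truncated latent Gaussian model.

    Latent Z ~ N(0, corr). Each coordinate is observed as exp(Z_j) when
    Z_j > thresholds[j] and as exactly 0 otherwise. exp is an arbitrary
    monotone transform chosen for illustration only.
    """
    p = corr.shape[0]
    z = rng.multivariate_normal(np.zeros(p), corr, size=n)
    return np.where(z > thresholds, np.exp(z), 0.0)


def skewness(v):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    centered = v - v.mean()
    return (centered ** 3).mean() / v.std() ** 3


rng = np.random.default_rng(0)
corr = np.array([[1.0, 0.5], [0.5, 1.0]])   # assumed latent correlation
thresholds = np.array([0.0, 1.0])           # assumed truncation levels
X = sample_truncated_copula(5000, corr, thresholds, rng)

# Under this model the zero fraction of variable j should be close to
# the standard normal CDF evaluated at thresholds[j].
zero_frac = (X == 0).mean(axis=0)
expected_zero_frac = np.array(
    [0.5 * (1.0 + math.erf(t / math.sqrt(2.0))) for t in thresholds]
)

print("observed zero fraction:", zero_frac)
print("expected zero fraction:", expected_zero_frac)
print("sample skewness:", [skewness(X[:, j]) for j in range(X.shape[1])])
```

Each simulated column has a point mass at zero and a long right tail. A Gaussian-based rule such as linear discriminant analysis does not account for either feature, which is the limitation the abstract describes.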