Linear Discriminant Analysis (LDA) is a well-known technique for feature extraction and dimension reduction. The performance of classical LDA, however, degrades significantly on High Dimension Low Sample Size (HDLSS) data, where estimating the inverse covariance matrix becomes an ill-posed inverse problem. Existing approaches to HDLSS classification typically assume that the data follow a Gaussian distribution and address the problem through regularization. However, these assumptions are too strict to hold in many emerging real-life applications, such as personalized predictive analysis using Electronic Health Record (EHR) data collected from an extremely limited number of patients who have been diagnosed with or without the target disease. In this paper, we revisit the problem of predicting disease from personal EHR data with an LDA classifier. To fill this gap, we first study an analytical model that characterizes the accuracy of LDA when classifying data with an arbitrary distribution. The model gives a theoretical upper bound on the LDA error rate that is controlled by two factors: (1) the statistical convergence rate of the (inverse) covariance matrix estimator and (2) the divergence of the training and testing datasets from the fitted distributions. The error rate can therefore be lowered by balancing these two factors. We further propose De-Sparse, a novel LDA classifier that leverages the De-sparsified Graphical Lasso to improve the estimation of LDA and that outperforms state-of-the-art LDA approaches developed for HDLSS data. Its effectiveness is demonstrated by both theoretical analysis and extensive experiments on EHR datasets.
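For orientation, the quantities named in the abstract can be written out in their standard textbook forms. This is only a sketch under stated assumptions: the abstract gives no formulas, so the symbols below (class means \hat{\mu}_k, class priors \hat{\pi}_k, sample covariance \hat{\Sigma}, precision estimate \hat{\Theta}, penalty level \lambda) are notation introduced here, and the exact formulation used in the paper may differ.

```latex
% Two-class LDA rule with a plug-in precision (inverse covariance) estimate:
% assign x to class 1 when the discriminant score is positive.
\[
\delta(x) = \Bigl(x - \tfrac{1}{2}(\hat{\mu}_1 + \hat{\mu}_2)\Bigr)^{\top}
            \hat{\Theta}\,(\hat{\mu}_1 - \hat{\mu}_2)
            + \log\frac{\hat{\pi}_1}{\hat{\pi}_2} > 0 .
\]

% Graphical Lasso: an l1-penalized Gaussian likelihood estimate of the precision matrix.
\[
\hat{\Theta} = \arg\min_{\Theta \succ 0}
  \Bigl\{ \operatorname{tr}(\hat{\Sigma}\,\Theta) - \log\det\Theta
          + \lambda \lVert \Theta \rVert_{1} \Bigr\} .
\]

% De-sparsified (debiased) Graphical Lasso: a one-step bias correction of \hat{\Theta}.
\[
\hat{T} = 2\,\hat{\Theta} - \hat{\Theta}\,\hat{\Sigma}\,\hat{\Theta} .
\]
```

Under this reading, a De-Sparse-style classifier would presumably use \hat{T} in place of \hat{\Theta} in the discriminant score. In the HDLSS regime, \hat{\Sigma} is singular whenever the dimension exceeds the sample size, which is why a regularized estimate is used instead of a direct inverse.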