In microbiome and genomic studies, regression on compositional data is a crucial tool for identifying microbial taxa or genes associated with clinical phenotypes. To account for variation in sequencing depth, the classic log-contrast model is often used, with read counts normalized into compositions. However, zero read counts and randomness in the covariates remain critical issues. In this article, we introduce a surprisingly simple, interpretable, and efficient method for estimation in compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method corrects for sequencing data with possible overdispersion while avoiding any subjective imputation of zero read counts. We provide theoretical justification with matching upper and lower bounds for the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.
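For orientation, the classic log-contrast model mentioned above can be written as follows. This is a sketch of the standard textbook form, not the proposed log-error-in-variable model; the symbols $w_{ij}$ (read count of taxon $j$ in sample $i$), $x_{ij}$, $\beta_j$, and $\varepsilon_i$ are notation assumed here for illustration and do not come from the abstract.

```latex
% Read counts w_{ij} are normalized into compositions x_{ij}:
\[
  x_{ij} = \frac{w_{ij}}{\sum_{k=1}^{p} w_{ik}}, \qquad \sum_{j=1}^{p} x_{ij} = 1 .
\]
% Log-contrast regression of the response y_i on the log-compositions,
% with a zero-sum constraint on the coefficients:
\[
  y_i = \sum_{j=1}^{p} \beta_j \log x_{ij} + \varepsilon_i ,
  \qquad \text{subject to } \sum_{j=1}^{p} \beta_j = 0 .
\]
```

Written this way, the zero-count issue is visible directly: whenever $w_{ij} = 0$, the term $\log x_{ij}$ is undefined, which is why zeros are commonly imputed before fitting.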