In microbiome studies, one way of studying bacterial abundances is to estimate bacterial composition from sequencing read counts. Various transformations are then applied to such compositional data for downstream statistical analysis, among which the centered log-ratio (clr) transformation is the most commonly used. Because of limited sequencing depth and DNA dropouts, many rare bacterial taxa may not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation by count normalization therefore yields many zero proportions, and since the clr transformation takes logarithms of the proportions, it cannot be computed. This paper proposes a multi-sample approach that estimates the clr matrix directly, borrowing information across samples and across species. Empirical results from real datasets suggest that the clr matrix over multiple samples is approximately low rank, which motivates regularized maximum likelihood estimation with a nuclear norm penalty. An efficient optimization algorithm based on the generalized accelerated proximal gradient method is developed. Theoretical upper bounds on the estimation error and on the corresponding singular subspace errors are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The method is applied to a gut microbiome dataset and to data from the American Gut Project.
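The zero-count problem and the role of the nuclear norm can be illustrated with a minimal NumPy sketch. It is not the paper's estimator: the toy `counts` matrix, the 0.5 pseudocount, and the threshold `tau=1.0` are assumptions chosen for illustration, and `svt` is only the standard singular value soft-thresholding step (the proximal operator of the nuclear norm) on which proximal gradient methods of this kind are typically built.

```python
import numpy as np

def clr(P):
    """Centered log-ratio of each row of a composition matrix P (rows sum to 1)."""
    logP = np.log(P)
    return logP - logP.mean(axis=1, keepdims=True)

def svt(Z, tau):
    """Singular value soft-thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# Hypothetical toy data: 3 samples (rows) by 4 taxa (columns), with zero counts.
counts = np.array([[10, 5, 0, 85],
                   [20, 0, 30, 50],
                   [1, 9, 40, 50]], dtype=float)

# Naive estimate: normalize counts, then clr. Zero proportions give log(0),
# so every row that contains a zero ends up with non-finite clr values.
naive = counts / counts.sum(axis=1, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    naive_clr = clr(naive)

# A common ad hoc workaround (an assumption here, not the proposed method):
# add a pseudocount before normalizing.
pseudo = counts + 0.5
pseudo_clr = clr(pseudo / pseudo.sum(axis=1, keepdims=True))

# One soft-thresholding step shrinks the singular values toward a low-rank matrix.
low_rank = svt(pseudo_clr, tau=1.0)
```

Each row of a clr matrix sums to zero, and soft-thresholding preserves that property, because the right singular vectors with nonzero singular values are orthogonal to the all-ones vector.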