High-throughput sequencing (HTS) technology makes it possible to identify and quantify non-culturable microbial organisms in all environments. Microbial sequences have enhanced our understanding of the human microbiome, the soil and plant environment, and the marine environment. All molecular microbial data pose statistical challenges due to contaminating sequences from reagents, batch effects, unequal sampling, and undetected taxa. Technical biases and heteroscedasticity have the strongest effects, but differences in strains across subjects and environments also make direct differential abundance testing unwieldy. We provide an introduction to a few statistical tools that can overcome some of these difficulties and demonstrate those tools on an example. We show how standard statistical methods, such as simple hierarchical mixture and topic models, can facilitate inferences on latent microbial communities. We also review some nonparametric Bayesian approaches that combine visualization and uncertainty quantification. The intersection of molecular microbial biology and statistics is an exciting new venue. Finally, we list some of the important open problems that would benefit from more careful statistical method development.