The Dirichlet-multinomial (DM) distribution plays a fundamental role in the development and application of modern statistical methodology. Recently, the DM distribution and its variants have been used extensively to model multivariate count data generated by high-throughput sequencing technology in omics research, owing to their ability to accommodate both the compositional structure of the data and overdispersion. A major limitation of the DM distribution is that it cannot handle the excess zeros typically found in practice, which may bias inference. To fill this gap, we propose a novel Bayesian zero-inflated DM model for multivariate compositional count data with excess zeros. We then extend our approach to regression settings and embed sparsity-inducing priors to perform variable selection in high-dimensional covariate spaces. Throughout, modeling decisions are made to boost scalability without sacrificing interpretability or imposing limiting assumptions. Extensive simulations and an application to a human gut microbiome data set are presented to compare the performance of the proposed method with that of existing approaches. We provide an accompanying R package with a user-friendly vignette so that the method can be applied to other data sets.