Microbiome compositional data are often high-dimensional and sparse, and they exhibit pervasive cross-sample heterogeneity. Generative modeling is a popular approach to analyzing such data, and an effective generative model must accurately characterize these key features. While high dimensionality and the abundance of zeros have received much attention, existing models often lack the flexibility to capture complex cross-sample variability. This limitation can reduce statistical efficiency and lead to misleading conclusions in tasks such as differential abundance analysis, clustering, and network analysis. We introduce a generative model, the "logistic-tree normal" (LTN) model, which addresses this issue and effectively captures key characteristics of microbiome data, including the abundance of zeros. LTN employs a tree-based decomposition to aggregate sparse taxa counts and uses a (multivariate) logistic-normal distribution at tree splits, which allows the covariance among taxa to be modeled flexibly as needed. The latent Gaussian structure of LTN enables the incorporation of multivariate analysis tools that enforce sparsity or low-rank assumptions on the covariance. As a versatile, fully generative model, LTN supports a wide range of applications and admits efficient Bayesian inference through conjugate blocked Gibbs sampling with Pólya-Gamma augmentation. We demonstrate the application of LTN in a compositional mixed-effects model for differential abundance analysis using numerical experiments and a reanalysis of the infant cohort in the DIABIMMUNE study. Our findings illustrate that LTN, by adequately accounting for cross-sample heterogeneity, generates an appropriate proportion of zeros without requiring an explicit zero-inflation component, confirming a recent viewpoint that "zero-inflation" in count-based sequencing data is often the result of unaccounted-for cross-sample variation.
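To make the generative mechanism concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes a fixed balanced binary tree over four hypothetical taxa (a, b, c, d), binomial splits at each internal node, and jointly Gaussian log-odds across the three internal nodes. The function name `sample_ltn_counts` and the values of `mu` and `sigma` are illustrative choices, not taken from the paper.

```python
import numpy as np


def sample_ltn_counts(n_samples, total_count, mu, sigma, rng):
    """Simulate leaf counts from a tree-structured logistic-normal sketch.

    Assumed (hypothetical) tree with three internal nodes:
      node 0: root, splits {a, b} from {c, d}
      node 1: splits a from b
      node 2: splits c from d
    """
    # Latent Gaussian log-odds, one per internal node, correlated via sigma.
    psi = rng.multivariate_normal(mu, sigma, size=n_samples)
    # Logistic transform: probability that a read goes to the left child.
    theta = 1.0 / (1.0 + np.exp(-psi))

    # Binomial split at the root, then at each of its two children.
    left = rng.binomial(total_count, theta[:, 0])
    right = total_count - left
    a = rng.binomial(left, theta[:, 1])
    b = left - a
    c = rng.binomial(right, theta[:, 2])
    d = right - c
    return np.column_stack([a, b, c, d])


rng = np.random.default_rng(0)
# Illustrative parameters: large variances stand in for strong
# cross-sample heterogeneity in the split probabilities.
mu = np.array([0.0, -2.0, 3.0])
sigma = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 4.0, 0.0],
    [0.0, 0.0, 9.0],
])
counts = sample_ltn_counts(200, 1000, mu, sigma, rng)

# Per-taxon fraction of samples with a zero count; no zero-inflation
# component is present in the simulation.
zero_fraction = (counts == 0).mean(axis=0)
print(zero_fraction)
```

In this sketch, any zeros arise only from sample-to-sample variation in the split probabilities combined with binomial sampling. Increasing the diagonal of `sigma` is one way to explore how heterogeneity alone affects the zero proportion.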