Supervised Learning and Model Analysis with Compositional Data

The compositionality and sparsity of high-throughput sequencing data pose a challenge for regression and classification. In microbiome research in particular, however, conditional modeling is an essential tool for investigating relationships between phenotypes and the microbiome. Existing techniques are often inadequate: they either rely on extensions of the linear log-contrast model (which adjusts for compositionality but is often unable to capture useful signals), or they are based on black-box machine learning methods (which may capture useful signals but ignore compositionality in downstream analyses). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including those in the zero-structure, while automatically adapting model complexity. We demonstrate on-par or improved predictive performance compared with state-of-the-art machine learning methods. Additionally, our framework provides two key advantages: (i) we propose two novel quantities to interpret the contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast models to nonparametric models; (ii) we show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. Finally, we apply the KernelBiome framework to two public microbiome studies and illustrate the proposed model analysis. KernelBiome is available as an open-source Python package at https://github.com/shimenghuang/KernelBiome.
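
To make the idea of kernel-based regression on sparse compositions concrete, the following is a minimal sketch and not the KernelBiome API. It assumes a Gaussian kernel built on the Hellinger distance, chosen here only because it stays well defined when components are exactly zero, and it uses plain kernel ridge regression on synthetic data. All function names (`closure`, `hellinger_rbf_kernel`, `fit_kernel_ridge`) and the parameters `sigma` and `lam` are hypothetical.

```python
import numpy as np


def closure(counts):
    """Normalize nonnegative count rows so that each row sums to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def hellinger_rbf_kernel(X, Y, sigma=0.5):
    """Gaussian kernel on the square-root embedding of compositions.

    k(x, y) = exp(-||sqrt(x) - sqrt(y)||^2 / (2 * sigma^2)).
    No logarithm is taken, so zero components need no pseudo-counts.
    """
    S, T = np.sqrt(X), np.sqrt(Y)
    sq_dist = (
        (S ** 2).sum(axis=1)[:, None]
        + (T ** 2).sum(axis=1)[None, :]
        - 2.0 * S @ T.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against round-off
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def fit_kernel_ridge(K, y, lam=1e-2):
    """Solve (K + lam * n * I) alpha = y for the dual coefficients."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), y)


# Synthetic sparse counts (assumption: illustrative data only).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(60, 5))
counts[:, 0] += 1  # ensure every row has a nonzero total
X = closure(counts)

# Response that depends partly on the zero pattern of component 1.
y = (X[:, 1] == 0).astype(float) + X[:, 2]

X_train, X_test = X[:40], X[40:]
y_train, y_test = y[:40], y[40:]

alpha = fit_kernel_ridge(hellinger_rbf_kernel(X_train, X_train), y_train)
y_pred = hellinger_rbf_kernel(X_test, X_train) @ alpha
print("test MSE:", np.mean((y_pred - y_test) ** 2))
```

The same kernel matrix defines a distance between samples, which is one way to read the link between kernels and distances mentioned above. In practice the bandwidth and regularization strength would be selected by cross-validation rather than fixed.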
