Motivation: Bacterial community composition is commonly quantified using 16S rRNA (ribosomal ribonucleic acid) gene sequencing. A defining characteristic of these datasets is the phylogenetic relationships among the variables. Here, we demonstrate the utility of modelling these phylogenetic relationships in two tasks (the two-sample test and host trait prediction) using a novel application of string kernels.

Results: We show via simulation studies that a kernel two-sample test using string kernels is sensitive to the phylogenetic scale of the difference between the two populations and is more powerful than tests using kernels based on popular microbial distance metrics. Using simulations and two real host trait prediction tasks, we also demonstrate how Gaussian process modelling can be used to infer the distribution of bacterial-host effects across the phylogenetic tree.
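The kernel two-sample test mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify which string kernel or test statistic is used, so everything below is an assumption made for illustration: a k-mer spectrum kernel between representative sequences, a sample-level kernel of the form K = X S Xᵀ built from relative abundances X, a biased maximum mean discrepancy (MMD) statistic, and a permutation null. All function names are hypothetical.

```python
from collections import Counter

import numpy as np


def spectrum_features(seq, k):
    """Count every k-mer occurring in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel_matrix(seqs, k):
    """Gram matrix S[i, j] = inner product of the k-mer count vectors
    of sequences i and j (assumed choice of string kernel)."""
    feats = [spectrum_features(s, k) for s in seqs]
    vocab = sorted(set().union(*feats))
    index = {kmer: j for j, kmer in enumerate(vocab)}
    phi = np.zeros((len(seqs), len(vocab)))
    for i, f in enumerate(feats):
        for kmer, count in f.items():
            phi[i, index[kmer]] = count
    return phi @ phi.T


def sample_kernel(X, S):
    """Kernel between samples, K = X S X^T, where each row of X holds
    the abundances of the taxa in one sample (assumed construction)."""
    return X @ S @ X.T


def mmd2(K, labels):
    """Biased estimate of the squared MMD from a Gram matrix and a
    boolean array marking membership of the first group."""
    a, b = labels, ~labels
    return (K[np.ix_(a, a)].mean()
            + K[np.ix_(b, b)].mean()
            - 2 * K[np.ix_(a, b)].mean())


def permutation_test(K, labels, n_perm=999, seed=0):
    """Permutation p-value for the MMD statistic: shuffle group labels
    and compare the observed statistic with the shuffled ones."""
    rng = np.random.default_rng(seed)
    observed = mmd2(K, labels)
    null = np.array([mmd2(K, rng.permutation(labels))
                     for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p_value


# Toy usage with made-up sequences and abundances (not real data).
toy_seqs = ["ACGTACGT", "ACGTACGA", "TTGGCCAA", "TTGGCCAT"]
S_toy = spectrum_kernel_matrix(toy_seqs, k=3)
rng_toy = np.random.default_rng(42)
X_toy = np.vstack([
    rng_toy.dirichlet([5, 5, 1, 1], size=10),  # group 1 favours taxa 0-1
    rng_toy.dirichlet([1, 1, 5, 5], size=10),  # group 2 favours taxa 2-3
])
labels_toy = np.array([True] * 10 + [False] * 10)
stat_toy, p_toy = permutation_test(sample_kernel(X_toy, S_toy), labels_toy)
print(f"MMD^2 = {stat_toy:.3f}, permutation p-value = {p_toy:.3f}")
```

In this sketch the k-mer length `k` plays the role of a resolution parameter: short k-mers make distantly related sequences look similar, while long k-mers only reward near-identical sequences. That is one plausible way a string kernel could be tuned to the phylogenetic scale of a difference, though the abstract itself does not say how the authors do it.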