In recent years, microbiome studies have become increasingly prevalent and large-scale. High-throughput sequencing technologies and well-established analytical pipelines now routinely produce relative abundance data for operational taxonomic units (OTUs), together with their associated taxonomic structures. Because such data can be extremely sparse and high-dimensional, dimension reduction is often needed to facilitate data visualization and downstream statistical analysis. We propose Principal Amalgamation Analysis (PAA), a novel amalgamation-based and taxonomy-guided dimension reduction paradigm for microbiome data. Our approach aggregates the compositions into a smaller number of principal compositions, guided by the available taxonomic structure, by minimizing a properly measured loss of information. The choice of loss function is flexible and can be based on familiar diversity indices, so as to preserve either within-sample or between-sample diversity in the data. To enable scalable computation, we develop a hierarchical PAA algorithm that traces the entire trajectory of successive simple amalgamations. Visualization tools, including dendrograms, scree plots, and ordination plots, are developed. The effectiveness of PAA is demonstrated using gut microbiome data from a preterm infant study and an HIV infection study.
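To make the idea of successive simple amalgamations concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single taxonomic level (one group label per taxon), uses mean Shannon diversity as the within-sample loss, and merges one pair of taxa at a time by summing their relative abundances. The names `shannon`, `greedy_amalgamate`, `groups`, and `n_keep` are hypothetical and chosen only for illustration.

```python
import numpy as np


def shannon(X):
    """Mean Shannon diversity over samples; each row of X is a composition."""
    # Treat 0 * log(0) as 0 so that sparse compositions are handled.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(X > 0, X * np.log(X), 0.0)
    return float(-terms.sum(axis=1).mean())


def greedy_amalgamate(X, groups, n_keep):
    """Greedily sum pairs of taxa that share a group label (assumed taxonomy).

    At each step, the admissible pair whose amalgamation loses the least
    mean Shannon diversity is merged. Stops at n_keep components or when
    no admissible pair remains.
    """
    cols = [X[:, j].copy() for j in range(X.shape[1])]
    members = [[j] for j in range(X.shape[1])]  # original taxa in each component
    labels = list(groups)
    history = []  # (merged member list, loss) for each step
    while len(cols) > n_keep:
        current = shannon(np.column_stack(cols))
        best = None
        for a in range(len(cols)):
            for b in range(a + 1, len(cols)):
                if labels[a] != labels[b]:
                    continue  # taxonomy guidance: merge only within a group
                rest = [c for k, c in enumerate(cols) if k not in (a, b)]
                trial = np.column_stack(rest + [cols[a] + cols[b]])
                loss = current - shannon(trial)
                if best is None or loss < best[0]:
                    best = (loss, a, b)
        if best is None:
            break  # every group is already a single component
        loss, a, b = best
        new_col = cols[a] + cols[b]
        new_members = members[a] + members[b]
        new_label = labels[a]
        history.append((new_members, loss))
        for k in (b, a):  # delete the larger index first
            del cols[k], members[k], labels[k]
        cols.append(new_col)
        members.append(new_members)
        labels.append(new_label)
    return np.column_stack(cols), members, history


# Toy data: 2 samples, 4 taxa, two hypothetical genera "A" and "B".
X = np.array([[0.50, 0.25, 0.25, 0.00],
              [0.40, 0.10, 0.10, 0.40]])
reduced, members, history = greedy_amalgamate(X, ["A", "A", "B", "B"], n_keep=2)
```

The recorded `history` plays the role of the merge trajectory that a dendrogram or scree plot would display. A full hierarchical procedure would also move up the taxonomic tree once a level is exhausted, which this sketch omits, and the exhaustive pair search here is not designed for scalability.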