From differential abundance to mtGWAS: accurate and scalable methodology for metabolomics data with non-ignorable missing observations and latent factors

May 24, 2022

Shangshu Zhao, Kedir Turi, Tina Hartert, Carole Ober, Klaus Bonnelykke, Bo Chawes, Hans Bisgaard, Chris McKennan

From arXiv: 19 pages of main text; 89 pages with supplement; 3 figures and 2 tables

Metabolomics is the high-throughput study of small molecule metabolites. Besides offering novel biological insights, these data contain unique statistical challenges, the most glaring of which is the many non-ignorable missing metabolite observations. To address this issue, nearly all analysis pipelines first impute missing observations, and subsequently perform analyses with methods designed for complete data. While clearly erroneous, these pipelines provide key practical advantages not present in existing statistically rigorous methods, including using both observed and missing data to increase power, fast computation to support phenome- and genome-wide analyses, and streamlined estimates for factor models. To bridge this gap between statistical fidelity and practical utility, we developed MS-NIMBLE, a statistically rigorous and powerful suite of methods that offers all the practical benefits of imputation pipelines to perform phenome-wide differential abundance analyses, metabolite genome-wide association studies (mtGWAS), and factor analysis with non-ignorable missing data. Critically, we tailor MS-NIMBLE to perform differential abundance and mtGWAS in the presence of latent factors, which reduces biases and improves power. In addition to proving its statistical and computational efficiency, we demonstrate its superior performance using three real metabolomic datasets.
