Compositional data, in which only the relative abundances of variables are measured, are ubiquitous. In health and medical applications of compositional data, an important class of biomarkers is the log ratios between groups of variables. However, selecting log ratios that are predictive of a response variable is a combinatorial problem. Existing greedy-search-based methods are time-consuming, which hinders their application to high-dimensional data sets. We propose a novel selection approach, called the supervised log ratio method, that can efficiently select predictive log ratios in high-dimensional settings. The proposed method is motivated by a latent variable model, and we show that the log ratio biomarker can be selected via simple clustering after supervised feature screening. The supervised log ratio method is implemented in an R package, which is publicly available at \url{https://github.com/drjingma/slr}. We illustrate the merits of our approach through simulation studies and analysis of a microbiome data set on HIV infection.
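To make the "screen, then cluster" idea concrete, the following is a minimal Python sketch under stated assumptions; it is not the interface or the exact algorithm of the R package. The function name `select_log_ratio`, the parameter `n_keep`, the use of absolute correlation with centered log-ratio (CLR) variables for screening, and the spectral bipartition of the pairwise log-ratio variance matrix for clustering are all illustrative choices, and the data are simulated from a toy latent-variable setup.

```python
import numpy as np


def select_log_ratio(X, y, n_keep):
    """Hypothetical helper: pick one log ratio between two groups of variables.

    X : (n, p) array of strictly positive compositions (rows sum to 1).
    y : (n,) continuous response.
    n_keep : number of variables retained by screening (assumed >= 2).
    """
    logX = np.log(X)

    # Step 1 (screening): rank variables by the absolute correlation between
    # their CLR-transformed values and the response, keep the top n_keep.
    Z = logX - logX.mean(axis=1, keepdims=True)
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    corr = (Zc.T @ yc) / (np.linalg.norm(Zc, axis=0) * np.linalg.norm(yc))
    keep = np.sort(np.argsort(-np.abs(corr))[:n_keep])

    # Step 2 (clustering): variables whose pairwise log ratio varies little
    # are treated as similar; split them in two with the Fiedler vector.
    L = logX[:, keep]
    T = (L[:, :, None] - L[:, None, :]).var(axis=0)  # var of log(x_j / x_k)
    S = np.exp(-T / T.max())
    np.fill_diagonal(S, 0.0)
    laplacian = np.diag(S.sum(axis=1)) - S
    _, vecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    num, den = keep[fiedler > 0], keep[fiedler <= 0]

    # Step 3: log ratio of the geometric means of the two groups, oriented
    # so that it is positively correlated with the response.
    balance = logX[:, num].mean(axis=1) - logX[:, den].mean(axis=1)
    if np.corrcoef(balance, y)[0, 1] < 0:
        num, den, balance = den, num, -balance
    return num, den, balance


# Toy data: a latent factor u raises variables 0-2 and lowers variables 3-5.
rng = np.random.default_rng(0)
n, p = 200, 30
u = rng.normal(size=n)
log_abund = rng.normal(scale=0.3, size=(n, p))
log_abund[:, :3] += u[:, None]
log_abund[:, 3:6] -= u[:, None]
X = np.exp(log_abund)
X /= X.sum(axis=1, keepdims=True)  # close each row to relative abundances
y = u + 0.1 * rng.normal(size=n)

numerator, denominator, balance = select_log_ratio(X, y, n_keep=6)
print("numerator:", numerator.tolist())
print("denominator:", denominator.tolist())
```

Both steps cost only a correlation pass over the variables and one eigendecomposition of a small matrix, which is the kind of saving over a greedy search across variable subsets that the abstract refers to.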