Binary disease prediction using tail quantiles of the distribution of continuous biomarkers

In binary disease classification, single biomarkers may lack sufficient discriminating power, so multiple biomarkers must be selected from a large candidate set. Numerous approaches exist, but they work well mainly when cases and controls differ in biomarker means. Biological processes are, however, much more heterogeneous, and differences can also occur in other distributional characteristics (e.g. variance, skewness). Many machine learning techniques are better able to exploit these higher-order distributional differences, sometimes at the cost of explainability. In this study we propose quantile-based prediction (QBP), a binary classification method based on the selection of multiple continuous biomarkers. QBP generates a single score using the tails of the biomarker distributions for cases and controls. This score can then be evaluated by ROC analysis to investigate its predictive power. The performance of QBP is compared to that of supervised learning methods in extensive simulation studies and in two case studies: major depressive disorder (MDD) and trisomy. At the same time, the classification performance of the existing techniques relative to each other is assessed. The key strengths of QBP are the opportunity to select relevant biomarkers and its outstanding classification performance when biomarkers predominantly show variance differences between cases and controls. When only shifts in means were present in the biomarkers, QBP showed inferior performance. Lastly, QBP proved to be unbiased when no disease-relevant biomarkers were present, and it outperformed the other methods on the MDD case study. More research is needed to optimize QBP further, since there are several opportunities to improve its performance. Here we aim to introduce the principle of QBP and show its potential.
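The abstract does not specify how the QBP score is constructed, so the following is only a minimal sketch of the general idea of a tail-based score, not the authors' algorithm. It assumes, purely for illustration, that the score counts how many biomarkers of a subject fall outside the 5% and 95% quantiles of a control sample, and it omits biomarker selection entirely. The function names `tail_score` and `auc`, the quantile levels, and the simulated data are all hypothetical.

```python
import numpy as np

def tail_score(controls_ref, X, lower_q=0.05, upper_q=0.95):
    """Count, per subject, the biomarkers lying outside the control tails.

    Assumption: tail cut-offs are estimated from a reference control sample.
    """
    lo = np.quantile(controls_ref, lower_q, axis=0)
    hi = np.quantile(controls_ref, upper_q, axis=0)
    return ((X < lo) | (X > hi)).sum(axis=1)

def auc(scores_cases, scores_controls):
    """Area under the ROC curve as P(case score > control score), ties count 1/2."""
    diff = scores_cases[:, None] - scores_controls[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Simulated data (hypothetical): cases and controls share the same means,
# but cases have a larger variance in every biomarker.
rng = np.random.default_rng(0)
n, p = 500, 10
controls_ref = rng.normal(0.0, 1.0, size=(n, p))
controls_test = rng.normal(0.0, 1.0, size=(n, p))
cases_test = rng.normal(0.0, 2.0, size=(n, p))

# Tail-based score, evaluated by its AUC.
tail_auc = auc(tail_score(controls_ref, cases_test),
               tail_score(controls_ref, controls_test))

# Mean-oriented comparison score: the plain sum of biomarker values.
sum_auc = auc(cases_test.sum(axis=1), controls_test.sum(axis=1))

print(f"tail-based AUC: {tail_auc:.2f}, sum-based AUC: {sum_auc:.2f}")
```

In this toy setting, a score built from the tails separates the groups while a score built from the raw sums does not, which mirrors the variance-difference scenario described in the abstract. It says nothing about how QBP itself performs.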
