Machine learning brings the hope of finding new biomarkers extracted from cohorts with rich biomedical measurements. A good biomarker is one that gives reliable detection of the corresponding condition. However, biomarkers are often extracted from a cohort that differs from the target population. Such a mismatch, known as a dataset shift, can undermine the application of the biomarker to new individuals. Dataset shifts are frequent in biomedical research, e.g. because of recruitment biases. When a dataset shift occurs, standard machine-learning techniques do not suffice to extract and validate biomarkers. This article provides an overview of when and how dataset shifts break machine-learning-extracted biomarkers, as well as detection and correction strategies.