Multi-output regression seeks to borrow strength and leverage commonalities across different but related outputs in order to enhance learning and prediction accuracy. A fundamental assumption is that the output/group membership labels for all observations are known. This assumption is often violated in real applications. For instance, in healthcare datasets, sensitive attributes such as ethnicity are often missing or unreported. To this end, we introduce a weakly-supervised multi-output model based on dependent Gaussian processes. Our approach is able to leverage data without complete group labels or possibly only prior belief on group memberships to enhance accuracy across all outputs. Through intensive simulations and case studies on an Insulin, Testosterone and Bodyfat dataset, we show that our model excels in multi-output settings with missing labels, while being competitive in traditional fully labeled settings. We end by highlighting the possible use of our approach in fair inference and sequential decision-making.
翻译:多产出回归试图借取力量,利用不同但相关产出的共性,以提高学习和预测的准确性。一个基本假设是,所有观测结果的输出/组成员标签都为人所知。这一假设常常在实际应用中被违反。例如,在保健数据集中,族裔等敏感属性往往缺失或未报告。为此,我们引入了基于依赖高斯进程、监管薄弱的多产出模型。我们的方法是能够利用数据,而没有完整的组标签,或者可能只是事先对组成员持有信念,以提高所有产出的准确性。我们通过对因苏林、睾丸酮和体型脂肪数据集的密集模拟和案例研究,显示我们的模型在多产出设置中优异于缺失的标签,同时在传统的全标签环境中具有竞争力。我们最后强调在公平推论和顺序决策中可能使用我们的方法。