Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels has been proposed as a solution. However, independent models do not make efficient use of all available data. Conversely, fitting a shared model to the full data set typically relies on imputation, which may be suboptimal when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which a) makes predictions that are robust to missing values at test time, b) maintains or improves the predictive power of pattern submodels, and c) has a short description, enabling improved interpretability. We identify cases where sharing is provably optimal, even when missingness itself is predictive and when the prediction target depends on unobserved variables. Classification and regression experiments on synthetic data and two healthcare data sets demonstrate that our models achieve a favorable trade-off between pattern specialization and information sharing.
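To make the idea of sharing between pattern submodels concrete, the following is a minimal sketch and not necessarily the formulation used in the paper. It assumes a linear regression model in which every missingness pattern m predicts with coefficients theta + delta_m restricted to its observed features, where theta is shared across patterns and the pattern-specific deviations delta_m are shrunk toward zero. The function names, the squared (ridge) penalty on the deviations, and the parameters `lam` and `eps` are illustrative assumptions chosen so that the fit has a closed form.

```python
import numpy as np


def fit_shared_pattern_model(X, y, lam=1.0, eps=1e-8):
    """Fit shared coefficients plus penalized per-pattern deviations.

    X may contain NaN for missing entries. Assumption: a ridge penalty
    `lam` on the deviations and a tiny penalty `eps` on the shared part.
    """
    mask = np.isnan(X)
    X0 = np.where(mask, 0.0, X)  # zeros switch off unobserved features
    patterns, pattern_id = np.unique(mask, axis=0, return_inverse=True)
    pattern_id = pattern_id.ravel()
    n, d = X.shape
    k = len(patterns)

    # Design matrix: one shared block followed by one block per pattern.
    Z = np.zeros((n, d * (k + 1)))
    Z[:, :d] = X0
    for m in range(k):
        rows = pattern_id == m
        Z[rows, d * (m + 1):d * (m + 2)] = X0[rows]

    # Penalize deviations strongly and the shared coefficients barely.
    penalty = np.concatenate([np.full(d, eps), np.full(d * k, lam)])
    coef = np.linalg.solve(Z.T @ Z + np.diag(penalty), Z.T @ y)
    theta = coef[:d]
    deltas = coef[d:].reshape(k, d)
    return theta, deltas, patterns


def predict(X, theta, deltas, patterns):
    """Predict without imputation; unseen patterns use theta alone."""
    mask = np.isnan(X)
    X0 = np.where(mask, 0.0, X)
    preds = X0 @ theta
    for m, p in enumerate(patterns):
        rows = (mask == p).all(axis=1)
        preds[rows] += X0[rows] @ deltas[m]
    return preds
```

In this sketch, a large `lam` pushes all patterns toward a single shared model, while a small `lam` approaches independent pattern submodels. There is no intercept here; letting missingness itself be predictive would require, for example, appending a constant column to `X` so that each pattern can learn its own offset.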