Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels has been proposed as a solution. However, fitting models independently does not make efficient use of all available data. Conversely, fitting a single shared model to the full data set relies on imputation, which often leads to biased results when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which i) makes predictions that are robust to missing values at test time, ii) maintains or improves the predictive power of pattern submodels, and iii) has a short description, enabling improved interpretability. Parameter sharing is enforced through sparsity-inducing regularization, which we prove leads to consistent estimation. Finally, we give conditions for when a sharing model is optimal, even when both missingness and the target outcome depend on unobserved variables. Classification and regression experiments on synthetic and real-world data sets demonstrate that our models achieve a favorable tradeoff between pattern specialization and information sharing.
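To make the idea of sharing across pattern submodels concrete, the following is a minimal sketch and not the exact estimator of this work. It assumes a linear model in which the coefficients for missingness pattern m are a shared vector theta plus a pattern-specific deviation delta_m, with an L1 penalty on the deviations so that most of them shrink to zero and the submodels coincide unless the data support specialization. The objective, the zero-filling of missing entries, the proximal-gradient solver, and the names `fit_sharing_submodels` and `predict` are all illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of the L1 norm: shrinks entries toward zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_sharing_submodels(X, y, lam=0.1, n_iter=500):
    """Hypothetical sketch: shared coefficients plus sparse per-pattern deviations.

    X uses NaN for missing entries. Minimizes (assumed objective)
        1/(2n) * sum_i (x_i . (theta + delta_{m(i)}) - y_i)^2 + lam * sum_m ||delta_m||_1
    by proximal gradient descent, where m(i) is the missingness pattern of row i.
    """
    mask = ~np.isnan(X)                  # True where a feature is observed
    X0 = np.where(mask, X, 0.0)          # zero-fill: missing features contribute nothing
    n, d = X0.shape
    patterns, idx = np.unique(mask, axis=0, return_inverse=True)
    idx = idx.ravel()                    # pattern index of every row
    theta = np.zeros(d)                  # shared coefficients (unpenalized here)
    delta = np.zeros((len(patterns), d)) # pattern-specific deviations (L1-penalized)
    # Conservative step size: the joint gradient is Lipschitz with
    # constant at most 2 * ||X0||_2^2 / n.
    step = n / (2.0 * np.linalg.norm(X0, 2) ** 2)
    for _ in range(n_iter):
        resid = np.sum(X0 * (theta + delta[idx]), axis=1) - y
        g = X0 * resid[:, None] / n      # per-row gradient contributions
        grad_delta = np.zeros_like(delta)
        np.add.at(grad_delta, idx, g)    # accumulate rows into their pattern
        theta = theta - step * g.sum(axis=0)
        delta = soft_threshold(delta - step * grad_delta, step * lam)
    return theta, delta, patterns

def predict(X, theta, delta, patterns):
    # Use the submodel of the matching pattern; for a pattern not seen in
    # training, fall back to the shared coefficients (an assumption).
    mask = ~np.isnan(X)
    X0 = np.where(mask, X, 0.0)
    out = np.empty(len(X0))
    for i, (row, m) in enumerate(zip(X0, mask)):
        match = np.flatnonzero((patterns == m).all(axis=1))
        coef = theta + delta[match[0]] if match.size else theta
        out[i] = row @ coef
    return out
```

In this sketch, a large `lam` drives every deviation to zero and recovers a single shared model, while `lam = 0` lets each pattern fit its own coefficients on its observed features; intermediate values trade off pattern specialization against information sharing.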