Missing values are a common issue in real-world datasets. The gold standard for dealing with missing data in inference is to assume that the data is missing at random and apply an impute-then-estimate procedure. In this paper, we evaluate the relevance of the assumptions and methods developed in inference for prediction tasks. We first provide a theoretical analysis of impute-then-regress methods and highlight their successes and failures in making accurate predictions. We propose adaptive linear regression, a new class of models that adapt to the set of available features and can be applied directly to partially observed data. We show that adaptive linear regression can be equivalent to impute-then-regress methods in which the imputation and the linear regression models are learned simultaneously instead of sequentially. We leverage this joint-impute-then-regress interpretation to generalize our framework to non-linear models. We validate our theoretical findings and adaptive regression approaches with extensive numerical results on synthetic, semi-synthetic, and real-world datasets. Among other results, in settings where the data is strongly not missing at random, our methods achieve a 6% improvement in out-of-sample accuracy.
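To make the contrast between sequential impute-then-regress and a model that adapts to the set of available features concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the formulation from the paper: the abstract does not give the model's exact form, so the sketch assumes that "adaptive" means the linear coefficients may shift with the missingness pattern, implemented through mask indicators and feature-by-mask interaction terms. All function names (`mean_impute_then_regress`, `adaptive_features`, `fit_adaptive`, and so on) are hypothetical, and mean imputation stands in for whatever imputation method one would actually use.

```python
import numpy as np


def mean_impute_then_regress(X, y):
    """Sequential baseline: fill NaNs with column means, then fit OLS."""
    mu = np.nanmean(X, axis=0)                 # per-feature mean over observed entries
    X_imp = np.where(np.isnan(X), mu, X)       # imputation step
    A = np.column_stack([np.ones(len(X_imp)), X_imp])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # regression step
    return mu, coef


def predict_impute(mu, coef, X):
    """Apply the stored means, then the fitted linear model."""
    X_imp = np.where(np.isnan(X), mu, X)
    return coef[0] + X_imp @ coef[1:]


def adaptive_features(X):
    """Assumed design matrix for a mask-dependent linear model.

    Columns: intercept, zero-filled features, missingness indicators,
    and interactions x_j * m_k, which let the weight on feature j
    change when feature k is missing.
    """
    M = np.isnan(X).astype(float)              # 1.0 where a value is missing
    X0 = np.where(np.isnan(X), 0.0, X)         # missing entries contribute nothing
    inter = (X0[:, :, None] * M[:, None, :]).reshape(len(X), -1)
    return np.column_stack([np.ones(len(X)), X0, M, inter])


def fit_adaptive(X, y):
    """Fit all coefficients jointly by least squares on partially observed X."""
    coef, *_ = np.linalg.lstsq(adaptive_features(X), y, rcond=None)
    return coef


def predict_adaptive(coef, X):
    """Predict directly from partially observed inputs, no imputation step."""
    return adaptive_features(X) @ coef
```

In this sketch, a mean-imputed feature equals the zero-filled feature plus its mean times the missingness indicator, so the baseline's predictions lie in the span of the adaptive design matrix. The adaptive fit therefore cannot have a larger training error than the baseline; whether it generalizes better out of sample depends on the data and the missingness mechanism.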