Missing information is inevitable in real-world data sets. While imputation is well-suited and theoretically sound for statistical inference, its relevance and practical implementation for out-of-sample prediction remain unsettled. We provide a theoretical analysis of widely used data imputation methods and highlight their key deficiencies in making accurate predictions. As an alternative, we propose adaptive linear regression, a new class of models that can be directly trained and evaluated on partially observed data, adapting to the set of available features. In particular, we show that certain adaptive regression models are equivalent to impute-then-regress methods where the imputation and the regression models are learned simultaneously instead of sequentially. We validate our theoretical findings and adaptive regression approach with numerical results on real-world data sets.
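To make the contrast concrete, the following is a minimal sketch, not the authors' implementation. It assumes one possible form of adaptive linear regression in which each coefficient is an affine function of the missingness pattern, and it compares that model with mean-impute-then-regress on synthetic data. The data-generating process, the 30% missingness rate, and all function names are hypothetical choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y is linear in 3 features, two of them correlated.
n, d = 2000, 3
cov = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
X_full = rng.multivariate_normal(np.zeros(d), cov, size=n)
y = X_full @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.standard_normal(n)

# Hide each entry independently with probability 0.3 (an assumption).
X_obs = np.where(rng.random((n, d)) < 0.3, np.nan, X_full)

X_train, X_test = X_obs[:1500], X_obs[1500:]
y_train, y_test = y[:1500], y[1500:]


def design_impute(X, means):
    """Impute-then-regress design: fill gaps with fixed means, add intercept."""
    X_imp = np.where(np.isnan(X), means, X)
    return np.column_stack([np.ones(len(X)), X_imp])


def design_adaptive(X):
    """Assumed affine-adaptive design.

    Columns: intercept, zero-filled features, missingness indicators, and
    all products feature_j * indicator_k, so that the effective coefficient
    of feature j can shift with the set of missing features.
    """
    M = np.isnan(X).astype(float)           # M[i, k] = 1 if feature k is missing
    X0 = np.nan_to_num(X, nan=0.0)          # missing entries contribute zero
    inter = (X0[:, :, None] * M[:, None, :]).reshape(len(X), -1)
    return np.column_stack([np.ones(len(X)), X0, M, inter])


def fit(A, target):
    """Least-squares fit (minimum-norm solution if A is rank deficient)."""
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef


def mse(A, coef, target):
    return float(np.mean((A @ coef - target) ** 2))


# Sequential approach: imputation values are fixed before the regression is fit.
means = np.nanmean(X_train, axis=0)
A_imp_train = design_impute(X_train, means)
A_imp_test = design_impute(X_test, means)
coef_imp = fit(A_imp_train, y_train)

# Adaptive approach: trained directly on the partially observed data.
A_ada_train = design_adaptive(X_train)
A_ada_test = design_adaptive(X_test)
coef_ada = fit(A_ada_train, y_train)

train_mse_impute = mse(A_imp_train, coef_imp, y_train)
train_mse_adaptive = mse(A_ada_train, coef_ada, y_train)
test_mse_impute = mse(A_imp_test, coef_imp, y_test)
test_mse_adaptive = mse(A_ada_test, coef_ada, y_test)

print("train MSE  impute-then-regress:", train_mse_impute)
print("train MSE  adaptive:           ", train_mse_adaptive)
print("test MSE   impute-then-regress:", test_mse_impute)
print("test MSE   adaptive:           ", test_mse_adaptive)
```

In this sketch the mean-imputed feature equals the zero-filled feature plus a constant times its missingness indicator, so the impute-then-regress model is nested inside the adaptive design and cannot achieve a lower training error. Keeping only the intercept, zero-filled features, and indicator columns would correspond to a regression in which a constant imputation value per feature is learned jointly with the coefficients, which is one simple reading of the equivalence stated in the abstract. Out-of-sample behavior depends on the data and is not guaranteed by this argument.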