We present a robust framework for linear regression when some feature entries are missing. Assuming an elliptical data distribution, and specifically a multivariate normal model, we formulate a conditional distribution for the missing entries and use it to build a robust formulation that minimizes the worst-case error caused by uncertainty about the missing data. We show that the proposed formulation, which naturally accounts for the dependency between variables, ultimately reduces to a convex program for which a customized and scalable solver can be derived. In addition to the detailed analysis behind this solver, we analyze the asymptotic behavior of the proposed framework and discuss how to estimate the required input parameters. We complement our analysis with experiments on synthetic, semi-synthetic, and real data, and show that the proposed formulation improves prediction accuracy and robustness and outperforms competing techniques. Missing data is a common problem in many machine learning datasets. With the significant increase in the use of robust optimization techniques to train machine learning models, this paper presents a novel robust regression framework that makes it possible to train models on incomplete data while minimizing the impact of the uncertainty associated with the unavailable entries. The ideas developed in this paper can be generalized beyond linear models and elliptical data distributions.
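The conditioning step mentioned in the abstract can be illustrated with the standard formula for a partitioned multivariate normal. The sketch below is not the paper's robust formulation or its solver; it only shows how, under an assumed known mean `mu` and covariance `sigma`, the missing entries of one feature vector have a normal distribution conditioned on the observed entries. The function name `conditional_gaussian` and its arguments are hypothetical and chosen for illustration.

```python
import numpy as np

def conditional_gaussian(mu, sigma, x, missing):
    """Mean and covariance of the missing entries of x given the observed
    ones, assuming x ~ N(mu, sigma). Illustrative helper, not from the paper."""
    missing = np.asarray(missing, dtype=bool)
    obs = ~missing
    s_oo = sigma[np.ix_(obs, obs)]          # observed-observed block
    s_mo = sigma[np.ix_(missing, obs)]      # missing-observed block
    s_mm = sigma[np.ix_(missing, missing)]  # missing-missing block
    # gain = s_mo @ inv(s_oo), computed with a linear solve (s_oo is symmetric)
    gain = np.linalg.solve(s_oo, s_mo.T).T
    cond_mean = mu[missing] + gain @ (x[obs] - mu[obs])
    cond_cov = s_mm - gain @ s_mo.T
    return cond_mean, cond_cov

# Two correlated features; the first one is missing, the second is observed.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
x = np.array([np.nan, 2.0])
mean, cov = conditional_gaussian(mu, sigma, x, missing=[True, False])
print(mean, cov)  # conditional mean 1.0, conditional variance 0.75
```

The conditional covariance is what quantifies the remaining uncertainty about the missing entries. A robust formulation of the kind described above would take such a distribution as input when bounding the worst-case regression error.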