Missing values arise in most real-world data sets, owing to the aggregation of multiple sources and to intrinsically missing information (sensor failures, unanswered survey questions, etc.). In fact, the very nature of missing values usually prevents standard learning algorithms from being run directly. In this paper, we focus on the extensively studied linear models, but in the presence of missing values, which turns out to be quite a challenging task. Indeed, the Bayes predictor can be decomposed as a sum of predictors, one for each missing-data pattern. This eventually requires solving a number of learning tasks that is exponential in the number of input features, which makes prediction intractable for current real-world datasets. First, we propose a rigorous setting to analyze a least-squares-type estimator and establish a bound on the excess risk that increases exponentially with the dimension. Consequently, we leverage the missing-data distribution to propose a new algorithm, and derive associated adaptive risk bounds that turn out to be minimax optimal. Numerical experiments highlight the benefits of our method compared with state-of-the-art algorithms for prediction with missing values.
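A minimal sketch, not the paper's algorithm, of the pattern-by-pattern least-squares idea described above: fit one ordinary-least-squares model per observed missing-data pattern, using only the observed coordinates. With d features there are up to 2**d patterns, which illustrates why the naive decomposition scales exponentially in the dimension. All data, names, and the masking rate here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 3
X_full = rng.normal(size=(n, d))
y = X_full @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Introduce values missing completely at random (True = missing).
mask = rng.random((n, d)) < 0.2
X = np.where(mask, np.nan, X_full)

# One OLS model per missing pattern, on the observed coordinates plus
# an intercept; up to 2**d such models in general.
models = {}
for pattern in {tuple(row) for row in mask}:
    rows = np.all(mask == np.array(pattern), axis=1)
    obs = [j for j in range(d) if not pattern[j]]
    Z = np.column_stack([np.ones(rows.sum()), X[rows][:, obs]])
    coef, *_ = np.linalg.lstsq(Z, y[rows], rcond=None)
    models[pattern] = (obs, coef)

def predict(x_row):
    """Predict with the model matching x_row's missing pattern."""
    pattern = tuple(bool(b) for b in np.isnan(x_row))
    obs, coef = models[pattern]
    return coef[0] + x_row[obs] @ coef[1:]

print("patterns fitted:", len(models))
```

Each per-pattern model approximates the conditional expectation of y given the observed coordinates for that pattern; the paper's contribution is precisely to avoid fitting all 2**d models explicitly by exploiting the missing-data distribution.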