How can one learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and then learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that, for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all missing-values mechanisms, in contrast with classic statistical results that require missing-at-random settings to use imputation in probabilistic modeling. Moreover, it implies that perfect conditional imputation is not needed for good prediction asymptotically. In fact, we show that on perfectly imputed data the best regression function will generally be discontinuous, which makes it hard to learn. Crafting the imputation instead so as to leave the regression function unchanged simply shifts the problem to learning discontinuous imputations. Rather, we suggest that it is easier to learn imputation and regression jointly. We propose such a procedure, adapting NeuMiss, a neural network that captures the conditional links across observed and unobserved variables, whatever the missing-value pattern. Our experiments with finite numbers of samples confirm that joint imputation and regression through NeuMiss outperforms various two-step procedures.
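As a concrete illustration of the two-step baseline discussed above, here is a minimal impute-then-regress sketch in scikit-learn. The simulated data, the iterative conditional imputer, and the gradient-boosted learner are illustrative assumptions, not the paper's exact experimental setup; a "powerful learner" in the sense of the result could be any sufficiently flexible regressor.

```python
# Minimal impute-then-regress sketch (illustrative, not the paper's setup):
# step 1 imputes missing entries, step 2 regresses on the completed data.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
X[rng.random(X.shape) < 0.3] = np.nan  # make ~30% of the entries missing

model = make_pipeline(
    IterativeImputer(random_state=0),          # impute as well as possible
    HistGradientBoostingRegressor(random_state=0),  # then learn on completed data
)
model.fit(X, y)
print(model.score(X, y))
```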
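For the joint approach, the following is a hedged PyTorch sketch of a NeuMiss-style block trained end to end with the regression head. The shared-weight Neumann-style iteration, the depth, and the initialization are simplifying assumptions loosely modeled on the NeuMiss architecture, not the authors' exact implementation; the mask multiplication after each linear step is the mechanism that lets one set of weights adapt to any missing-value pattern.

```python
# Hedged sketch of a NeuMiss-style block (assumed simplification, not the
# reference implementation): Neumann-like iterations with a mask nonlinearity.
import torch
import torch.nn as nn

class NeuMissBlock(nn.Module):
    def __init__(self, n_features: int, depth: int = 3):
        super().__init__()
        self.depth = depth
        self.mu = nn.Parameter(torch.zeros(n_features))  # feature means
        self.W = nn.Parameter(torch.eye(n_features))     # shared across iterations

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m = (~torch.isnan(x)).float()           # 1 where observed, 0 where missing
        x = torch.nan_to_num(x) - self.mu * m   # zero-fill and center observed entries
        h = x
        # Iterate a Neumann-series-like recursion; masking after each linear
        # step adapts the computation to the missingness pattern.
        for _ in range(self.depth):
            h = m * (h @ self.W.T) + x
        return h

# The block feeds a regression head and the whole model is trained end to end
# on (X, y) with a standard loss, learning imputation and regression jointly.
model = nn.Sequential(NeuMissBlock(n_features=5), nn.Linear(5, 1))
```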