Missing data is a fundamental challenge in data science, hindering analysis and decision-making across a wide range of disciplines, including healthcare, bioinformatics, social science, e-commerce, and industrial monitoring. Despite decades of research and numerous imputation methods, the literature remains fragmented across fields, creating a need for a comprehensive synthesis that connects statistical foundations with modern machine learning advances. This work systematically reviews core concepts, including missingness mechanisms, single versus multiple imputation, and different imputation goals, and examines problem characteristics across various domains. It provides a thorough categorization of imputation methods, ranging from classical techniques (e.g., regression, the EM algorithm) to modern approaches such as low-rank and high-rank matrix completion, deep learning models (autoencoders, GANs, diffusion models, graph neural networks), and large language models. Special attention is given to methods for complex data types, such as tensors, time series, streaming data, graph-structured data, categorical data, and multimodal data. Beyond methodology, we investigate the integration of imputation with downstream tasks such as classification, clustering, and anomaly detection, examining both sequential pipelines and joint optimization frameworks. The review also assesses theoretical guarantees, benchmarking resources, and evaluation metrics. Finally, we identify critical challenges and future directions, emphasizing model selection and hyperparameter optimization, the growing importance of privacy-preserving imputation via federated learning, and the pursuit of generalizable models that can adapt across domains and data types. Together, these outline a roadmap for future research.