Missing value imputation is crucial for real-world data science workflows. Imputation is harder in the online setting, because the imputation method itself must evolve over time. For practical applications, imputation algorithms should produce imputations that match the true data distribution; handle data of mixed types, including ordinal, boolean, and continuous variables; and scale to large datasets. In this work we develop a new online imputation algorithm for mixed data using the Gaussian copula. The online Gaussian copula model meets all of these desiderata: its imputations match the data distribution even for mixed data, it is more accurate than its offline counterpart when the streaming data has a changing distribution, and it is faster (by up to an order of magnitude), especially on large-scale datasets. By fitting the copula model to online data, we also provide a new method to detect change points in the multivariate dependence structure in the presence of missing values. Experimental results on synthetic and real-world data validate the performance of the proposed methods.
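To make the idea of an imputation method that evolves over time concrete, the following is a minimal sketch and not the algorithm proposed in this work. It assumes the data have already been mapped to latent Gaussian scores with zero mean and unit variance, so it omits the marginal transformations needed for ordinal and boolean variables. The function names `impute_latent_row` and `online_update`, the `step` parameter, and the blending rule are hypothetical choices made for illustration only.

```python
import numpy as np


def impute_latent_row(z, sigma):
    """Fill NaN entries of a latent Gaussian row with their conditional mean.

    Assumes z ~ N(0, sigma), so E[z_mis | z_obs] = S_mo @ inv(S_oo) @ z_obs.
    """
    z = np.asarray(z, dtype=float)
    mis = np.isnan(z)
    obs = ~mis
    filled = z.copy()
    if mis.any() and obs.any():
        s_oo = sigma[np.ix_(obs, obs)]  # correlation among observed entries
        s_mo = sigma[np.ix_(mis, obs)]  # correlation of missing with observed
        filled[mis] = s_mo @ np.linalg.solve(s_oo, z[obs])
    elif mis.any():
        filled[mis] = 0.0  # nothing observed: fall back to the prior mean
    return filled


def online_update(sigma, batch, step):
    """Blend the current correlation estimate with one from a new batch.

    Hypothetical simplified rule: impute the batch, take its second moment,
    mix it in with weight `step`, then rescale to a unit diagonal. A full
    EM-style update would also add the conditional covariance of the
    missing entries; that term is left out here for brevity.
    """
    completed = np.array([impute_latent_row(row, sigma) for row in batch])
    batch_cov = completed.T @ completed / len(completed)
    mixed = (1.0 - step) * sigma + step * batch_cov
    d = np.sqrt(np.diag(mixed))
    return mixed / np.outer(d, d)


sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
row = impute_latent_row([2.0, np.nan], sigma)
batch = np.array([[1.0, np.nan], [np.nan, -1.0]])
sigma = online_update(sigma, batch, step=0.1)
```

In this sketch a larger `step` lets the correlation estimate track a changing distribution more quickly, at the cost of noisier estimates; how the step size should be chosen is not something this illustration settles.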