We present DeepMVI, a deep learning method for missing value imputation in multidimensional time-series datasets. Missing values are commonplace in decision support platforms that aggregate data over long time stretches from disparate sources, and reliable data analytics calls for careful handling of missing data. One strategy is to impute the missing values, and a wide variety of algorithms exist, spanning simple interpolation, matrix factorization methods like SVD, statistical models like Kalman filters, and recent deep learning methods. We show that these often provide worse results on aggregate analytics than simply excluding the missing data. DeepMVI uses a neural network to combine fine-grained and coarse-grained patterns along a time series with trends from related series across categorical dimensions. After failing with off-the-shelf neural architectures, we design our own network that includes a temporal transformer with a novel convolutional window feature, and kernel regression with learned embeddings. The parameters and their training are designed carefully to generalize across different placements of missing blocks and data characteristics. Experiments on nine real datasets and four missing-value scenarios, comparing against seven existing methods, show that DeepMVI is significantly more accurate, reducing error by more than 50% in more than half the cases compared to the best existing method. Although DeepMVI is slower than simpler matrix factorization methods, we justify the increased time overhead by showing that it is the only option that provides overall more accurate analytics than dropping missing values.