Missing data is common in datasets from many fields, such as medicine, sports, and finance. To enable reliable analysis of such data, the missing values are often imputed, and the imputation method should yield a low root mean square error (RMSE) between the imputed and the true values. For some critical applications, the logic behind the imputation must also be explainable, which is especially difficult for complex methods such as those based on deep learning. This motivates us to introduce the conditional Distribution-based Imputation of Missing Values (DIMV) algorithm. DIMV finds the conditional distribution of a feature with missing entries given the fully observed features. As illustrated in the paper, DIMV (i) achieves a low RMSE for the imputed values compared with the state-of-the-art methods under comparison; (ii) is explainable; (iii) can provide an approximate confidence region for the missing values in a given sample; (iv) works for both small- and large-scale data; (v) in many scenarios, does not require as many parameters as deep learning approaches and can therefore be used on mobile devices or in web browsers; and (vi) is robust to violations of the normality assumption on which its theoretical grounds rely. In addition to DIMV, we introduce the DPER* algorithm, which improves the speed of DPER for estimating the mean and covariance matrix from the data, and we confirm the speed-up via experiments.
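To make the idea of imputing from a conditional distribution concrete, the following is a minimal sketch of plain conditional-Gaussian imputation for a single sample. It is not the DIMV algorithm itself: it assumes the mean `mu` and covariance `sigma` are already known (in the paper they are estimated from the data), it assumes at least one feature of the sample is observed, and the function name `conditional_gaussian_impute` and the parameter `z` are hypothetical names chosen for illustration.

```python
import numpy as np


def conditional_gaussian_impute(x, mu, sigma, z=1.96):
    """Fill NaN entries of one sample under an assumed Gaussian N(mu, sigma).

    Returns the filled sample and, for each missing feature, the half-width
    of an approximate interval (z times the conditional standard deviation).
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    miss = np.isnan(x)   # mask of missing features
    obs = ~miss          # mask of observed features
    if not miss.any():
        return x.copy(), np.zeros(0)

    # Partition the covariance matrix into observed/missing blocks.
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(miss, obs)]
    s_mm = sigma[np.ix_(miss, miss)]

    # w = S_mo @ inv(S_oo), computed with a linear solve instead of an inverse.
    w = np.linalg.solve(s_oo, s_mo.T).T

    # Conditional mean and covariance of the missing block given the observed one.
    cond_mean = mu[miss] + w @ (x[obs] - mu[obs])
    cond_cov = s_mm - w @ s_mo.T

    filled = x.copy()
    filled[miss] = cond_mean
    # Clip tiny negative variances that can arise from round-off.
    half_width = z * np.sqrt(np.clip(np.diag(cond_cov), 0.0, None))
    return filled, half_width


# Toy example: two features with correlation 0.8; the second one is missing.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
filled, half_width = conditional_gaussian_impute([1.0, np.nan], mu, sigma)
print(filled, half_width)
```

In this toy case the imputed value is the conditional mean 0.8, and the interval half-width is 1.96 times the conditional standard deviation 0.6. Because the imputed value is a linear function of the observed features, each feature's contribution can be read directly from the weights `w`, which illustrates how an approach of this kind can be both explainable and able to report an approximate confidence region.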