Values in a dataset can be missing or anomalous because of mishandling or human error. Analysing data with missing values can introduce bias and distort the resulting inferences. Several analysis methods, such as principal component analysis and singular value decomposition, require complete data. Many approaches impute numeric data, and some ignore the dependency of attributes on other attributes, while others require human intervention and domain knowledge. We present a new algorithm for data imputation that handles values of different data types and the association constraints among them in the data, which no existing system currently handles. We report experimental results that compare our algorithm with state-of-the-art imputation techniques under several metrics. Our algorithm not only imputes the missing values but also generates human-readable explanations describing the significance of the attributes used for every imputation.