Many real-world datasets contain missing entries and mixed data types, including categorical and ordered (e.g., continuous and ordinal) variables. Imputing the missing entries is necessary because many data analysis pipelines require complete data, but imputation is challenging, especially for mixed data. This paper proposes a probabilistic imputation method using an extended Gaussian copula model that supports both single and multiple imputation. The method models mixed categorical and ordered data using a latent Gaussian distribution. The unordered nature of categorical variables is modeled explicitly using the argmax operator. The method makes no assumptions about the marginal distributions of the data and requires no hyperparameter tuning. Experimental results on synthetic and real datasets show that imputation with the extended Gaussian copula outperforms current state-of-the-art methods for both categorical and ordered variables in mixed data.
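To make the latent-variable idea concrete, the following is a minimal sketch of how mixed data could be generated from a latent Gaussian vector, assuming a simplified parametrization that is not taken from the paper. The function name `sample_mixed`, the correlation value, the ordinal cut points, and the categorical shift vector are all hypothetical choices made for illustration. A continuous variable is a strictly increasing transform of one latent coordinate, an ordinal variable thresholds one latent coordinate, and a categorical variable with three classes takes the argmax over three latent coordinates, so no ordering is imposed on its classes.

```python
import numpy as np


def sample_mixed(n, seed=0):
    """Draw n rows of (continuous, ordinal, categorical) data from a latent Gaussian.

    Illustrative assumptions only: 5 latent dimensions, one for the continuous
    variable, one for the ordinal variable, three for a 3-class categorical.
    """
    rng = np.random.default_rng(seed)

    # Latent correlation: the continuous and ordinal coordinates are correlated;
    # the categorical block is left uncorrelated for simplicity.
    corr = np.eye(5)
    corr[0, 1] = corr[1, 0] = 0.6
    z = rng.multivariate_normal(np.zeros(5), corr, size=n)

    # Continuous: any strictly increasing map of the latent value (here exp).
    continuous = np.exp(z[:, 0])

    # Ordinal: threshold the latent value at hypothetical cut points -> levels 0, 1, 2.
    ordinal = np.digitize(z[:, 1], bins=[-0.5, 0.5])

    # Categorical: argmax over shifted latent coordinates; the shift (hypothetical)
    # controls class frequencies, and the argmax carries no notion of class order.
    shift = np.array([0.0, 0.3, -0.3])
    categorical = np.argmax(z[:, 2:] + shift, axis=1)

    return continuous, ordinal, categorical


if __name__ == "__main__":
    cont, ordi, cat = sample_mixed(5)
    print(cont, ordi, cat)
```

Under these assumptions, imputation would amount to inferring the unobserved latent coordinates given the observed ones and mapping them back through the same transforms. That inference step is not shown here.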