Nearest neighbor methods have become popular in official statistics, mainly in imputation and statistical matching problems; they also play a key role in machine learning, where many variants have been proposed. The choice of the distance function depends mainly on the type of the selected variables. Unfortunately, relatively few options can handle mixed-type variables, a situation frequently encountered in official statistics. The most popular distance for mixed-type variables is derived as the complement of Gower's similarity coefficient; it is appealing because it ranges between 0 and 1 and can handle missing values. Unfortunately, in the unweighted standard setting, the contribution of the single variables to the overall Gower's distance is unbalanced because of the different nature of the variables themselves. This article tries to address the main drawbacks that affect the overall unweighted Gower's distance by suggesting some modifications to how the distance is calculated on the interval and ratio scaled variables. Simple modifications try to attenuate the impact of outliers on the scaled Manhattan distance; other modifications, relying on kernel density estimation methods, attempt to reduce the unbalanced contribution of the different types of variables. The performance of the proposals is evaluated in simulations mimicking the imputation of missing values through the nearest neighbor distance hot deck method.
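For orientation, the unweighted Gower's distance discussed above can be sketched as follows. This is a minimal illustration of the standard definition only (range-scaled absolute difference for numeric variables, simple mismatch for categorical ones, variables missing in either record skipped); it does not implement the modifications proposed in the article. The function names `gower_distance` and `nearest_donor`, the use of `None` for missing values, and the toy records are assumptions made for this example.

```python
import math

def gower_distance(x, y, is_numeric, ranges):
    """Unweighted Gower distance between two records (illustrative sketch).

    x, y       : sequences of values; None marks a missing value.
    is_numeric : one bool per variable (True = interval/ratio scaled).
    ranges     : max - min of each numeric variable in the data;
                 the entry is ignored for categorical variables.
    """
    total, used = 0.0, 0
    for xk, yk, numeric, rk in zip(x, y, is_numeric, ranges):
        if xk is None or yk is None:
            continue  # a variable missing in either record does not contribute
        if numeric:
            # scaled Manhattan component: |x - y| / range, lies in [0, 1]
            d = abs(xk - yk) / rk if rk > 0 else 0.0
        else:
            # categorical component: 0 if equal, 1 otherwise
            d = 0.0 if xk == yk else 1.0
        total += d
        used += 1
    # average over the variables that could be compared
    return total / used if used else math.nan

def nearest_donor(recipient, donors, is_numeric, ranges):
    """Index of the donor closest to the recipient (hot deck style lookup)."""
    def key(i):
        d = gower_distance(recipient, donors[i], is_numeric, ranges)
        return math.inf if math.isnan(d) else d  # incomparable donors go last
    return min(range(len(donors)), key=key)

# Toy records: (age, income, region); income is missing for the recipient.
is_numeric = (True, True, False)
ranges = (50.0, 9000.0, 0.0)
recipient = (30, None, "A")
donors = [(32, 1200.0, "A"), (60, 8000.0, "B")]

best = nearest_donor(recipient, donors, is_numeric, ranges)
imputed_income = donors[best][1]  # value copied from the closest donor
```

Because a categorical mismatch always contributes the maximum value 1 while a range-scaled numeric difference is typically much smaller, the categorical variables tend to dominate the average; this is the unbalanced contribution that the abstract refers to.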