The degree to which subjects differ from each other with respect to properties measured by a set of variables plays an important role in many statistical methods. For example, classification, clustering, and data visualization methods all require a quantification of differences in the observed values. We refer to the quantification of such differences as distance. An appropriate definition of distance depends on the nature of the data and the problem at hand. For numerical variables, many definitions of distance exist that depend on the size of the observed differences. For categorical data, defining a distance is more complex, as there is no straightforward way to quantify the size of the observed differences. Consequently, many proposals exist for measuring differences based on categorical variables. In this paper, we introduce a general framework that allows for an efficient and transparent implementation of distances between observations on categorical variables. We show that several existing distances can be incorporated into the framework. Moreover, our framework leads quite naturally to new distance formulations and allows for the implementation of flexible, case- and data-specific distance definitions. Furthermore, in a supervised classification setting, the framework can be used to construct distances that incorporate the association between the response and predictor variables and hence improve the performance of distance-based classifiers.
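To make the idea of a distance between observations on categorical variables concrete, the following is a minimal sketch and not the framework introduced in the paper. It assumes that the categories of each variable are integer-coded and that the distance between two observations is the sum, over variables, of a per-variable category dissimilarity. The function names and the choice of simple matching (0 for equal categories, 1 otherwise) are illustrative assumptions.

```python
import numpy as np


def simple_matching_dissimilarity(n_categories):
    """Hypothetical helper: category dissimilarities for one variable.

    Returns a (q, q) matrix with 0 on the diagonal (same category)
    and 1 elsewhere (different categories).
    """
    return 1.0 - np.eye(n_categories)


def categorical_distance(X, deltas):
    """Pairwise distances for integer-coded categorical data.

    X      : (n, p) integer array; column j holds codes 0..q_j - 1.
    deltas : list of p arrays, deltas[j] of shape (q_j, q_j), giving the
             dissimilarity between each pair of categories of variable j.

    Assumption: the overall distance is the plain sum of the
    per-variable category dissimilarities.
    """
    X = np.asarray(X)
    n, p = X.shape
    D = np.zeros((n, n))
    for j in range(p):
        codes = X[:, j]
        # Look up the dissimilarity for every pair of observations.
        D += deltas[j][np.ix_(codes, codes)]
    return D


# Three observations on three categorical variables (2, 2, and 3 categories).
X = np.array([[0, 1, 2],
              [0, 1, 0],
              [1, 0, 2]])
deltas = [simple_matching_dissimilarity(q) for q in (2, 2, 3)]
D = categorical_distance(X, deltas)
print(D)
```

With simple matching, each entry of `D` counts the variables on which two observations disagree. Swapping in a different matrix for `deltas[j]` changes how the categories of variable `j` are compared without touching the rest of the computation, which is one way to experiment with alternative categorical distances under the additive assumption stated above.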