There are many distance-based methods for classification and clustering, and for data with many variables and few observations, working with distances is computationally advantageous compared to working with the raw data matrix. The Euclidean distance is the default choice for continuous multivariate data, but there are alternatives. Here the so-called Minkowski distances, the $L_1$ (city block), $L_2$ (Euclidean), $L_3$, $L_4$, and maximum distances, are combined with different schemes for standardising the variables before aggregating them. The boxplot transformation is proposed, a new transformation for a single variable that standardises the majority of observations while bringing outliers closer to the main bulk of the data. The distances are compared in simulations of data with few observations but high dimensionality, for clustering by partitioning around medoids, complete and average linkage, and classification by nearest neighbours. The $L_1$-distance and the boxplot transformation show good results.
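To make the setup concrete, the following is a minimal sketch of the two ingredients named above: a per-variable standardisation and Minkowski $L_p$ distances between observations. The boxplot-style transform shown here is an assumption-based illustration of the idea described in the abstract (centre at the median, scale by the interquartile range, shrink values beyond the usual $1.5\times$IQR whiskers towards the bulk); it is not the paper's precise boxplot transformation, and all function names are hypothetical.

\begin{verbatim}
import numpy as np


def boxplot_style_transform(x):
    """Standardise one variable; compress outliers beyond the whiskers.

    Hypothetical illustration of the idea in the abstract, not the
    paper's exact boxplot transformation.
    """
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1 if q3 > q1 else 1.0
    z = (x - med) / iqr                               # bulk standardised
    lo, hi = -1.5, 1.5                                # whisker limits
    z = np.where(z > hi, hi + np.log1p(z - hi), z)    # shrink upper outliers
    z = np.where(z < lo, lo - np.log1p(lo - z), z)    # shrink lower outliers
    return z


def minkowski_distances(X, p=1.0):
    """Pairwise L_p distances between rows of X (p=np.inf for maximum)."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    if np.isinf(p):
        return diff.max(axis=2)
    return (diff ** p).sum(axis=2) ** (1.0 / p)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 500))            # few observations, many variables
    X[0, :5] += 20                            # a few gross outliers
    Xt = np.apply_along_axis(boxplot_style_transform, 0, X)
    D1 = minkowski_distances(Xt, p=1)         # city block
    Dmax = minkowski_distances(Xt, p=np.inf)  # maximum distance
    print(D1.shape, Dmax.shape)
\end{verbatim}

The resulting distance matrices can then be passed to any distance-based method, e.g. partitioning around medoids, complete or average linkage, or nearest-neighbour classification.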