In cluster analysis, a common first step is to scale the data with the aim of better partitioning them into clusters. Even though many different techniques have been introduced to this end over the years, it is probably fair to say that the workhorse of this preprocessing phase has been to divide the data by the standard deviation along each dimension. Like the standard deviation, the great majority of scaling techniques can be said to have roots in some statistical take on the data. Here we explore the use of the multidimensional shape of the data, aiming to obtain scaling factors for use prior to clustering by some method, such as k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can help determine appropriate scaling factors. Focusing on what might be called "midrange" distances, we formulate a constrained nonlinear-programming problem and use it to produce candidate sets of scaling factors that can then be sifted through further consideration of the data, for example via expert knowledge. We give results on a few iconic data sets, highlighting the strengths and potential weaknesses of the new approach; these results are generally positive across all the data sets used.
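For reference, the following minimal sketch (in Python, using scikit-learn and the Iris data purely as an assumed, illustrative example; the paper's own data sets and its shape-complexity-based scaling are not reproduced here) shows the conventional baseline mentioned above: dividing each dimension by its standard deviation before running a distance-based method such as k-means.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris  # illustrative data set, not necessarily one used in the paper

# Load a small multidimensional data set.
X = load_iris().data

# Conventional baseline: per-dimension scaling factors equal to 1 / standard deviation.
scale = X.std(axis=0, ddof=1)
X_scaled = X / scale

# Any clustering method that relies on inter-sample distances could follow;
# k-means is the example named in the abstract.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))  # resulting cluster sizes
```

The proposed approach would replace the `scale` vector above with factors obtained from the shape-complexity-driven nonlinear program, while the downstream clustering step stays the same.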