In cluster analysis, a common first step is to scale the data so as to better partition them into clusters. Although many different techniques have been introduced to this end over the years, it is probably fair to say that the workhorse of this preprocessing phase has been division of the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques have roots in some sort of statistical view of the data. Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering by a method, such as k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that, as we show, can help determine appropriate scaling factors. Focusing on what might be called "midrange" distances, we formulate a constrained nonlinear-programming problem and use it to produce candidate sets of scaling factors that can be sifted on the basis of further considerations of the data, say via expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach; these results are generally positive across all the data sets used.
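As a minimal sketch of the baseline preprocessing the abstract refers to, the following illustrates per-dimension division by the standard deviation prior to distance-based clustering. The array `X` is toy data invented for illustration, not data from the paper:

```python
import numpy as np

# Toy data (hypothetical): two dimensions with very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])

# Baseline scaling: divide each dimension by its standard deviation,
# so that no single dimension dominates the distances used by a
# method such as k-means.
scale = X.std(axis=0)
X_scaled = X / scale

# After scaling, every dimension has unit standard deviation.
print(np.allclose(X_scaled.std(axis=0), 1.0))  # True
```

The approach described in the abstract would replace these statistically derived factors with ones obtained from a shape-complexity-based nonlinear program.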